Spark 2
Alexey Zinovyev, Java/BigData Trainer in EPAM
About
With IT since 2007
With Java since 2009
With Hadoop since 2012
With EPAM since 2015
3Spark 2 from Zinoviev Alexey
Contacts
E-mail : Alexey_Zinovyev@epam.com
Twitter : @zaleslaw @BigDataRussia
Facebook: https://www.facebook.com/zaleslaw
vk.com/big_data_russia Big Data Russia
vk.com/java_jvm Java & JVM langs
4Spark 2 from Zinoviev Alexey
Sprk Dvlprs! Let’s start!
5Spark 2 from Zinoviev Alexey
< SPARK 2.0
6Spark 2 from Zinoviev Alexey
Modern Java in 2016
Big Data in 2014
7Spark 2 from Zinoviev Alexey
Big Data in 2017
8Spark 2 from Zinoviev Alexey
Machine Learning EVERYWHERE
9Spark 2 from Zinoviev Alexey
Machine Learning vs Traditional Programming
10Spark 2 from Zinoviev Alexey
11Spark 2 from Zinoviev Alexey
Something wrong with HADOOP
12Spark 2 from Zinoviev Alexey
Hadoop is not SEXY
13Spark 2 from Zinoviev Alexey
Whaaaat?
14Spark 2 from Zinoviev Alexey
Map Reduce Job Writing
15Spark 2 from Zinoviev Alexey
MR code
16Spark 2 from Zinoviev Alexey
Hadoop Developers Right Now
17Spark 2 from Zinoviev Alexey
Iterative Calculations
10x – 100x
18Spark 2 from Zinoviev Alexey
MapReduce vs Spark
19Spark 2 from Zinoviev Alexey
MapReduce vs Spark
20Spark 2 from Zinoviev Alexey
MapReduce vs Spark
21Spark 2 from Zinoviev Alexey
SPARK INTRO
22Spark 2 from Zinoviev Alexey
Loading
// from a local collection
val localData = List(5, 7, 1, 12, 10, 25)
val ourFirstRDD = sc.parallelize(localData)
// from a file
val textFile = sc.textFile("hdfs://...")
27Spark 2 from Zinoviev Alexey
Word
Count
val textFile = sc.textFile("hdfs://...")
val counts = textFile
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
30Spark 2 from Zinoviev Alexey
SQL
val hive = new HiveContext(spark)
hive.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hive.hql("LOAD DATA LOCAL INPATH '…/kv1.txt' INTO TABLE src")
val results = hive.hql("FROM src SELECT key, value").collect()
34Spark 2 from Zinoviev Alexey
RDD Factory
35Spark 2 from Zinoviev Alexey
DStream
val conf = new SparkConf().setMaster("local[2]")
.setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
40Spark 2 from Zinoviev Alexey
It’s very easy to understand Graph algorithms
def pregel[A](initialMsg: A, maxIterations: Int, activeDirection: EdgeDirection)
             (vprog: (VertexID, VD, A) => VD,
              sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexID, A)],
              mergeMsg: (A, A) => A)
  : Graph[VD, ED] = { <Your Pregel Computations> }
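To see how that signature is used, here is a minimal single-source shortest path sketch adapted from the standard GraphX Pregel example (it assumes a live SparkContext sc; the random test graph is only for illustration):

import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.graphx.util.GraphGenerators

// A random graph whose edge attributes are used as distances (illustration only)
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)

val sourceId: VertexId = 42
// Every vertex starts "infinitely far" from the source, except the source itself
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),            // vprog: keep the best distance seen
  triplet =>                                                  // sendMsg: relax every edge
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,
  (a, b) => math.min(a, b)                                    // mergeMsg: take the smaller distance
)

println(sssp.vertices.collect.mkString("\n"))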
41Spark 2 from Zinoviev Alexey
SPARK 2.0 DISCUSSION
42Spark 2 from Zinoviev Alexey
Spark
Family
43Spark 2 from Zinoviev Alexey
Spark
Family
44Spark 2 from Zinoviev Alexey
Case #0 : How to avoid DStreams with RDD-like API?
45Spark 2 from Zinoviev Alexey
Continuous Applications
46Spark 2 from Zinoviev Alexey
Continuous Applications cases
• Updating data that will be served in real time
• Extract, transform and load (ETL)
• Creating a real-time version of an existing batch job
• Online machine learning
47Spark 2 from Zinoviev Alexey
Write Batches
48Spark 2 from Zinoviev Alexey
Catch Streaming
49Spark 2 from Zinoviev Alexey
The main concept of Structured Streaming
You can express your streaming computation the
same way you would express a batch computation
on static data.
50Spark 2 from Zinoviev Alexey
Batch
// Read JSON once from S3
val logsDF = spark.read.json("s3://logs")
// Transform with DataFrame API and save
logsDF.select("user", "url", "date")
  .write.parquet("s3://out")
51Spark 2 from Zinoviev Alexey
Real
Time
// Read JSON continuously from S3
val logsDF = spark.readStream.json("s3://logs")
// Transform with DataFrame API and save
logsDF.select("user", "url", "date")
  .writeStream.format("parquet")
  .start("s3://out")
52Spark 2 from Zinoviev Alexey
Unlimited Table
53Spark 2 from Zinoviev Alexey
WordCount
from
Socket
val lines = spark.readStream
.format("socket")
.option("host", "localhost")
.option("port", 9999)
.load()
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()
Don’t forget to start Streaming
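A minimal sketch of that missing piece, assuming the wordCounts Dataset from the code above (the output mode and console sink are illustrative choices):

// Nothing runs until a streaming query is started
val query = wordCounts.writeStream
  .outputMode("complete")   // print the full, updated counts on every trigger
  .format("console")
  .start()

query.awaitTermination()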
57Spark 2 from Zinoviev Alexey
WordCount with Structured Streaming
58Spark 2 from Zinoviev Alexey
Structured Streaming provides …
• fast & scalable
• fault-tolerant
• end-to-end with exactly-once semantics
• stream processing
• ability to use DataFrame/DataSet API for streaming
59Spark 2 from Zinoviev Alexey
Structured Streaming provides (in dreams) …
• fast & scalable
• fault-tolerant
• end-to-end with exactly-once semantics
• stream processing
• ability to use DataFrame/DataSet API for streaming
60Spark 2 from Zinoviev Alexey
Let’s UNION streaming and static DataSets
61Spark 2 from Zinoviev Alexey
Let’s UNION streaming and static DataSets
org.apache.spark.sql.AnalysisException:
Union between streaming and batch DataFrames/Datasets is not supported;
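A minimal sketch of the kind of code that hits this check in Spark 2.0 (paths and schema handling are illustrative):

// Static (batch) DataFrame
val staticDF = spark.read.json("/tmp/logs-archive")

// Streaming DataFrame over the same schema
val streamingDF = spark.readStream
  .schema(staticDF.schema)
  .json("/tmp/logs-incoming")

// Spark 2.0 rejects this with the AnalysisException shown above
val combined = streamingDF.union(staticDF)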
62Spark 2 from Zinoviev Alexey
Let’s UNION streaming and static DataSets
Go to UnsupportedOperationChecker.scala and check your
operation
63Spark 2 from Zinoviev Alexey
Case #1 : We should think about optimization in RDD terms
64Spark 2 from Zinoviev Alexey
Single
Thread
collection
65Spark 2 from Zinoviev Alexey
No perf
issues,
right?
66Spark 2 from Zinoviev Alexey
The main concept
more partitions = more parallelism
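A quick sketch of inspecting and changing that parallelism, assuming a SparkContext sc as in the earlier examples:

val logs = sc.textFile("hdfs://...", minPartitions = 8)
println(logs.getNumPartitions)        // number of tasks in a stage over this RDD

val wider    = logs.repartition(64)   // full shuffle, more parallelism
val narrower = wider.coalesce(16)     // fewer partitions without a full shuffle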
67Spark 2 from Zinoviev Alexey
Do it
parallel
68Spark 2 from Zinoviev Alexey
I’d like
NARROW
69Spark 2 from Zinoviev Alexey
Map, filter, filter
70Spark 2 from Zinoviev Alexey
GroupByKey, join
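The difference in one sketch (illustrative data, assuming a SparkContext sc): narrow transformations stay inside a partition, wide ones shuffle data across the cluster.

val nums = sc.parallelize(1 to 100, numSlices = 4)

// Narrow: each output partition depends on exactly one input partition, so no shuffle
val narrow = nums.map(_ * 2).filter(_ > 10)

// Wide: results depend on many input partitions, so a shuffle is required
val pairs   = nums.map(n => (n % 10, n))
val grouped = pairs.groupByKey()
val reduced = pairs.reduceByKey(_ + _)   // also shuffles, but combines values map-side first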
71Spark 2 from Zinoviev Alexey
Case #2 : DataFrames suggest mixing SQL and Scala functions
72Spark 2 from Zinoviev Alexey
History of Spark APIs
73Spark 2 from Zinoviev Alexey
RDD
rdd.filter(_.age > 21) // RDD
df.filter("age > 21") // DataFrame SQL-style
df.filter(df.col("age").gt(21)) // Expression style
dataset.filter(_.age > 21) // Dataset API
74Spark 2 from Zinoviev Alexey
History of Spark APIs
75Spark 2 from Zinoviev Alexey
SQL
rdd.filter(_.age > 21) // RDD
df.filter("age > 21") // DataFrame SQL-style
df.filter(df.col("age").gt(21)) // Expression style
dataset.filter(_.age > 21) // Dataset API
76Spark 2 from Zinoviev Alexey
Expression
rdd.filter(_.age > 21) // RDD
df.filter("age > 21") // DataFrame SQL-style
df.filter(df.col("age").gt(21)) // Expression style
dataset.filter(_.age > 21) // Dataset API
77Spark 2 from Zinoviev Alexey
History of Spark APIs
78Spark 2 from Zinoviev Alexey
DataSet
rdd.filter(_.age > 21) // RDD
df.filter("age > 21") // DataFrame SQL-style
df.filter(df.col("age").gt(21)) // Expression style
dataset.filter(_.age > 21) // Dataset API
79Spark 2 from Zinoviev Alexey
Case #2 : DataFrame refers to data attributes by name
80Spark 2 from Zinoviev Alexey
DataSet = RDD’s types + DataFrame’s Catalyst
• RDD API
• compile-time type-safety
• off-heap storage mechanism
• performance benefits of the Catalyst query optimizer
• Tungsten
82Spark 2 from Zinoviev Alexey
Structured APIs in SPARK
83Spark 2 from Zinoviev Alexey
Unified API in Spark 2.0
DataFrame = Dataset[Row]
A DataFrame is now just an untyped Dataset of Row objects (it still has a schema, but no compile-time typing)
84Spark 2 from Zinoviev Alexey
Define
case class
case class User(email: String, footSize: Long, name: String)
// DataFrame -> DataSet with Users
val userDS =
spark.read.json("/home/tmp/datasets/users.json").as[User]
userDS.map(_.name).collect()
userDS.filter(_.footSize > 38).collect()
userDS.rdd // IF YOU REALLY WANT
85Spark 2 from Zinoviev Alexey
Read JSON
case class User(email: String, footSize: Long, name: String)
// DataFrame -> DataSet with Users
val userDS =
spark.read.json("/home/tmp/datasets/users.json").as[User]
userDS.map(_.name).collect()
userDS.filter(_.footSize > 38).collect()
userDS.rdd // IF YOU REALLY WANT
86Spark 2 from Zinoviev Alexey
Filter by
Field
case class User(email: String, footSize: Long, name: String)
// DataFrame -> DataSet with Users
val userDS =
spark.read.json("/home/tmp/datasets/users.json").as[User]
userDS.map(_.name).collect()
userDS.filter(_.footSize > 38).collect()
userDS.rdd // IF YOU REALLY WANT
87Spark 2 from Zinoviev Alexey
Case #3 : Spark has many contexts
88Spark 2 from Zinoviev Alexey
Spark Session
• New entry point in Spark for creating Datasets
• Replaces SQLContext and HiveContext (and takes over StreamingContext's role for Structured Streaming)
• The move from SparkContext to SparkSession signals a move away from raw RDDs
89Spark 2 from Zinoviev Alexey
Spark
Session
val sparkSession = SparkSession.builder
.master("local")
.appName("spark session example")
.getOrCreate()
val df = sparkSession.read
.option("header","true")
.csv("src/main/resources/names.csv")
df.show()
90Spark 2 from Zinoviev Alexey
No, I want to create my lovely RDDs
91Spark 2 from Zinoviev Alexey
Where’s parallelize() method?
92Spark 2 from Zinoviev Alexey
RDD?
case class User(email: String, footSize: Long, name: String)
// DataFrame -> DataSet with Users
val userDS =
spark.read.json("/home/tmp/datasets/users.json").as[User]
userDS.map(_.name).collect()
userDS.filter(_.footSize > 38).collect()
userDS.rdd // IF YOU REALLY WANT
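It did not disappear: SparkContext still lives inside SparkSession. A sketch of both routes (the local collection is illustrative; spark is the SparkSession):

// The old way: pull the SparkContext out of the session
val rdd = spark.sparkContext.parallelize(Seq(5, 7, 1, 12, 10, 25))

// The Spark 2 way: build a Dataset directly from a local collection
import spark.implicits._
val ds  = Seq(5, 7, 1, 12, 10, 25).toDS()
val ds2 = spark.createDataset(Seq(5, 7, 1, 12, 10, 25))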
93Spark 2 from Zinoviev Alexey
Case #4 : Spark uses Java serialization A LOT
94Spark 2 from Zinoviev Alexey
Two choices for serializing data across the cluster
• Java serialization
The default, via ObjectOutputStream
• Kryo serialization
Faster and more compact, but requires registering classes (it does not rely on java.io.Serializable)
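Switching to Kryo is a couple of lines on the SparkConf; a sketch (the User case class mirrors the one used later in this talk):

import org.apache.spark.SparkConf

case class User(email: String, footSize: Long, name: String)

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[User]))
  .set("spark.kryo.registrationRequired", "true")   // optional: fail fast on unregistered classes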
95Spark 2 from Zinoviev Alexey
The main problem: overhead of serializing
Each serialized object contains the class structure as
well as the values
96Spark 2 from Zinoviev Alexey
The main problem: overhead of serializing
Each serialized object contains the class structure as
well as the values
Don’t forget about GC
97Spark 2 from Zinoviev Alexey
Tungsten Compact Encoding
98Spark 2 from Zinoviev Alexey
Encoder’s concept
Generate bytecode to interact with off-heap
&
Give access to attributes without ser/deser
99Spark 2 from Zinoviev Alexey
Encoders
100Spark 2 from Zinoviev Alexey
No custom encoders
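There is no public hook for plugging in your own encoder in Spark 2.0; you either stay with the built-in ones or fall back to an opaque Kryo (or Java serialization) encoder. A sketch, with LegacyUser as an illustrative non-case class and spark as the SparkSession:

import org.apache.spark.sql.Encoders

// Case classes and primitives are covered by built-in encoders
case class User(email: String, footSize: Long, name: String)
val userEncoder = Encoders.product[User]

// For arbitrary classes the escape hatch is Kryo: easy to write, but the resulting
// Dataset is a binary blob for Catalyst (no column pruning or filter pushdown)
class LegacyUser(val name: String)
val legacyDS = spark.createDataset(Seq(new LegacyUser("bob")))(Encoders.kryo[LegacyUser])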
101Spark 2 from Zinoviev Alexey
Case #5 : Not enough storage levels :(
102Spark 2 from Zinoviev Alexey
Caching in Spark
• Frequently used RDDs can be stored in memory
• One method, one shortcut: persist(), cache()
• SparkContext keeps track of cached RDDs
• Stored as serialized or deserialized Java objects
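In code it is a one-liner; choosing the storage level is the only real decision. A sketch, assuming a SparkContext sc:

import org.apache.spark.storage.StorageLevel

val events = sc.textFile("hdfs://...")
events.persist(StorageLevel.MEMORY_AND_DISK_SER)   // cache() is just persist(MEMORY_ONLY)
events.count()       // the first action actually materializes the cached data
events.unpersist()   // release it when it is no longer needed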
103Spark 2 from Zinoviev Alexey
Full list of options
• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2
104Spark 2 from Zinoviev Alexey
Spark Core Storage Level
• MEMORY_ONLY (default for Spark Core)
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2
105Spark 2 from Zinoviev Alexey
Spark Streaming Storage Level
• MEMORY_ONLY (default for Spark Core)
• MEMORY_AND_DISK
• MEMORY_ONLY_SER (default for Spark Streaming)
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2
106Spark 2 from Zinoviev Alexey
Developer API to make new Storage Levels
107Spark 2 from Zinoviev Alexey
What’s the most popular file format in BigData?
108Spark 2 from Zinoviev Alexey
Case #6 : External libraries to read CSV
109Spark 2 from Zinoviev Alexey
Easy to
read CSV
data = sqlContext.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/datasets/samples/users.csv")
data.cache()
data.createOrReplaceTempView("users")
display(data)
110Spark 2 from Zinoviev Alexey
Case #7 : How to measure Spark performance?
111Spark 2 from Zinoviev Alexey
You should measure performance!
112Spark 2 from Zinoviev Alexey
TPCDS
99 Queries
http://bit.ly/2dObMsH
113Spark 2 from Zinoviev Alexey
114Spark 2 from Zinoviev Alexey
How to benchmark Spark
115Spark 2 from Zinoviev Alexey
Special Tool from Databricks
Benchmark Tool for SparkSQL
https://github.com/databricks/spark-sql-perf
116Spark 2 from Zinoviev Alexey
Spark 2 vs Spark 1.6
117Spark 2 from Zinoviev Alexey
Case #8 : What’s faster: SQL or DataSet API?
118Spark 2 from Zinoviev Alexey
Job Stages in old Spark
119Spark 2 from Zinoviev Alexey
Scheduler Optimizations
120Spark 2 from Zinoviev Alexey
Catalyst Optimizer for DataFrames
121Spark 2 from Zinoviev Alexey
Unified Logical Plan
DataFrame = Dataset[Row]
122Spark 2 from Zinoviev Alexey
Bytecode
123Spark 2 from Zinoviev Alexey
DataSet.explain()
== Physical Plan ==
Project [avg(price)#43,carat#45]
+- SortMergeJoin [color#21], [color#47]
:- Sort [color#21 ASC], false, 0
: +- TungstenExchange hashpartitioning(color#21,200), None
: +- Project [avg(price)#43,color#21]
: +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as
bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43])
: +- TungstenExchange hashpartitioning(cut#20,color#21,200), None
: +- TungstenAggregate(key=[cut#20,color#21],
functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)],
output=[cut#20,color#21,sum#58,count#59L])
: +- Scan CsvRelation(-----)
+- Sort [color#47 ASC], false, 0
+- TungstenExchange hashpartitioning(color#47,200), None
+- ConvertToUnsafe
+- Scan CsvRelation(----)
124Spark 2 from Zinoviev Alexey
Case #9 : Why does explain() show so many Tungsten things?
125Spark 2 from Zinoviev Alexey
How to be effective with CPU
• Runtime code generation
• Exploiting cache locality
• Off-heap memory management
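Runtime code generation is easy to observe: in Spark 2.x, explain() prints operators that were fused by whole-stage code generation with a leading asterisk. A tiny sketch, assuming a SparkSession spark:

spark.range(1000000)
  .selectExpr("id * 2 AS doubled")
  .filter("doubled > 100")
  .explain()   // fused operators show up with a '*' prefix in the physical plan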
126Spark 2 from Zinoviev Alexey
Tungsten’s goal
Push performance closer to the limits of modern
hardware
127Spark 2 from Zinoviev Alexey
Maybe something UNSAFE?
128Spark 2 from Zinoviev Alexey
UnsafeRowFormat
• Bit set for tracking null values
• Small values are inlined
• Variable-length values store a relative offset into the variable-length data section
• Rows are always 8-byte word aligned
• Equality comparison and hashing can be performed on raw
bytes without requiring additional interpretation
129Spark 2 from Zinoviev Alexey
Case #10 : Can I influence Memory Management in Spark?
130Spark 2 from Zinoviev Alexey
Case #11 : Should I tune the GC generations?
131Spark 2 from Zinoviev Alexey
Cached
Data
132Spark 2 from Zinoviev Alexey
During
operations
133Spark 2 from Zinoviev Alexey
For your
needs
134Spark 2 from Zinoviev Alexey
For Dark
Lord
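Yes, within limits. A sketch of the main knobs behind these regions in Spark 2's unified memory manager (the values are illustrative, not recommendations):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-tuning")
  .config("spark.executor.memory", "4g")
  .config("spark.memory.fraction", "0.6")            // share of the heap for execution + storage
  .config("spark.memory.storageFraction", "0.5")     // part of that share protected for cached data
  .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")   // GC / generation tuning goes here
  .getOrCreate()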
135Spark 2 from Zinoviev Alexey
IN CONCLUSION
136Spark 2 from Zinoviev Alexey
We still don't have…
• the ability to join Structured Streaming with other sources
• one unified ML API
• a rethought and redesigned GraphX
• custom encoders
• Datasets everywhere
• integration with something important
137Spark 2 from Zinoviev Alexey
Roadmap
• Support other data sources (not only S3 + HDFS)
• Transactional updates
• Dataset is one DSL for all operations
• GraphFrames + Structured MLLib
• Tungsten: custom encoders
• The RDD-based MLlib API is expected to be removed in Spark 3.0
138Spark 2 from Zinoviev Alexey
And we can DO IT!
[Architecture diagram] Key components: Real-time Data Ingestion and Batch Data Ingestion (Internal, External and Social sources); Real-Time ETL & CEP; Batch ETL & Raw Area (all raw data backup is stored here); Scheduler; Real-Time Data-Marts and Batch Data-Marts; Relations Graph, Ontology Metadata, Search Index, Time-Series Data; Events & Alarms and Real-time Dashboarding; Push Events & Alarms (Email, SNMP etc.); HDFS → CFS as an option; Titan & KairosDB store data in Cassandra.
139Spark 2 from Zinoviev Alexey
First Part
140Spark 2 from Zinoviev Alexey
Second
Part
141Spark 2 from Zinoviev Alexey
Contacts
E-mail : Alexey_Zinovyev@epam.com
Twitter : @zaleslaw @BigDataRussia
Facebook: https://www.facebook.com/zaleslaw
vk.com/big_data_russia Big Data Russia
vk.com/java_jvm Java & JVM langs
142Spark 2 from Zinoviev Alexey
Any questions? :)
  • 2. About With IT since 2007 With Java since 2009 With Hadoop since 2012 With EPAM since 2015
  • 3. 3Spark 2 from Zinoviev Alexey Contacts E-mail : Alexey_Zinovyev@epam.com Twitter : @zaleslaw @BigDataRussia Facebook: https://www.facebook.com/zaleslaw vk.com/big_data_russia Big Data Russia vk.com/java_jvm Java & JVM langs
  • 4. 4Spark 2 from Zinoviev Alexey Sprk Dvlprs! Let’s start!
  • 5. 5Spark 2 from Zinoviev Alexey < SPARK 2.0
  • 6. 6Spark 2 from Zinoviev Alexey Modern Java in 2016 / Big Data in 2014
  • 7. 7Spark 2 from Zinoviev Alexey Big Data in 2017
  • 8. 8Spark 2 from Zinoviev Alexey Machine Learning EVERYWHERE
  • 9. 9Spark 2 from Zinoviev Alexey Machine Learning vs Traditional Programming
  • 10. 10Spark 2 from Zinoviev Alexey
  • 11. 11Spark 2 from Zinoviev Alexey Something wrong with HADOOP
  • 12. 12Spark 2 from Zinoviev Alexey Hadoop is not SEXY
  • 13. 13Spark 2 from Zinoviev Alexey Whaaaat?
  • 14. 14Spark 2 from Zinoviev Alexey Map Reduce Job Writing
  • 15. 15Spark 2 from Zinoviev Alexey MR code
  • 16. 16Spark 2 from Zinoviev Alexey Hadoop Developers Right Now
  • 17. 17Spark 2 from Zinoviev Alexey Iterative Calculations 10x – 100x
  • 18. 18Spark 2 from Zinoviev Alexey MapReduce vs Spark
  • 19. 19Spark 2 from Zinoviev Alexey MapReduce vs Spark
  • 20. 20Spark 2 from Zinoviev Alexey MapReduce vs Spark
  • 21. 21Spark 2 from Zinoviev Alexey SPARK INTRO
  • 22. 22Spark 2 from Zinoviev Alexey Loading val localData = (5,7,1,12,10,25) val ourFirstRDD = sc.parallelize(localData) val textFile = sc.textFile("hdfs://...")
  • 23. 23Spark 2 from Zinoviev Alexey Loading val localData = (5,7,1,12,10,25) val ourFirstRDD = sc.parallelize(localData) val textFile = sc.textFile("hdfs://...")
  • 24. 24Spark 2 from Zinoviev Alexey Loading val localData = (5,7,1,12,10,25) val ourFirstRDD = sc.parallelize(localData) // from file val textFile = sc.textFile("hdfs://...")
  • 25. 25Spark 2 from Zinoviev Alexey Loading val localData = (5,7,1,12,10,25) val ourFirstRDD = sc.parallelize(localData) // from file val textFile = sc.textFile("hdfs://...")
  • 26. 26Spark 2 from Zinoviev Alexey Loading val localData = (5,7,1,12,10,25) val ourFirstRDD = sc.parallelize(localData) // from file val textFile = sc.textFile("hdfs://...")
  • 27. 27Spark 2 from Zinoviev Alexey Word Count val textFile = sc.textFile("hdfs://...") val counts = textFile .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  • 28. 28Spark 2 from Zinoviev Alexey Word Count val textFile = sc.textFile("hdfs://...") val counts = textFile .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  • 29. 29Spark 2 from Zinoviev Alexey Word Count val textFile = sc.textFile("hdfs://...") val counts = textFile .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  • 30. 30Spark 2 from Zinoviev Alexey SQL val hive = new HiveContext(spark) hive.hql(“CREATE TABLE IF NOT EXISTS src (key INT, value STRING)”) hive.hql(“LOAD DATA LOCAL INPATH ‘…/kv1.txt’ INTO TABLE src”) val results = hive.hql(“FROM src SELECT key, value”).collect()
  • 31. 31Spark 2 from Zinoviev Alexey SQL val hive = new HiveContext(spark) hive.hql(“CREATE TABLE IF NOT EXISTS src (key INT, value STRING)”) hive.hql(“LOAD DATA LOCAL INPATH ‘…/kv1.txt’ INTO TABLE src”) val results = hive.hql(“FROM src SELECT key, value”).collect()
  • 32. 32Spark 2 from Zinoviev Alexey SQL val hive = new HiveContext(spark) hive.hql(“CREATE TABLE IF NOT EXISTS src (key INT, value STRING)”) hive.hql(“LOAD DATA LOCAL INPATH ‘…/kv1.txt’ INTO TABLE src”) val results = hive.hql(“FROM src SELECT key, value”).collect()
  • 33. 33Spark 2 from Zinoviev Alexey SQL val hive = new HiveContext(spark) hive.hql(“CREATE TABLE IF NOT EXISTS src (key INT, value STRING)”) hive.hql(“LOAD DATA LOCAL INPATH ‘…/kv1.txt’ INTO TABLE src”) val results = hive.hql(“FROM src SELECT key, value”).collect()
  • 34. 34Spark 2 from Zinoviev Alexey RDD Factory
  • 35. 35Spark 2 from Zinoviev Alexey DStream val conf = new SparkConf().setMaster("local[2]") .setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.socketTextStream("localhost", 9999) val words = lines.flatMap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination()
  • 36. 36Spark 2 from Zinoviev Alexey DStream val conf = new SparkConf().setMaster("local[2]") .setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.socketTextStream("localhost", 9999) val words = lines.flatMap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination()
  • 37. 37Spark 2 from Zinoviev Alexey DStream val conf = new SparkConf().setMaster("local[2]") .setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.socketTextStream("localhost", 9999) val words = lines.flatMap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination()
  • 38. 38Spark 2 from Zinoviev Alexey DStream val conf = new SparkConf().setMaster("local[2]") .setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.socketTextStream("localhost", 9999) val words = lines.flatMap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination()
  • 39. 39Spark 2 from Zinoviev Alexey DStream val conf = new SparkConf().setMaster("local[2]") .setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.socketTextStream("localhost", 9999) val words = lines.flatMap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination()
  • 40. 40Spark 2 from Zinoviev Alexey It’s very easy to understand Graph algorithms def pregel[A](initialMsg: A, maxIterations: Int, activeDirection: EdgeDirection)( vprog: (VertexID, VD, A) => VD, sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexID,A)], mergeMsg: (A, A) => A) : Graph[VD, ED] = { <Your Pregel Computations> }
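The slide leaves the vertex program as a placeholder. A hedged illustration of what could go there is the canonical single-source shortest paths computation; graph (assumed to be a Graph[Double, Double]) and sourceId are not from the deck:

    import org.apache.spark.graphx._

    // Start every vertex at infinity except the source
    val initialGraph = graph.mapVertices((id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity)

    val sssp = initialGraph.pregel(Double.PositiveInfinity)(
      (id, dist, newDist) => math.min(dist, newDist),            // vprog: keep the shorter distance
      triplet =>                                                 // sendMsg: relax edges
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else Iterator.empty,
      (a, b) => math.min(a, b)                                   // mergeMsg: min of incoming messages
    )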
  • 41. 41Spark 2 from Zinoviev Alexey SPARK 2.0 DISCUSSION
  • 42. 42Spark 2 from Zinoviev Alexey Spark Family
  • 43. 43Spark 2 from Zinoviev Alexey Spark Family
  • 44. 44Spark 2 from Zinoviev Alexey Case #0 : How to avoid DStreams with RDD-like API?
  • 45. 45Spark 2 from Zinoviev Alexey Continuous Applications
  • 46. 46Spark 2 from Zinoviev Alexey Continuous Applications cases • Updating data that will be served in real time • Extract, transform and load (ETL) • Creating a real-time version of an existing batch job • Online machine learning
  • 47. 47Spark 2 from Zinoviev Alexey Write Batches
  • 48. 48Spark 2 from Zinoviev Alexey Catch Streaming
  • 49. 49Spark 2 from Zinoviev Alexey The main concept of Structured Streaming You can express your streaming computation the same way you would express a batch computation on static data.
  • 50. 50Spark 2 from Zinoviev Alexey Batch // Read JSON once from S3 logsDF = spark.read.json("s3://logs") // Transform with DataFrame API and save logsDF.select("user", "url", "date") .write.parquet("s3://out")
  • 51. 51Spark 2 from Zinoviev Alexey Real Time // Read JSON continuously from S3 logsDF = spark.readStream.json("s3://logs") // Transform with DataFrame API and save logsDF.select("user", "url", "date") .writeStream.parquet("s3://out") .start()
  • 52. 52Spark 2 from Zinoviev Alexey Unlimited Table
  • 53. 53Spark 2 from Zinoviev Alexey WordCount from Socket val lines = spark.readStream .format("socket") .option("host", "localhost") .option("port", 9999) .load() val words = lines.as[String].flatMap(_.split(" ")) val wordCounts = words.groupBy("value").count()
  • 54. 54Spark 2 from Zinoviev Alexey WordCount from Socket val lines = spark.readStream .format("socket") .option("host", "localhost") .option("port", 9999) .load() val words = lines.as[String].flatMap(_.split(" ")) val wordCounts = words.groupBy("value").count()
  • 55. 55Spark 2 from Zinoviev Alexey WordCount from Socket val lines = spark.readStream .format("socket") .option("host", "localhost") .option("port", 9999) .load() val words = lines.as[String].flatMap(_.split(" ")) val wordCounts = words.groupBy("value").count()
  • 56. 56Spark 2 from Zinoviev Alexey WordCount from Socket val lines = spark.readStream .format("socket") .option("host", "localhost") .option("port", 9999) .load() val words = lines.as[String].flatMap(_.split(" ")) val wordCounts = words.groupBy("value").count() Don’t forget to start Streaming
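A minimal sketch of the start call the slide hints at; the console sink and complete output mode are assumptions, not part of the deck:

    val query = wordCounts.writeStream
      .outputMode("complete")   // streaming aggregations need complete or update mode
      .format("console")
      .start()

    query.awaitTermination()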
  • 57. 57Spark 2 from Zinoviev Alexey WordCount with Structured Streaming
  • 58. 58Spark 2 from Zinoviev Alexey Structured Streaming provides … • fast & scalable • fault-tolerant • end-to-end with exactly-once semantic • stream processing • ability to use DataFrame/DataSet API for streaming
  • 59. 59Spark 2 from Zinoviev Alexey Structured Streaming provides (in dreams) … • fast & scalable • fault-tolerant • end-to-end with exactly-once semantic • stream processing • ability to use DataFrame/DataSet API for streaming
  • 60. 60Spark 2 from Zinoviev Alexey Let’s UNION streaming and static DataSets
  • 61. 61Spark 2 from Zinoviev Alexey Let’s UNION streaming and static DataSets org.apache.spark.sql. AnalysisException: Union between streaming and batch DataFrames/Datasets is not supported;
  • 62. 62Spark 2 from Zinoviev Alexey Let’s UNION streaming and static DataSets Go to UnsupportedOperationChecker.scala and check your operation
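A hedged sketch that reproduces the failure from the previous slide; the paths and the reuse of the User case class (defined later, on slide 84) are assumptions:

    import spark.implicits._

    val staticUsers = spark.read.json("/tmp/users-static.json").as[User]

    val streamUsers = spark.readStream
      .schema(staticUsers.schema)        // streaming file sources need an explicit schema
      .json("/tmp/users-stream/")
      .as[User]

    // Throws org.apache.spark.sql.AnalysisException in Spark 2.0:
    // "Union between streaming and batch DataFrames/Datasets is not supported"
    val combined = streamUsers.union(staticUsers)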
  • 63. 63Spark 2 from Zinoviev Alexey Case #1 : We should think about optimization in RDD terms
  • 64. 64Spark 2 from Zinoviev Alexey Single Thread collection
  • 65. 65Spark 2 from Zinoviev Alexey No perf issues, right?
  • 66. 66Spark 2 from Zinoviev Alexey The main concept more partitions = more parallelism
  • 67. 67Spark 2 from Zinoviev Alexey Do it parallel
  • 68. 68Spark 2 from Zinoviev Alexey I’d like NARROW
  • 69. 69Spark 2 from Zinoviev Alexey Map, filter, filter
  • 70. 70Spark 2 from Zinoviev Alexey GroupByKey, join
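A small sketch contrasting the two kinds of transformations named on the last slides; the data and the partition count are made up for illustration:

    val nums = sc.parallelize(1 to 100, 4)

    // Narrow: each output partition depends on one input partition, no shuffle
    val narrow = nums.map(_ * 2).filter(_ % 3 == 0)

    // Wide: groupByKey and join repartition data across the cluster (shuffle)
    val pairs   = nums.map(n => (n % 10, n))
    val grouped = pairs.groupByKey()

    // reduceByKey combines values locally before the shuffle, so it is usually cheaper
    val sums = pairs.reduceByKey(_ + _)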
  • 71. 71Spark 2 from Zinoviev Alexey Case #2 : DataFrames suggest mix SQL and Scala functions
  • 72. 72Spark 2 from Zinoviev Alexey History of Spark APIs
  • 73. 73Spark 2 from Zinoviev Alexey RDD rdd.filter(_.age > 21) // RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API
  • 74. 74Spark 2 from Zinoviev Alexey History of Spark APIs
  • 75. 75Spark 2 from Zinoviev Alexey SQL rdd.filter(_.age > 21) // RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API
  • 76. 76Spark 2 from Zinoviev Alexey Expression rdd.filter(_.age > 21) // RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API
  • 77. 77Spark 2 from Zinoviev Alexey History of Spark APIs
  • 78. 78Spark 2 from Zinoviev Alexey DataSet rdd.filter(_.age > 21) // RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API
  • 79. 79Spark 2 from Zinoviev Alexey Case #2 : DataFrame is referring to data attributes by name
  • 80. 80Spark 2 from Zinoviev Alexey DataSet = RDD’s types + DataFrame’s Catalyst • RDD API • compile-time type-safety • off-heap storage mechanism • performance benefits of the Catalyst query optimizer • Tungsten
  • 81. 81Spark 2 from Zinoviev Alexey DataSet = RDD’s types + DataFrame’s Catalyst • RDD API • compile-time type-safety • off-heap storage mechanism • performance benefits of the Catalyst query optimizer • Tungsten
  • 82. 82Spark 2 from Zinoviev Alexey Structured APIs in SPARK
  • 83. 83Spark 2 from Zinoviev Alexey Unified API in Spark 2.0 DataFrame = Dataset[Row] Dataframe is a schemaless (untyped) Dataset now
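A short sketch of what the alias means in practice (the column name is arbitrary):

    import org.apache.spark.sql.{DataFrame, Dataset, Row}

    val typed: Dataset[java.lang.Long] = spark.range(5)   // typed Dataset
    val untyped: DataFrame = typed.toDF("id")              // same data, untyped rows
    val backAgain: Dataset[Row] = untyped                  // compiles: DataFrame == Dataset[Row]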
  • 84. 84Spark 2 from Zinoviev Alexey Define case class case class User(email: String, footSize: Long, name: String) // DataFrame -> DataSet with Users val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User] userDS.map(_.name).collect() userDS.filter(_.footSize > 38).collect() ds.rdd // IF YOU REALLY WANT
  • 85. 85Spark 2 from Zinoviev Alexey Read JSON case class User(email: String, footSize: Long, name: String) // DataFrame -> DataSet with Users val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User] userDS.map(_.name).collect() userDS.filter(_.footSize > 38).collect() ds.rdd // IF YOU REALLY WANT
  • 86. 86Spark 2 from Zinoviev Alexey Filter by Field case class User(email: String, footSize: Long, name: String) // DataFrame -> DataSet with Users val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User] userDS.map(_.name).collect() userDS.filter(_.footSize > 38).collect() ds.rdd // IF YOU REALLY WANT
  • 87. 87Spark 2 from Zinoviev Alexey Case #3 : Spark has many contexts
  • 88. 88Spark 2 from Zinoviev Alexey Spark Session • New entry point in spark for creating datasets • Replaces SQLContext, HiveContext and StreamingContext • Move from SparkContext to SparkSession signifies move away from RDD
  • 89. 89Spark 2 from Zinoviev Alexey Spark Session val sparkSession = SparkSession.builder .master("local") .appName("spark session example") .getOrCreate() val df = sparkSession.read .option("header","true") .csv("src/main/resources/names.csv") df.show()
  • 90. 90Spark 2 from Zinoviev Alexey No, I want to create my lovely RDDs
  • 91. 91Spark 2 from Zinoviev Alexey Where’s parallelize() method?
  • 92. 92Spark 2 from Zinoviev Alexey RDD? case class User(email: String, footSize: Long, name: String) // DataFrame -> DataSet with Users val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User] userDS.map(_.name).collect() userDS.filter(_.footSize > 38).collect() ds.rdd // IF YOU REALLY WANT
  • 93. 93Spark 2 from Zinoviev Alexey Case #4 : Spark uses Java serialization A LOT
  • 94. 94Spark 2 from Zinoviev Alexey Two choices to distribute data across the cluster • Java serialization By default with ObjectOutputStream • Kryo serialization Should register classes (no support of Serializable)
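A minimal sketch of opting into Kryo; the application name is arbitrary, and the User case class from slide 84 is reused only as an example of class registration:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[User]))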
  • 95. 95Spark 2 from Zinoviev Alexey The main problem: overhead of serializing Each serialized object contains the class structure as well as the values
  • 96. 96Spark 2 from Zinoviev Alexey The main problem: overhead of serializing Each serialized object contains the class structure as well as the values Don’t forget about GC
  • 97. 97Spark 2 from Zinoviev Alexey Tungsten Compact Encoding
  • 98. 98Spark 2 from Zinoviev Alexey Encoder’s concept Generate bytecode to interact with off-heap & Give access to attributes without ser/deser
  • 99. 99Spark 2 from Zinoviev Alexey Encoders
  • 100. 100Spark 2 from Zinoviev Alexey No custom encoders
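Since custom encoders are not supported, a hedged sketch of the two options that do exist: the Product encoder derived for case classes, and the Kryo binary encoder as an escape hatch (LegacyThing is a hypothetical non-case class):

    import org.apache.spark.sql.{Encoder, Encoders}

    // Derived, columnar encoder for a case class (reuses User from slide 84)
    val userEncoder: Encoder[User] = Encoders.product[User]
    println(userEncoder.schema)

    // Escape hatch: the object is stored as one opaque binary column
    class LegacyThing(val id: Int)
    implicit val legacyEncoder: Encoder[LegacyThing] = Encoders.kryo[LegacyThing]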
  • 101. 101Spark 2 from Zinoviev Alexey Case #5 : Not enough storage levels 
  • 102. 102Spark 2 from Zinoviev Alexey Caching in Spark • Frequently used RDD can be stored in memory • One method, one short-cut: persist(), cache() • SparkContext keeps track of cached RDD • Serialized or deserialized Java objects
  • 103. 103Spark 2 from Zinoviev Alexey Full list of options • MEMORY_ONLY • MEMORY_AND_DISK • MEMORY_ONLY_SER • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2
  • 104. 104Spark 2 from Zinoviev Alexey Spark Core Storage Level • MEMORY_ONLY (default for Spark Core) • MEMORY_AND_DISK • MEMORY_ONLY_SER • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2
  • 105. 105Spark 2 from Zinoviev Alexey Spark Streaming Storage Level • MEMORY_ONLY (default for Spark Core) • MEMORY_AND_DISK • MEMORY_ONLY_SER (default for Spark Streaming) • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2
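A small sketch of picking a non-default level explicitly; the path is hypothetical:

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///logs/2016/*")
    lines.persist(StorageLevel.MEMORY_AND_DISK_SER)   // cache() would mean MEMORY_ONLY
    lines.count()                                     // the first action materializes the cache
    lines.unpersist()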
  • 106. 106Spark 2 from Zinoviev Alexey Developer API to make new Storage Levels
  • 107. 107Spark 2 from Zinoviev Alexey What’s the most popular file format in BigData?
  • 108. 108Spark 2 from Zinoviev Alexey Case #6 : External libraries to read CSV
  • 109. 109Spark 2 from Zinoviev Alexey Easy to read CSV data = sqlContext.read.format("csv") .option("header", "true") .option("inferSchema", "true") .load("/datasets/samples/users.csv") data.cache() data.createOrReplaceTempView(“users") display(data)
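In Spark 2 the csv source ships with Spark itself, so (assuming the same file) the read from the slide can be expressed without any external library:

    val users = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/datasets/samples/users.csv")

    users.cache()
    users.createOrReplaceTempView("users")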
  • 110. 110Spark 2 from Zinoviev Alexey Case #7 : How to measure Spark performance?
  • 111. 111Spark 2 from Zinoviev Alexey You'd measure performance!
  • 112. 112Spark 2 from Zinoviev Alexey TPCDS 99 Queries http://bit.ly/2dObMsH
  • 113. 113Spark 2 from Zinoviev Alexey
  • 114. 114Spark 2 from Zinoviev Alexey How to benchmark Spark
  • 115. 115Spark 2 from Zinoviev Alexey Special Tool from Databricks Benchmark Tool for SparkSQL https://github.com/databricks/spark-sql-perf
  • 116. 116Spark 2 from Zinoviev Alexey Spark 2 vs Spark 1.6
  • 117. 117Spark 2 from Zinoviev Alexey Case #8 : What’s faster: SQL or DataSet API?
  • 118. 118Spark 2 from Zinoviev Alexey Job Stages in old Spark
  • 119. 119Spark 2 from Zinoviev Alexey Scheduler Optimizations
  • 120. 120Spark 2 from Zinoviev Alexey Catalyst Optimizer for DataFrames
  • 121. 121Spark 2 from Zinoviev Alexey Unified Logical Plan DataFrame = Dataset[Row]
  • 122. 122Spark 2 from Zinoviev Alexey Bytecode
  • 123. 123Spark 2 from Zinoviev Alexey DataSet.explain() == Physical Plan == Project [avg(price)#43,carat#45] +- SortMergeJoin [color#21], [color#47] :- Sort [color#21 ASC], false, 0 : +- TungstenExchange hashpartitioning(color#21,200), None : +- Project [avg(price)#43,color#21] : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43]) : +- TungstenExchange hashpartitioning(cut#20,color#21,200), None : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L]) : +- Scan CsvRelation(-----) +- Sort [color#47 ASC], false, 0 +- TungstenExchange hashpartitioning(color#47,200), None +- ConvertToUnsafe +- Scan CsvRelation(----)
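A hedged guess at the kind of query behind that plan, only to show where explain() fits; diamonds is an assumed DataFrame with cut, color, price and carat columns loaded from CSV:

    val avgByCutColor = diamonds.groupBy("cut", "color").avg("price")

    avgByCutColor
      .join(diamonds, "color")
      .select("avg(price)", "carat")
      .explain()   // prints a physical plan similar to the one above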
  • 124. 124Spark 2 from Zinoviev Alexey Case #9 : Why does explain() show so many Tungsten things?
  • 125. 125Spark 2 from Zinoviev Alexey How to be effective with CPU • Runtime code generation • Exploiting cache locality • Off-heap memory management
  • 126. 126Spark 2 from Zinoviev Alexey Tungsten’s goal Push performance closer to the limits of modern hardware
  • 127. 127Spark 2 from Zinoviev Alexey Maybe something UNSAFE?
  • 128. 128Spark 2 from Zinoviev Alexey UnsafeRow Format • Bit set for tracking null values • Small values are inlined • Variable-length values store a relative offset into the variable-length data section • Rows are always 8-byte word aligned • Equality comparison and hashing can be performed on raw bytes without additional interpretation
  • 129. 129Spark 2 from Zinoviev Alexey Case #10 : Can I influence on Memory Management in Spark?
  • 130. 130Spark 2 from Zinoviev Alexey Case #11 : Should I tune generation’s stuff?
  • 131. 131Spark 2 from Zinoviev Alexey Cached Data
  • 132. 132Spark 2 from Zinoviev Alexey During operations
  • 133. 133Spark 2 from Zinoviev Alexey For your needs
  • 134. 134Spark 2 from Zinoviev Alexey For Dark Lord
  • 135. 135Spark 2 from Zinoviev Alexey IN CONCLUSION
  • 136. 136Spark 2 from Zinoviev Alexey We still lack… • the ability to join structured streaming with other sources • one unified ML API • a GraphX rethinking and redesign • custom encoders • Datasets everywhere • integration with something important
  • 137. 137Spark 2 from Zinoviev Alexey Roadmap • Support other data sources (not only S3 + HDFS) • Transactional updates • Dataset is one DSL for all operations • GraphFrames + Structured MLLib • Tungsten: custom encoders • The RDD-based API is expected to be removed in Spark 3.0
  • 138. 138Spark 2 from Zinoviev Alexey And we can DO IT! Real-Time Data-Marts Batch Data-Marts Relations Graph Ontology Metadata Search Index Events & Alarms Real-time Dashboarding Events & Alarms All Raw Data backup is stored here Real-time Data Ingestion Batch Data Ingestion Real-Time ETL & CEP Batch ETL & Raw Area Scheduler Internal External Social HDFS → CFS as an option Time-Series Data Titan & KairosDB store data in Cassandra Push Events & Alarms (Email, SNMP etc.)
  • 139. 139Spark 2 from Zinoviev Alexey First Part
  • 140. 140Spark 2 from Zinoviev Alexey Second Part
  • 141. 141Spark 2 from Zinoviev Alexey Contacts E-mail : Alexey_Zinovyev@epam.com Twitter : @zaleslaw @BigDataRussia Facebook: https://www.facebook.com/zaleslaw vk.com/big_data_russia Big Data Russia vk.com/java_jvm Java & JVM langs
  • 142. 142Spark 2 from Zinoviev Alexey Any questions?