Scala and Spark:
Coevolving Ecosystems
for Big Data
John Nestor
47 Degrees
Datapalooza Seattle
February 10-11, 2016
www.47deg.com
Outline
• Scala
• Spark
• Scala Impact on Spark
• Spark Impact on Scala
• Summary
• Questions
Scala
Scala History
• Scala created by Martin Odersky, EPFL, Switzerland
• wrote javac, the compiler for Java
• designed Java generics
• 2004 Scala announced
• 2006 Scala 2.0
• 2011 Typesafe started, Scala 2.9
• 2012 Scala 2.10
• 2014 Scala 2.11
• 2015 Scala 2.12 milestone releases
• 2016 Scala now 12 years old and quite mature
Why Scala?
• Open Source
• Strong typing
• Concise elegant syntax
• Runs on JVM (Java Virtual Machine)
• Supports both object-oriented and functional programming
• Scales from small, simple programs (REPL) to very large
multi-server systems
• Easy to cleanly extend with new libraries and DSLs
• Ideal for parallel and distributed systems
Scala: Strong Typing and Concise Syntax
• Strong typing like Java.
• Compile time checks
• Better modularity via strongly typed interfaces
• Easier maintenance: types make code easier to
understand
• Concise syntax like Python.
• Type inference. Compiler infers most types that had to be
explicit in Java.
• Powerful syntax that avoids much of the boilerplate of Java
code (see next slide).
• Best of both worlds: safety of strong typing with conciseness
(like Python).
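As a minimal sketch of the two points above (compile-time checks combined with inferred types), the following fragment leaves most annotations to the compiler. The object name TypeInferenceSketch and the sample values are illustrative assumptions, not material from the talk.

```scala
object TypeInferenceSketch {
  // The compiler infers List[Int]; no annotation is required.
  val ages = List(30, 41, 25)

  // Inferred as Map[String, Int].
  val ageByName = Map("Joe" -> 30, "Ann" -> 41)

  // Parameter types are always written out; the result type is optional
  // here but is kept as documentation for a public method.
  def average(xs: List[Int]): Double = xs.sum.toDouble / xs.size

  // Types are still checked at compile time. Uncommenting the next line
  // makes compilation fail with a type mismatch:
  // val wrong: Int = "thirty"

  def main(args: Array[String]): Unit = {
    println(average(ages))
    println(ageByName("Joe"))
  }
}
```

The only explicit types are on the method signature, which is roughly where Java would also require them; everything else is inferred.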
Scala Case Class
• Java version

class User {
    private String name;
    private int age;

    public User(String name, int age) {
        this.name = name;
        this.age = age;
    }

    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}

User joe = new User("Joe", 30);

• Scala version

case class User(name: String, var age: Int)
val joe = User("Joe", 30)
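A short sketch of what the one-line Scala version provides beyond the Java class: a generated toString, structural equality, copy, and pattern matching. The describe helper and the object name CaseClassSketch are hypothetical additions for illustration.

```scala
// The case class from the slide.
case class User(name: String, var age: Int)

object CaseClassSketch {
  // Pattern matching uses the extractor generated for the case class.
  def describe(user: User): String = user match {
    case User(name, age) if age >= 18 => s"$name is an adult"
    case User(name, _)                => s"$name is a minor"
  }

  def main(args: Array[String]): Unit = {
    val joe = User("Joe", 30)          // no `new` needed
    println(joe)                       // generated toString: User(Joe,30)
    println(joe == User("Joe", 30))    // structural equality: true
    val older = joe.copy(age = 31)     // copy with one field changed
    joe.age = 32                       // allowed only because age is a var
    println(describe(older))           // Joe is an adult
  }
}
```

Dropping the var keyword would make User fully immutable, which is the more common choice in Scala code.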
Key Scala Features
• Immutable Collections
• Seq, Set, Map
• Functional Programming
• (i:Int) => i + 1
• Functions can be parameters and results
• For collections: map, filter, flatMap, groupBy, …
• Case Classes
• case class Person(name:String, age:Int)
• Define Domain Specific Languages (DSLs)
• Akka, SBT, Spray, ScalaTest and now Spark
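The features listed above can be combined in a few lines. This is a sketch only; the names CollectionsSketch, applyTwice, and the sample people are assumptions made for the example.

```scala
object CollectionsSketch {
  // A function value, written as on the slide.
  val increment: Int => Int = (i: Int) => i + 1

  // A function that takes a function as a parameter and returns a new one.
  def applyTwice(f: Int => Int): Int => Int = x => f(f(x))

  case class Person(name: String, age: Int)

  // Seq(...) builds an immutable List by default; the operations below
  // return new collections and leave `people` unchanged.
  val people: Seq[Person] =
    Seq(Person("Ann", 41), Person("Joe", 30), Person("Sam", 30))

  val olderThan35: Seq[String] = people.filter(_.age > 35).map(_.name)

  val namesByAge: Map[Int, Seq[String]] =
    people.groupBy(_.age).map { case (age, group) => age -> group.map(_.name) }

  val words: Seq[String] =
    Seq("big data", "fast data").flatMap(_.split(" ").toSeq)

  def main(args: Array[String]): Unit = {
    println(applyTwice(increment)(1))
    println(olderThan35)
    println(namesByAge)
    println(words)
  }
}
```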
View Sample Scala Code
• Word Count in Scala
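The code shown in the talk is not part of this transcript. One plausible version of a word count over an in-memory string, using only the standard library, might look like the following; the tokenization rule (lower-case, split on non-word characters) is an assumption.

```scala
object WordCount {
  // Counts how often each word occurs in the given text.
  def wordCount(text: String): Map[String, Int] =
    text.toLowerCase
      .split("\\W+")            // split on runs of non-word characters
      .filter(_.nonEmpty)       // a leading separator yields an empty token
      .groupBy(identity)        // word -> all of its occurrences
      .map { case (word, occurrences) => word -> occurrences.length }

  def main(args: Array[String]): Unit =
    wordCount("to be or not to be").foreach { case (word, n) =>
      println(s"$word: $n")
    }
}
```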
Spark
Spark History
• 2009 Mesos developed at UC Berkeley AMPLab
• 2009 Matei Zaharia starts Spark as a test case for
Mesos
• 2012 Spark 0.5
• 2014 Spark 1.0, Top Level Apache Project
• 2014 Databricks started
• 2016 Spark 1.6
Why Spark?
• Support for not only batch but also (near) real-time
• Fast - keeps data in memory as much as possible
• Often 10x to 100x faster than Hadoop MapReduce
• A clean easy-to-use API
• A richer set of functional operations than just map and
reduce
• A foundation for a wide set of integrated data
applications
• Can recover from failures by recomputation or (optionally)
replication
• Scales to very large data sets and reduces processing time
Spark Components
• Spark Core
• Scalable multi-node cluster
• Failure detection and recovery
• RDDs, DataFrames and functional operations
• MLlib - for machine learning
• linear regression, SVMs, clustering, collaborative
filtering, dimension reduction
• more on the way!
• GraphX - for graph computation
• Spark Streaming - for near-real-time processing
Spark RDDs
• RDD[T] - resilient distributed dataset
• typed
• immutable
• ordered
• can be processed in parallel
• lazy evaluation - permits more global optimizations
• Rich set of functional operations (a small sample)
• map, flatMap, reduce, …
• filter, groupBy, sortBy, take, distinct, …
Spark Data Frames
• Similar to SQL tables
• can be transformed using SQL
• Focus of much of the work on Spark performance
optimization
• Unlike with RDDs, the optimizer knows about fields
• Dynamically typed
• Not a natural fit for Scala (more on this later)
View Sample Spark Code - RDDs
• Word Count for Spark written in Scala
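The sample shown in the talk is not included in this transcript. A minimal sketch of an RDD word count, assuming spark-core is on the classpath and a local master is acceptable, could look like this; the object name, the wordCounts helper, and the in-memory input are assumptions for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  // Builds the counts on the cluster and collects them to the driver.
  def wordCounts(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)                       // RDD[String]
      .flatMap(_.toLowerCase.split("\\W+"))     // RDD of words
      .filter(_.nonEmpty)
      .map(word => (word, 1))                   // pair RDD: (word, 1)
      .reduceByKey(_ + _)                       // sum counts per word, in parallel
      .collect()                                // action: triggers evaluation
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("word-count-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try wordCounts(sc, Seq("to be or not to be")).foreach(println)
    finally sc.stop()
  }
}
```

The transformations mirror the Scala collection operations, except that the grouping step becomes reduceByKey so counts can be combined on each worker before any data is shuffled.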
View Sample Spark Code - Data Frames
• Tweet language count using Data Frames written in
Scala
Scala Impact on Spark
Scala Impact on Spark
• Written in Scala
• Scala is the primary API
• Must use Scala to extend Spark
• Source code as documentation: Scala code
• Design borrowed from Scala
• functional programming
• immutable collections
• Spark is a Scala DSL
Spark Application Language Choices - 1
• Share of Spark users by language: 71% Scala, 58% Python (many use more than one)
• Code length similar
• Scala faster for RDDs (no need to move serialized
data between JVM and Python)
• Scala is generally faster
• Scala provides strong typing
• Compile time checks
• Code is easier to understand
• Types help when writing and maintaining code
Spark Application Language Choices- 2
• In addition to Scala and Python, Spark also supports Java and R
• If you want to build scalable, high-performance production
code based on Spark:
• R by itself is too specialized
• Python is too slow
• Java is tedious to write and maintain
• Scala is “just right”
Transformations: Parallel versus Serial
• Scala has operations that traverse sequence elements in serial
order: foldLeft, scanLeft
• Spark RDDs can only process sequence elements in parallel
• Enables full use of multiple workers with multiple cores
• Serial operations like foldLeft would be slow compared to
current parallel operations
• But there are cases where foldLeft or scanLeft is really needed
• Example: time series where an early event can affect a later event
• Example: Sherlock, word count by story
• So should Spark add foldLeft or similar operators?
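To make the time-series case concrete, here is a sketch in plain Scala where the order of events matters: a balance that is never allowed to drop below zero. The Event type and the clamping rule are hypothetical, chosen because the step function is not associative and so cannot simply be handed to a parallel reduce.

```scala
object SerialFoldSketch {
  final case class Event(time: Int, delta: Int)

  // One step of the rule: the balance can never drop below zero.
  private def step(balance: Int, event: Event): Int =
    math.max(0, balance + event.delta)

  // scanLeft keeps every intermediate state, in time order.
  def runningBalance(events: Seq[Event]): Seq[Int] =
    events.sortBy(_.time).scanLeft(0)(step).tail

  // foldLeft keeps only the final state.
  def finalBalance(events: Seq[Event]): Int =
    events.sortBy(_.time).foldLeft(0)(step)

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(1, 5), Event(2, -8), Event(3, 4))
    println(runningBalance(events)) // 5, then clamped to 0, then 4
    println(finalBalance(events))   // 4, whereas a plain sum would give 1
  }
}
```

Because the early withdrawal is clamped, the final balance differs from the order-independent sum, which is the kind of dependency the slide has in mind.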
Sample Spark Internal Code
• Let's examine the internal code in Spark
• SparkContext
• RDD
Spark Impact on Scala
Pair RDDs for Scala
Scala       Sequence   Key Value
Unique      Set        Map
Duplicate   Seq        ??? (Odersky)
Spark       Sequence   Key Value
Unique
Duplicate   RDD        Pair RDD
.distinct
Lazy Evaluation
• Strict evaluation (Scala)
• Each transformation is evaluated as it is seen
• Easy to understand and debug
• Lazy evaluation (Spark)
• Transformations are collected into a lineage graph
• Evaluated only when a final action is applied
• Allows cross-transformation optimization
Lazy Evaluation in Scala
• Scala has Stream (a kind of sequence)
• Elements are evaluated in order as needed
• Allows infinite collections
• But no cross-transformation optimization
• Spark's lazy evaluation processes all elements at once
• Enables parallelism
• Scala may add lazy versions of its collections that are
more like Spark (Odersky)
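The difference between strict and lazy evaluation can be observed by counting how often a function is called. This sketch substitutes Iterator for the Stream mentioned on the slide, because Iterator behaves the same across Scala versions (Stream is deprecated in newer releases); the names LazySketch and countEvaluations are assumptions.

```scala
object LazySketch {
  // Returns (calls made by the strict pipeline, calls made by the lazy one).
  def countEvaluations(): (Int, Int) = {
    var evaluated = 0
    def inc(i: Int): Int = { evaluated += 1; i + 1 }

    // Strict: map runs over all five elements before take is applied.
    val strict = List(1, 2, 3, 4, 5).map(inc).take(2)
    val strictCalls = evaluated

    // Lazy: elements are produced on demand, so an infinite source is fine
    // and only the two requested elements are ever computed.
    evaluated = 0
    val onDemand = Iterator.from(1).map(inc).take(2).toList
    val lazyCalls = evaluated

    assert(strict == onDemand) // both are List(2, 3)
    (strictCalls, lazyCalls)
  }

  def main(args: Array[String]): Unit =
    println(countEvaluations()) // (5,2)
}
```

As on the slide, this on-demand evaluation proceeds element by element in order; it does not reorder or fuse transformations the way a Spark lineage graph allows.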
47deg.com © Copyright 2015 47 Degrees
Spark Clusters - Serialization
[Diagram: Your Spark App (the Spark Driver, holding the Spark Context) talks to the Cluster Manager and to Worker Nodes; each Worker Node runs an Executor with several Tasks. Static code is shipped as a jar file; dynamic code is shipped as closures.]
Dynamic Code Serialization
• Depends on values known at run-time
• Often a function passed to Spark transformations such
as map, filter, and flatMap
• Sent from driver to workers
• Serialized on driver to byte array
• Deserialized on each worker
Serialization of Closures
• Closures often include more than needed (or expected)
• Demo
• Scala may become smarter about including less in
closure serialization (Odersky)
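The demo from the talk is not in this transcript. The effect can be approximated without Spark by using plain Java serialization, which stands in here for the step where the driver turns a closure into a byte array. The Multiplier class and the canSerialize helper are hypothetical; the behavior described assumes a Scala version in which function literals are serializable (2.11 and later).

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Deliberately NOT Serializable, like many application classes.
class Multiplier(val factor: Int) {
  // Reads the field `factor`, so the closure captures `this`,
  // dragging the whole Multiplier instance along.
  def capturesThis: Int => Int = (i: Int) => i * factor

  // Copies the field to a local val first, so only an Int is captured.
  def capturesValue: Int => Int = {
    val localFactor = factor
    (i: Int) => i * localFactor
  }
}

object ClosureSketch {
  // True if the object (and everything it references) can be serialized.
  def canSerialize(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  def main(args: Array[String]): Unit = {
    val m = new Multiplier(3)
    println(canSerialize(m.capturesThis))  // expected false: `this` comes along
    println(canSerialize(m.capturesValue)) // expected true: only an Int is captured
  }
}
```

Copying a field into a local val before building the closure is a common workaround in Spark jobs for exactly this reason.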
Datasets
• DataFrames are becoming the focus of Spark
optimization
• DataFrames are dynamically typed
• The Scala API would be nicer with static typing
• Datasets (new and experimental in Spark 1.6) attempt to
solve this
• Sample Code
• It would be nice if Databricks and Typesafe could work
together to produce a better solution
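The sample code from the talk is not in this transcript. The contrast between the two APIs can be sketched as follows, assuming spark-sql is on the classpath and using the SparkSession entry point from Spark 2.x and later rather than the SQLContext of Spark 1.6. The Person class and the helper names are assumptions for the example.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

// Defined at top level so Spark can derive an encoder for it.
case class Person(name: String, age: Int)

object DatasetSketch {
  // Dataset: fields are checked by the Scala compiler.
  // Writing `_.agee` here would fail to compile.
  def namesOlderThan(spark: SparkSession, people: Seq[Person], minAge: Int): Seq[String] = {
    import spark.implicits._
    people.toDS().filter(_.age > minAge).map(_.name).collect().toSeq
  }

  // DataFrame: column names are strings, so a typo compiles and is
  // only rejected when Spark analyzes the query at run time.
  def misspelledColumnRejected(spark: SparkSession, people: Seq[Person]): Boolean = {
    import spark.implicits._
    val df = people.toDF()
    try { df.select("agee"); false }
    catch { case _: AnalysisException => true }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-sketch").master("local[*]").getOrCreate()
    try {
      val people = Seq(Person("Ann", 41), Person("Joe", 30))
      println(namesOlderThan(spark, people, 35))
      println(misspelledColumnRejected(spark, people))
    } finally spark.stop()
  }
}
```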
Scala and Spark in
Seattle
Scala and Spark in Seattle
• Seattle Meetups
• Scala at the Sea Meetup (over 1000 members)
http://www.meetup.com/Seattle-Scala-User-Group/
• Seattle Spark Meetup (over 1400 members)
http://www.meetup.com/Seattle-Spark-Meetup/
• 47 Degrees Training (Seattle and Worldwide)
http://www.47deg.com/events#training
• Typesafe Scala Training: Scala, Akka
• Spark Training: Programming Spark with Scala
• UW Scala Professional Certificate Program
http://www.pce.uw.edu/certificates/scala-functional-reactive-programming.html
Summary
Summary
• Scala and Spark are great technologies for big data applications
• Both are functional
• Both have immutable data
• Both can be used to build very large systems
• Typesafe and Databricks have an evolving relationship
• Databricks is a consumer of Typesafe components
• Typesafe now supports Spark in addition to the Scala
compiler and other Scala components
• They are working together to evolve toward an ever more
coordinated and integrated ecosystem for big data
• Martin Odersky: "Spark - The Ultimate Scala Collections"
Questions