1| gravyanalytics.com
Transitioning from Java to Scala for Spark
Guy DeCorte, Founder & CTO
Aaron Perrin, Senior Software Developer
March 13, 2019
2| gravyanalytics.com
Where we go is who we are.
REAL-WORLD CONSUMER BEHAVIOR
LIFE STAGES
LIFESTYLES
AFFINITIES
INTERESTS
The events consumers attend,
the places they visit,
where they spend their time,
translate into intelligence
3| gravyanalytics.com
We translate the locations that consumers visit, the places they go, and the
events they attend into real-world consumer intelligence
INDUSTRY-LEADING CAPABILITIES
4| gravyanalytics.com
GRAVY SOLUTIONS
GRAVY AUDIENCES
Gravy Audiences let marketers reach engaged consumers based on what they do in real life
• Lifestyle • Enthusiast • In-Market • Branded • Custom
GRAVY INSIGHTS
Gravy Insights provides brands with in-depth customer and competitive intelligence
• Foot Traffic • Competitive • Attribution
GRAVY DAAS
AdmitOne™ verified Visitation, Attendance, Event data and more for use in unique business applications
• Visitations • Attendances • IP Address • User Agent
5| gravyanalytics.com
Gravy’s patented AdmitOne verification engine delivers the
highest-quality location and attendance data in the industry
THE GRAVY DIFFERENCE
REACH
Billions of daily location signals from 250M+ mobile devices
EVENTS
The largest events database gives context to millions of places and POIs
VERIFIED
Confirmed, deterministic consumer attendances at places and events.
6| gravyanalytics.com
SOLUTION
[Architecture diagram. Labels: Geo-Signals; Cloud; Distribute; Filter & Verify; Merge; Spatial Index; LCO & Attendance Algorithm; Persona Generator; Device Processing; Attendances; Detail Records; Personas / Audiences; Devices; "Lots of Spark jobs!"; Datasets in S3; Snowflake; Zeppelin/EMR; SQL, R, Excel; Dashboards (Sisense); Matillion]
7| gravyanalytics.com
Some of the major Spark jobs that we run:
• Ingest
• Also validates, removes and/or flags data based on LDVS output
• Location and Device Verification Service (LDVS)
• Signal Merge / Device Merge
• Persona Generator
• Spatial Indexer
SUMMARY OF SPARK JOBS
8| gravyanalytics.com
What Does Our Platform Look Like?
9| gravyanalytics.com
• Environment
• We currently run ~30 Spark jobs daily
• On average, per hour: ~1300 cores and ~10 TiB memory
• AWS EMR (and spot instances to control costs)
• Data storage: S3 and Snowflake
• The Code (Platform)
• ~200k lines Java, ~30k lines Scala
• Strong domain-driven-design influence
• Many jobs can be run in Spark or stand-alone
• Central orchestration application
• Custom DAG scheduler
• Responsible for job scheduling, configuring, launching,
monitoring, and failure recovery
THE CORE PLATFORM
10| gravyanalytics.com
• 2015-2016
• Targets: 25M sources, 450M events per day (5500/sec)
• Java - Microservices, DDD, AWS (Kinesis/SQS/EC2/DynamoDB/Redshift/etc)
• 2016-2017
• Targets: 100M sources, 4B events per day (40,000/sec)
• Java - Hybrid: Spark 1.6 / Microservices (experiments with storage)
• 2017-2018
• Targets: 200M sources, 10B events per day (100,000/sec)
• Java - Spark 2.0 / DynamoDB / S3 / Snowflake
• 2018-2019+
• Targets: 400M+ sources, 25B+ events per day (300,000/sec)
• Scala - Spark 2.4 / DynamoDB / S3 / Snowflake
SOFTWARE ARCHITECTURE EVOLUTION
11| gravyanalytics.com
• We started using Spark before datasets were a thing
• The original Spark code was designed around RDDs
• As data scaled, we targeted (easy) ways to improve efficiency
• After Spark 2.0+, Datasets became more attractive
• What we did
• Reduced size of domain types to reduce memory overhead
• Refactored monolithic Spark jobs into specialized jobs
• Migrated JSON data to Parquet (with partitions)
• Transitioned from RDD API to Dataset API
FROM RDDs TO DATASETS AND MORE
12| gravyanalytics.com
• Transformations, aggregations, and filters
are easier with Datasets
• Improved Dataset performance from Spark
2.0 onward
• Datasets provide an abstraction layer
enabling optimized execution plans
• Easier, more fluent interface
• Datasets provide columnar optimization to
improve data and shuffling performance
• Enhanced functionality with functions._
• Support for SQL, when necessary
WHY DATASETS?
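A minimal sketch of the points above, assuming Spark 2.x with the spark-sql module on the classpath. The `Signal` case class, its field names, and the 60-second threshold are hypothetical and not taken from the Gravy codebase. It shows a typed filter, an aggregation using `functions._`, and the same query expressed in SQL.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.{count, lit}

// Hypothetical record type, used only for illustration.
case class Signal(deviceId: String, placeId: String, dwellSeconds: Long)

object DatasetSketch {
  // Typed filter plus an untyped aggregation from functions._
  def visitsPerPlace(spark: SparkSession, signals: Dataset[Signal]): Map[String, Long] = {
    import spark.implicits._ // encoders and the $"column" syntax
    signals
      .filter(_.dwellSeconds >= 60)      // keep signals that dwell at least a minute
      .groupBy($"placeId")
      .agg(count(lit(1)).as("visits"))   // count rows per place
      .as[(String, Long)]
      .collect()
      .toMap
  }

  // The same query through the SQL interface.
  def visitsPerPlaceSql(spark: SparkSession, signals: Dataset[Signal]): Map[String, Long] = {
    import spark.implicits._
    signals.createOrReplaceTempView("signals")
    spark
      .sql("SELECT placeId, COUNT(*) AS visits FROM signals WHERE dwellSeconds >= 60 GROUP BY placeId")
      .as[(String, Long)]
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dataset-sketch").getOrCreate()
    import spark.implicits._
    val signals = Seq(
      Signal("d1", "stadium", 5400L),
      Signal("d2", "stadium", 30L),
      Signal("d3", "cafe", 900L)
    ).toDS()
    println(visitsPerPlace(spark, signals))
    println(visitsPerPlaceSql(spark, signals))
    spark.stop()
  }
}
```

Both functions go through the same optimizer, so the choice between the fluent API and SQL is mostly one of readability.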
13| gravyanalytics.com
• The Dataset API is available in Java, so why
did we switch?
• Understanding Spark internals or modifying its
functionality was difficult without knowing Scala
• Scala is a cleanly-designed language
• We wanted to avoid the (often cumbersome) Java API
• Our initial experiments with Scala proved its ease of use
• Case classes resulted in easier serialization and better
shuffling performance
• Immutable types provided better garbage collection
• Use of Spark REPL enabled faster prototyping
• Scala's tools and libraries have matured significantly
• Lots of best practices available
• Understanding Scala gives the team a deeper understanding of
the underlying Spark code
WHY SCALA?
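As a small illustration of the case-class and immutability points above, here is a sketch in plain Scala with no Spark dependency. The `Attendance` type and its fields are hypothetical. A case class gets its constructor, equality, and `copy` for free, and an "update" produces a new value instead of mutating the old one.

```scala
// Hypothetical domain type, used only for illustration.
final case class Attendance(deviceId: String, eventId: String, minutes: Int)

object CaseClassSketch {
  // Returns an updated copy; the original value is never mutated.
  def extend(a: Attendance, extraMinutes: Int): Attendance =
    a.copy(minutes = a.minutes + extraMinutes)

  def main(args: Array[String]): Unit = {
    val first = Attendance("device-1", "event-42", 30)
    val longer = extend(first, 15)
    println(first.minutes)   // still 30: `first` is unchanged
    println(longer.minutes)  // 45
    // Case classes compare by field values, not by reference.
    println(first == Attendance("device-1", "event-42", 30))
  }
}
```

The equivalent Java bean would need hand-written (or generated) getters, `equals`, `hashCode`, and a copy constructor.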
14| gravyanalytics.com
• The switch was worth it - but it
wasn't without a cost
1. Lack of Experience
• Initially we had only one developer with
Scala experience
2. Large Amounts of Legacy Java Code
• We have taken a staged approach, but it is still a
large effort
3. Shift in Coding Mentality
• Embracing a more functional coding style
requires changing how we think about
problems
CHALLENGES: SCALA
15| gravyanalytics.com
AN EXAMPLE: JAVA RDD
16| gravyanalytics.com
AN EXAMPLE: SCALA DATASET
17| gravyanalytics.com
UNIT TESTING
• Transitioning from JUnit to
ScalaTest
• Lack of Experience
• Another scenario where the development team
needed to ramp up on new technology
• DataMapper
• We have a homegrown library called the
DataMapper which allows us to generate test data
at runtime from annotations on our unit tests
• The Java version of this library relied on
reflection and did not play nice with case classes
• Eventually we produced a Scala / ScalaTest
compatible trait-based version
18| gravyanalytics.com
HIRING/GOING FORWARD
• Driving home the fact that we are no longer a Java-only shop, we have modified our
job listings to include Scala as a preferred language prerequisite.
• Challenging at first to evaluate candidates' Scala skills as we were novices ourselves.
• As we continue to ramp up on Scala, we have started to branch out from using it only
for Spark to using it for web services (Play Framework) as well as to replace some of
our legacy utility libraries.
• We think we are now better positioned to quickly take advantage of newer features
coming down the Spark pipeline.
19| gravyanalytics.com
DISCUSSION
QUESTIONS?
20| gravyanalytics.com
• Greatly streamlined syntax
• Easier use with Spark
• Easy, fast serialization of case classes during shuffles
• Built-in Product type encoders
• Built-in tuple types
• Built-in anonymous functions
• Options instead of nulls
• Pattern matching instead of switch statements
• IntelliJ Scala support
• Simpler Futures
• “Duck-typing”
• Advanced reflection
• Functional exception handling
• Syntactic sugar
• Lots of helpers: Option, Try, Success, Failure, Either, etc.
• Everything is a function => more flexibility
• Easier generics (less type erasure)
Extra: Scala Likes
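Several of the items above (Options instead of nulls, pattern matching, and the `Try` / `Success` / `Failure` helpers) can be seen together in a short sketch. The lookup table, function names, and the 60-second threshold are hypothetical and chosen only for illustration.

```scala
import scala.util.{Failure, Success, Try}

object LikesSketch {
  // Hypothetical lookup table standing in for a places database.
  private val placeNames = Map("p1" -> "Stadium", "p2" -> "Cafe")

  // Option instead of null: a missing id yields None.
  def placeName(id: String): Option[String] = placeNames.get(id)

  // Functional exception handling: a bad number becomes a Failure value.
  def parseDwell(raw: String): Try[Int] = Try(raw.trim.toInt)

  // Pattern matching over both results at once, with a guard.
  def describe(id: String, rawDwell: String): String =
    (placeName(id), parseDwell(rawDwell)) match {
      case (Some(name), Success(s)) if s >= 60 => s"$name: visit of $s seconds"
      case (Some(name), Success(s))            => s"$name: pass-by of $s seconds"
      case (None, _)                           => s"unknown place $id"
      case (_, Failure(e))                     => s"bad dwell value: ${e.getMessage}"
    }

  def main(args: Array[String]): Unit = {
    println(describe("p1", "120"))
    println(describe("p2", " 5 "))
    println(describe("zz", "5"))
    println(describe("p1", "abc"))
  }
}
```

No branch touches a null or catches an exception explicitly; absence and failure are ordinary values that the match has to account for.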
21| gravyanalytics.com
• Untyped vals
• Lots of special symbols
• Library complexity
• Akka and Typesafe libraries
• JSON parsing libraries (incompatibility with Gson, complex Scala libs)
• Java compatibility
• Companion object wrapping
• Bean serialization
• Default to Seq for ordered collections (instead of ideal data structure for the job)
• Gradle vs. SBT
• Overuse of implicit “magic”
• Difficult learning curve (lots to learn!!)
• Too much flexibility can create inconsistent and confusing code
• Opaque compilation errors
• Missing Named Tuple (e.g. Python)
• Enumerations are broken
Extra: Scala Dislikes
22| gravyanalytics.com
• Immutable types instead of mutable types
• Collection syntax sugar
• Chaining functions causes lots of type headaches
• Syntactic sugar
• Using recursion (with @tailrec) instead of procedural
• Pattern matching
• Using small functions to keep code readable
• Reflection, type tags, and class tags
• Curried functions
• Partial functions
• Unfamiliar type system
• OO Paradigms don’t translate well (have to research correct way of doing things)
• Lots to learn!!
Extra: Scala challenges
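For readers coming from Java, three of the idioms listed above are sketched below in plain Scala: recursion checked by `@tailrec`, a curried function, and a partial function. All names and thresholds are hypothetical.

```scala
import scala.annotation.tailrec

object ChallengesSketch {
  // Recursion instead of a loop. @tailrec makes the compiler reject the
  // method if the recursive call is not in tail position.
  def total(xs: List[Long]): Long = {
    @tailrec
    def loop(rest: List[Long], acc: Long): Long = rest match {
      case Nil          => acc
      case head :: tail => loop(tail, acc + head)
    }
    loop(xs, 0L)
  }

  // Curried function: supplying only `min` yields a Long => Boolean.
  def atLeast(min: Long)(value: Long): Boolean = value >= min

  // Partial function: defined only for non-negative durations.
  val toMinutes: PartialFunction[Long, Long] = {
    case seconds if seconds >= 0 => seconds / 60
  }

  def main(args: Array[String]): Unit = {
    val dwell = List(30L, -5L, 600L)
    println(total(dwell))
    println(dwell.filter(atLeast(60)))
    // collect applies the partial function and skips inputs it is not defined for.
    println(dwell.collect(toMinutes))
  }
}
```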
23| gravyanalytics.com
Aaron Perrin, Senior Software Developer
703-840-8850
aperrin@gravyanalytics.com

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 

Transitioning from Java to Scala for Spark - March 13, 2019

  • 1. 1| gravyanalytics.com Transitioning from Java to Scala for Spark Guy DeCorte, Founder & CTO Aaron Perrin, Senior Software Developer March 13, 2019
  • 2. REAL-WORLD CONSUMER BEHAVIOR
      • Where we go is who we are.
      • The events consumers attend, the places they visit, where they spend their time, translates into intelligence
      • LIFE STAGES, LIFESTYLES, AFFINITIES, INTERESTS
  • 3. INDUSTRY-LEADING CAPABILITIES
      • We translate the locations that consumers visit, the places they go, and the events they attend into real-world consumer intelligence
  • 4. GRAVY SOLUTIONS
      • GRAVY AUDIENCES: Gravy Audiences let marketers reach engaged consumers based on what they do in real-life
          • Lifestyle
          • Enthusiast
          • In-Market
          • Branded
          • Custom
      • GRAVY INSIGHTS: Gravy Insights provides brands with in-depth customer and competitive intelligence
          • Foot Traffic
          • Competitive
          • Attribution
      • GRAVY DAAS: AdmitOne™ verified Visitation, Attendance, Event data and more for use in unique business applications
          • Visitations
          • Attendances
          • IP Address
          • User Agent
  • 5. THE GRAVY DIFFERENCE
      • Gravy’s patented AdmitOne verification engine delivers the highest-quality location and attendance data in the industry
      • REACH: Billions of daily location signals from 250M+ mobile devices
      • EVENTS: The largest events database gives context to millions of places and POIs
      • VERIFIED: Confirmed, deterministic consumer attendances at places and events.
  • 6. SOLUTION
      • [Architecture diagram. Labels: GEO-SIGNALS, CLOUD, Distribute, Filter & Verify, Merge, Spatial Index, LCO & Attendance Algorithm, Persona Generator, Attendances, Detail Records, Personas / Audiences, Devices, Device Processing, Datasets in S3, Snowflake, Zeppelin/EMR, SQL, R, Excel, Dashboards-Sisense, Matillion]
      • Lots of Spark jobs!
  • 7. SUMMARY OF SPARK JOBS
      • Some of the major Spark jobs that we run:
          • Ingest
              • Also validates, removes and/or flags data based on LDVS output
          • Location and Device Verification Service (LDVS)
          • Signal Merge / Device Merge
          • Persona Generator
          • Spatial Indexer
  • 8. What's Our Platform Look Like?
  • 9. THE CORE PLATFORM
      • Environment
          • We currently run ~30 Spark jobs daily
          • On average, per hour: ~1300 cores and ~10 TiB memory
          • AWS EMR (and spot instances to control costs)
          • Data storage: S3 and Snowflake
      • The Code (Platform)
          • ~200k lines Java, ~30k lines Scala
          • Strong domain-driven-design influence
          • Many jobs can be run in Spark or stand-alone
          • Central orchestration application
              • Custom DAG scheduler
              • Responsible for job scheduling, configuring, launching, monitoring, and failure recovery
  • 10. SOFTWARE ARCHITECTURE EVOLUTION
      • 2015-2016
          • Targets: 25M sources, 450M events per day (5500/sec)
          • Java - Microservices, DDD, AWS (Kinesis/SQS/EC2/DynamoDB/Redshift/etc)
      • 2016-2017
          • Targets: 100M sources, 4B events per day (40,000/sec)
          • Java - Hybrid: Spark 1.6 / Microservices (experiments with storage)
      • 2017-2018
          • Targets: 200M sources, 10B events per day (100,000/sec)
          • Java - Spark 2.0 / DynamoDB / S3 / Snowflake
      • 2018-2019+
          • Targets: 400M+ sources, 25B+ events per day (300,000/sec)
          • Scala - Spark 2.4 / DynamoDB / S3 / Snowflake
  • 11. FROM RDDs TO DATASETS AND MORE
      • We started using Spark before Datasets were a thing
      • The original Spark code was designed around RDDs
      • As data scaled, we targeted (easy) ways to improve efficiency
      • After Spark 2.0+, Datasets became more attractive
      • What we did
          • Reduced size of domain types to reduce memory overhead
          • Refactored monolithic Spark jobs into specialized jobs
          • Migrated JSON data to Parquet (with partitions)
          • Transitioned from RDD API to Dataset API
  • 12. WHY DATASETS?
      • Transformations, aggregations, and filters are easier with Datasets
      • Improved Dataset performance from Spark 2.0 onward
      • Datasets provide an abstraction layer enabling optimized execution plans
      • Easier, more fluent interface
      • Datasets provide columnar optimization to improve data and shuffling performance
      • Enhanced functionality with functions._
      • Support for SQL, when necessary
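The points on slide 12 can be illustrated with a small sketch. It assumes `spark-sql` is on the classpath; the `Signal` and `DeviceSummary` types, their fields, and the `summarize` helper are hypothetical names invented for this illustration and are not taken from the Gravy codebase.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.{count, max}

// Hypothetical record type; the fields are illustrative only.
final case class Signal(deviceId: String, lat: Double, lon: Double, epochSeconds: Long)

// Hypothetical aggregate row; field names must match the aggregated column names.
final case class DeviceSummary(deviceId: String, signals: Long, lastSeen: Long)

object DatasetSketch {
  // Filter, group, and aggregate through the Dataset API.
  // Column expressions (rather than lambdas) keep the plan visible to the optimizer.
  def summarize(spark: SparkSession, signals: Dataset[Signal]): Dataset[DeviceSummary] = {
    import spark.implicits._ // encoders for case classes and the $"col" syntax
    signals
      .filter($"lat".between(-90, 90) && $"lon".between(-180, 180))
      .groupBy($"deviceId")
      .agg(count("*").as("signals"), max($"epochSeconds").as("lastSeen"))
      .as[DeviceSummary]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dataset-sketch").getOrCreate()
    import spark.implicits._
    val input = Seq(
      Signal("a", 38.9, -77.0, 100L),
      Signal("a", 38.9, -77.1, 200L),
      Signal("b", 95.0, 0.0, 300L) // latitude out of range, dropped by the filter
    ).toDS()
    summarize(spark, input).show()
    spark.stop()
  }
}
```

Keeping the case classes at the top level matters here: Spark derives their encoders implicitly, which is the "built-in Product type encoders" convenience the deck mentions later.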
  • 13. WHY SCALA?
      • The Dataset API is available in Java, so why did we switch?
          • Understanding Spark internals or modifying its functionality was difficult without knowing Scala
          • Scala is a cleanly-designed language
          • We wanted to avoid the (often cumbersome) Java API
      • Our initial experiments with Scala proved its ease of use
          • Case classes resulted in easier serialization and better serialization and shuffling performance
          • Immutable types provided better garbage collection
          • Use of Spark REPL enabled faster prototyping
      • Scala's tools and libraries have matured significantly
          • Lots of best practices available
      • Understanding Scala gives the team a deeper understanding of the underlying Spark code
  • 14. CHALLENGES: SCALA
      • The switch was worth it - but it wasn't without a cost
          1. Lack of Experience
              • Initially we had only one developer with Scala experience
          2. Large Amounts of Legacy Java Code
              • We have taken a staged approach, still a large effort
          3. Shift in Coding Mentality
              • Embracing a more functional coding style requires changing how we think about problems
  • 17. UNIT TESTING
      • Transitioning from JUnit to ScalaTest
          • Lack of Experience
              • Another scenario where the development team needed to ramp up on new technology
          • DataMapper
              • We have a homegrown library called the DataMapper which allows us to generate test data at runtime from annotations on our unit tests
              • The Java version of this library relied on reflection and did not play nice with case classes
              • Eventually we produced a Scala / ScalaTest compatible trait-based version
  • 18. HIRING/GOING FORWARD
      • Driving home the fact that we are no longer a Java-only shop, we have modified our job listings to include Scala as a preferred language prerequisite.
      • It was challenging at first to evaluate candidates' Scala skills, as we were novices ourselves.
      • As we continue to ramp up on Scala, we have started to branch out from using it only for Spark to using it for web services (Play Framework) as well as to replace some of our legacy utility libraries.
      • We think we are now better positioned to quickly take advantage of newer features coming down the Spark pipeline.
  • 20. Extra: Scala Likes
      • Greatly streamlined syntax
      • Easier use with Spark
      • Easy, fast serialization of case classes during shuffles
      • Built-in Product type encoders
      • Built-in tuple types
      • Built-in anonymous functions
      • Options instead of nulls
      • Pattern matching instead of switch statements
      • IntelliJ Scala support
      • Simpler Futures
      • “Duck-typing”
      • Advanced reflection
      • Functional exception handling
      • Syntactic sugar
      • Lots of helpers: Option, Try, Success, Failure, Either, etc.
      • Everything is a function => more flexibility
      • Easier generics (less type erasure)
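Several of the likes on slide 20 (case classes, Options instead of nulls, pattern matching, and the `Try`/`Either` helpers) fit in one short sketch using only the Scala standard library. The `Visit` type, the `parseVisit` and `describe` helpers, and the 300-second threshold are all hypothetical and chosen purely for illustration.

```scala
import scala.util.{Failure, Success, Try}

object ScalaLikesSketch {
  // Hypothetical domain type: placeId is an Option rather than a nullable String.
  final case class Visit(deviceId: String, placeId: Option[String], dwellSeconds: Int)

  // Functional exception handling: Try captures the NumberFormatException.
  def parseDwell(raw: String): Try[Int] = Try(raw.trim.toInt)

  // Pattern matching with guards instead of a switch statement.
  // The 300-second cutoff is an arbitrary value for this example.
  def describe(visit: Visit): String = visit match {
    case Visit(_, None, _)                    => "unmatched"
    case Visit(_, Some(place), d) if d >= 300 => s"attended $place"
    case Visit(_, Some(place), _)             => s"passed $place"
  }

  // Either carries a readable error on the Left and the parsed value on the Right.
  def parseVisit(deviceId: String, place: String, dwell: String): Either[String, Visit] =
    parseDwell(dwell) match {
      case Success(d) if d >= 0 =>
        // Option(...) maps null to None; the filter also treats "" as missing.
        Right(Visit(deviceId, Option(place).filter(_.nonEmpty), d))
      case Success(d) => Left(s"negative dwell: $d")
      case Failure(e) => Left(s"bad dwell '$dwell': ${e.getClass.getSimpleName}")
    }
}
```

For example, `ScalaLikesSketch.parseVisit("d1", "p9", "600").map(ScalaLikesSketch.describe)` yields `Right("attended p9")`, while a non-numeric dwell yields a `Left` with the error text instead of throwing.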
  • 21. Extra: Scala Dislikes
      • Untyped vals
      • Lots of special symbols
      • Library complexity
          • Akka and Typesafe libraries
          • JSON parsing libraries (incompatibility with Gson, complex Scala libs)
      • Java compatibility
          • Companion object wrapping
          • Bean serialization
      • Default to Seq for ordered collections (instead of ideal data structure for the job)
      • Gradle vs. SBT
      • Overuse of implicit “magic”
      • Difficult learning curve (lots to learn!!)
      • Too much flexibility can create inconsistent and confusing code
      • Opaque compilation errors
      • Missing named tuples (as in Python)
      • Enumerations are broken
  • 22. Extra: Scala challenges
      • Immutable types instead of mutable types
      • Collection syntax sugar
      • Chaining functions causes lots of type headaches
      • Syntactic sugar
      • Using recursion (with @tailrec) instead of procedural
      • Pattern matching
      • Using small functions to keep code readable
      • Reflection, type tags, and class tags
      • Curried functions
      • Partial functions
      • Unfamiliar type system
      • OO Paradigms don’t translate well (have to research correct way of doing things)
      • Lots to learn!!
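Three of the idioms listed on slide 22 tend to trip up developers coming from Java: tail recursion, curried functions, and partial functions. A minimal sketch of each follows, using only the standard library; the function names and thresholds are hypothetical.

```scala
import scala.annotation.tailrec

object ScalaChallengesSketch {
  // Recursion with @tailrec instead of a mutable loop. The annotation makes the
  // compiler reject the method if the recursive call is not in tail position.
  def totalDwell(dwells: List[Int]): Long = {
    @tailrec
    def loop(rest: List[Int], acc: Long): Long = rest match {
      case Nil          => acc
      case head :: tail => loop(tail, acc + head)
    }
    loop(dwells, 0L)
  }

  // Curried function: supplying only the first argument list yields a predicate.
  def atLeast(min: Int)(value: Int): Boolean = value >= min

  // Partial function: defined only for dwell times of 60 seconds or more.
  // The cutoffs are arbitrary values for this example.
  val label: PartialFunction[Int, String] = {
    case d if d >= 300 => "long"
    case d if d >= 60  => "short"
  }
}
```

With `val dwells = List(30, 90, 600)`, `dwells.filter(ScalaChallengesSketch.atLeast(60))` keeps `90` and `600`, and `dwells.collect(ScalaChallengesSketch.label)` silently skips `30` because the partial function is not defined there.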
  • 23. Aaron Perrin, Senior Software Developer, 703-840-8850, aperrin@gravyanalytics.com