Productionalizing a Spark
application
Productionalizing an application on a frequently
evolving framework like Spark
● Shashank L
● Big data consultant and trainer at
datamantra.io
● www.shashankgowda.com
Agenda
● Financial analytics
● Requirements
● Architecture
● Initial solution
● RDD to Dataframe API
● Code quality and testing
● Architectural changes
● Future improvements
● Lookback
Financial Analytics
Financial analytics is used to predict the stock
prices for a specific company using its historical
price information
Architecture
Stocks data (daily basis)
SQL Server
ETL pipeline
HDFS
Data preprocessing
Data analytics
NoSQL
Frontend (dashboard)
Our team
● Data scientists
○ Coming up with the new magic
● Data engineers
○ Productionalizing the magic on large datasets
● Front end developer
○ Consumes the results and makes them presentable to clients
Requirements
● Developers spread across geographies
● Developers with varied backgrounds on the team
● Better code quality
● Better testing mechanisms
● Easier team expansion
● Lower infrastructure maintenance overhead
● Use of the latest available libraries
Iteration 1
Initial solution
Iteration 1
● Data scientists
○ They were well versed in Python or SQL
○ They did their analysis in Python using pandas DataFrame code
○ Analyses were tested only on small data sets
● Data engineers
○ Used Spark 0.9
○ Ported the Python code to the Scala RDD API so the analysis could scale to big data
○ Built a custom framework that could read from and write to multiple sources (File, Hive table, S3, JDBC)
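The multi-source support of such a framework could be modelled roughly as follows. This is only a sketch: `DataSource`, its four case classes and `describe` are hypothetical names invented for illustration, not the team's actual framework.

```scala
// Hypothetical sketch: every name here is an assumption made for illustration.
sealed trait DataSource
final case class FileSource(path: String) extends DataSource
final case class HiveTableSource(database: String, table: String) extends DataSource
final case class S3Source(bucket: String, key: String) extends DataSource
final case class JdbcSource(url: String, table: String) extends DataSource

object DataSource {
  /** Render a source as one location string that a reader or writer could dispatch on. */
  def describe(source: DataSource): String = source match {
    case FileSource(path)           => s"file://$path"
    case HiveTableSource(db, table) => s"hive://$db.$table"
    case S3Source(bucket, key)      => s"s3://$bucket/$key"
    case JdbcSource(url, table)     => s"$url#$table"
  }
}
```

Because the trait is sealed, the compiler warns when a new source type is added without a matching case, which keeps readers and writers in step with the list of supported sources.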
Data engineers
Architecture
Stocks data
(Daily basis)
Sql Server
ETL - Pipeline
HDFS
Data
preprocessing
Data Analytics NoSQL
Frontend
(Dashboard)
Analysis
(Python)
Data scientists
Challenges
● Framework challenges
○ Porting code from one language to another introduced many inaccuracies
○ Differences in language constructs and APIs forced changes to the code design
● Architectural challenges
○ The team's clusters were created and maintained manually
○ Intermediate data was saved in a text-based CSV format
Iteration 2
RDD API to Dataframe API
Iteration 2
● Upgrade to Spark 1.3
● Data scientists
○ The newly introduced DataFrame API was a more familiar interface for data scientists
○ The SQL API made simple operations easier for data scientists
○ Zeppelin let data scientists prototype the analytical algorithms
● Data engineers
○ Switched the intermediate format from CSV to Parquet
○ Moved to an Amazon EMR based Hadoop cluster with Spark on it
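A CSV-to-Parquet conversion step might look like the sketch below. It assumes the `SparkSession` API from Spark 2.x onwards, which is later than the 1.3 release named on this slide; the object name `CsvToParquet`, the paths and the header and schema-inference options are illustrative assumptions rather than the team's code.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object CsvToParquet {
  /** Read a CSV file with a header row and rewrite it as Parquet, replacing any previous output. */
  def convert(spark: SparkSession, csvPath: String, parquetPath: String): Unit =
    spark.read
      .option("header", "true")      // first line holds the column names
      .option("inferSchema", "true") // sample the data to choose column types
      .csv(csvPath)
      .write
      .mode(SaveMode.Overwrite)
      .parquet(parquetPath)

  def main(args: Array[String]): Unit = {
    // Local master is an assumption for trying the sketch outside a cluster.
    val spark = SparkSession.builder().master("local[*]").appName("csv-to-parquet").getOrCreate()
    try convert(spark, args(0), args(1))
    finally spark.stop()
  }
}
```

Parquet stores the schema alongside the data in a columnar layout, so downstream steps do not have to re-parse text or re-infer column types on every read.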
Data science cluster
Data engineer Architecture
Stocks
data
ETL HDFS
Zeppelin
Dashboard
Data Analytics
(PySpark)
Data engineering cluster
Data
preprocessing
Data Analytics NoSQL
Challenges
● Quality challenges
○ Productionalizing multiple analyses required expanding the data engineering team
○ Team expansion introduced code quality issues and bugs
○ There were no unit tests for individual pieces of functionality
○ There was no review process for code changes
Iteration 3
Code quality and testing
Iteration 3
● Created unit test cases for every analysis
● Made the test suite more readable by using ScalaTest (http://www.scalatest.org/)
● Added unit tests for small pieces of functionality and flow tests that run the full ETL flow on sampled data
● Introduced a review process for code changes through GitHub pull requests
● Set up a daily Jenkins build to test the flow and the functionality
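The difference between a unit test and a flow test can be sketched without any dependencies. The `ReturnsFlow` object, its functions and the sample rows below are hypothetical, and plain `assert` calls stand in for ScalaTest matchers to keep the example self-contained.

```scala
import scala.util.Try

object ReturnsFlow {
  /** Keep only well-formed "symbol,date,close" rows; malformed rows are dropped. */
  def parse(lines: Seq[String]): Seq[(String, String, Double)] =
    lines.map(_.split(",")).collect {
      case Array(symbol, date, close) if Try(close.toDouble).isSuccess =>
        (symbol, date, close.toDouble)
    }

  /** Day-over-day simple returns: (today - yesterday) / yesterday. */
  def dailyReturns(closes: Seq[Double]): Seq[Double] =
    closes.zip(closes.drop(1)).map { case (prev, cur) => (cur - prev) / prev }

  /** Full flow: parse, group by symbol, order by date, compute returns. */
  def run(lines: Seq[String]): Map[String, Seq[Double]] =
    parse(lines).groupBy(_._1).map { case (symbol, rows) =>
      symbol -> dailyReturns(rows.sortBy(_._2).map(_._3))
    }

  private def near(a: Double, b: Double): Boolean = math.abs(a - b) < 1e-9

  def main(args: Array[String]): Unit = {
    // Unit test: one small function checked in isolation.
    val returns = dailyReturns(Seq(100.0, 110.0, 99.0))
    assert(returns.length == 2 && near(returns(0), 0.1) && near(returns(1), -0.1))

    // Flow test: the whole chain on a tiny sample, including an unordered and a bad row.
    val sample = Seq("ACME,2016-01-05,110.0", "ACME,2016-01-04,100.0", "broken row")
    val out = run(sample)
    assert(out.keySet == Set("ACME") && near(out("ACME").head, 0.1))
    println("all checks passed")
  }
}
```

Keeping the analysis logic in pure functions like these lets most tests run without a cluster, while the flow test exercises the same code path end to end on a small sample.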
ScalaTest
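// ScalaTest 3.0.x and earlier; 3.1+ uses flatspec.AnyFlatSpec and matchers.should.Matchers
import org.scalatest.{FlatSpec, Matchers}
import scala.collection.mutable.Stack

class ExampleSpec extends FlatSpec with Matchers {

  // Elements come off the stack in the reverse of the order they were pushed
  "A Stack" should "pop values in last-in-first-out order" in {
    val stack = new Stack[Int]
    stack.push(1)
    stack.push(2)
    stack.pop() should be (2)
    stack.pop() should be (1)
  }

  // Popping an empty stack must fail rather than return a default value
  it should "throw NoSuchElementException if an empty stack is popped" in {
    val emptyStack = new Stack[Int]
    a [NoSuchElementException] should be thrownBy {
      emptyStack.pop()
    }
  }
}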
Github PR
Challenges
● Architectural challenges
○ Cluster resources were a bottleneck for the teams
○ Amazon EMR clusters could not be treated as throwaway clusters because data was stored in HDFS
○ Upgrading the Spark version on the cluster was difficult
○ There was no infrastructure for running scheduled jobs, and Jenkins was not the best way to schedule them
○ Zeppelin had stability issues
Iteration 4
Architectural changes
Iteration 4
● Moved the data storage from HDFS to S3
● Moved to the Databricks cloud environment (https://databricks.com/product/databricks)
● Databricks cloud provides a notebook-based interface for writing Spark code in Scala, Java, Python and R
● Encouraged data scientists to use the Scala API
● Adopted Travis for deployment and testing
Databricks cloud
● Cluster config
○ Launch, configure, scale and terminate
Databricks cloud
● Jobs
○ Schedule complex workflows
Databricks cloud
● Notebooks
○ Explore, visualize and share
Improvements
● Data engineers
○ The cluster bottleneck was solved by creating multiple throwaway clusters when needed
○ No need to stick with one cluster for a long time, since primary data storage was S3
○ Terminating clusters when they are not in use is cost efficient
○ Multiple clusters with different Spark versions let users try out the latest Spark features
○ Less cluster maintenance and tuning overhead
Improvements
● Data engineers
○ Shorter turnaround time in understanding bottlenecks in the workflows
○ Databricks cloud Jobs can be used for scheduling workflows and daily runs
○ Travis enabled strict and immediate code testing
● Data scientists
○ Data scientists can easily share notebooks and analysis results with the team
○ Ability to write in multiple languages
DATABRICKS CLOUD
Jobs
Architecture
Dashboard
NoSQL
S3
ETL
Stocks
data
Datascience
cluster
Notebook
(R/Python)
DataEngg
cluster1
Notebook
(Scala)
DataEngg
cluster2
Notebook
(Scala)
Challenges
● Framework challenges
○ The schema is static and doesn't change frequently
○ The DataFrame API has no static schema check
○ The pipeline fails in the middle of processing if there is any change in the data
○ The current window analysis uses Scala constructs to load a specific set of data into memory and run ML on top of it
○ Domain-object-based functions are currently called from inside UDFs
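One way to soften the mid-pipeline failures described above is to compare the incoming schema with the expected one before any processing starts. The sketch below uses Spark's `StructType`; `SchemaGuard`, the expected columns and the message wording are assumptions made for illustration, not part of the original pipeline.

```scala
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object SchemaGuard {
  // Hypothetical expected schema for the daily stock data.
  val expected: StructType = StructType(Seq(
    StructField("symbol", StringType),
    StructField("date", StringType),
    StructField("close", DoubleType)
  ))

  /** One message per missing or mistyped column; an empty result means the schemas are compatible. */
  def problems(actual: StructType, wanted: StructType): Seq[String] =
    wanted.fields.toSeq.flatMap { want =>
      actual.fields.find(_.name == want.name) match {
        case None =>
          Seq(s"missing column '${want.name}'")
        case Some(got) if got.dataType != want.dataType =>
          Seq(s"column '${want.name}' is ${got.dataType.simpleString}, expected ${want.dataType.simpleString}")
        case _ =>
          Seq.empty
      }
    }
}
```

A job could call `SchemaGuard.problems(df.schema, SchemaGuard.expected)` right after loading a DataFrame `df` and stop immediately if the result is non-empty, so a changed input is reported at the start rather than partway through the run.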
Iteration 5
Road ahead
Iteration 5 (Future iteration)
● Data engineers
○ Port the analysis from the DataFrame API to the Dataset API (in Spark 2.0)
○ The Dataset API provides a static schema check
○ Existing domain-object-based functions can be reused
● Data scientists
○ Move from Scala window-based analysis to Spark SQL window analytics
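Both planned changes can be illustrated in one short sketch: a typed `Dataset` built on a domain object, and a moving average expressed with Spark SQL window functions instead of hand-written Scala code. The `StockPrice` case class, its field names and the sample rows are assumptions; the API calls are those of Spark 2.1 or later.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}

// Hypothetical domain object; the field names are assumptions.
final case class StockPrice(symbol: String, date: String, close: Double)

object WindowSketch {
  /** Trailing moving average of `close` over the last `days` rows for each symbol. */
  def movingAverage(prices: Dataset[StockPrice], days: Int): DataFrame = {
    val trailing = Window
      .partitionBy(col("symbol"))
      .orderBy(col("date")) // ISO dates sort correctly as strings
      .rowsBetween(-(days - 1), Window.currentRow)
    prices.withColumn("moving_avg", avg(col("close")).over(trailing))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("window-sketch").getOrCreate()
    import spark.implicits._

    val prices: Dataset[StockPrice] = Seq(
      StockPrice("ACME", "2016-01-04", 10.0),
      StockPrice("ACME", "2016-01-05", 12.0),
      StockPrice("ACME", "2016-01-06", 14.0)
    ).toDS()

    // Typed access: misspelling `close` here is a compile error, not a runtime failure.
    val highest = prices.map(_.close).collect().max
    println(s"highest close: $highest")

    movingAverage(prices, days = 2).show()
    spark.stop()
  }
}
```

With the window defined declaratively, Spark plans the partitioning and ordering itself, so the analysis no longer has to load a specific slice of data into memory by hand.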
Lookback
● Spark version
○ 0.9 -> 1.6.0
● API
○ RDD -> Dataframe -> Dataset
● Deployment
○ EC2 -> EMR -> DB cloud
● Scheduling
○ Jenkins -> DB cloud Jobs
● Language
○ Scala
Lookback
● Data format
○ Text -> Parquet
● Storage
○ HDFS -> S3
● Deployment
○ Jenkins -> Travis
References
● http://go.databricks.com/databricks-community-edition-beta-waitlist
● https://databricks.com/blog/2014/07/14/databricks-cloud-making-big-data-easy.html
● http://shashankgowda.com/2016/02/20/introduction-to-dataset-api-in-spark.html
Thank you
