Albert Franzi (@FranziCros), Schibsted Media Group
Modular Apache Spark:
Transform Your Code into
Pieces
#SAISDev3
1
2
Slides available at:
bit.ly/SparkSummit-afranzi
About me
3
2013 · 2014 · 2015 · 2017 · 2018
Modular Apache Spark:
Transform Your Code into Pieces
4
5
We will learn:
- How we reduced test execution time by skipping unaffected tests --> fewer coffees
- How we simplified our Spark code by modularizing it
- How we increased the test coverage of our Spark code using Holden Karau's spark-testing-base
6
We will learn:
- How we reduced test execution time by skipping unaffected tests --> fewer coffees
- How we simplified our Spark code by modularizing it
- How we increased the test coverage of our Spark code using Holden Karau's spark-testing-base
7
Did you play with duplicated code across your Spark jobs?
@ The Shining 1980
Have you ever experienced the joy of code-reviewing a
never-ending stream of Spark chaos?
8
Please! Never, ever play with duplicated code!
@ The Shining 1980
Fragment the Spark Job
9
Spark Readers | Transformations | Joins | Aliases | Formatters | Spark Writers | Task Context | ...
Readers / Writers
10
Spark Readers
Spark Writers
● Enforce schemas
● Use schemas to read only the fields you are going to use
● Provide Readers per Dataset & attach its sources to it
● Share schemas & sources between Readers & Writers
● GDPR compliant by design
val userBehaviourSchema: StructType = ???
val userBehaviourPath =
Path("s3://<bucket>/user_behaviour/year=2018/month=10/day=03/hour=12/gen=27/")
val userBehaviourReader = ReaderBuilder(PARQUET)
.withSchema(userBehaviourSchema)
.withPath(userBehaviourPath)
.buildReader()
val df: DataFrame = userBehaviourReader.read()
GDPR
Readers
11
Spark Readers
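For reference, this is roughly what such a reader resolves to with plain Spark APIs — a minimal sketch, assuming a Parquet source and declaring only the fields the transforms later use:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().getOrCreate()

// Only the fields the job actually needs are declared, so Parquet column pruning
// keeps the read small and the schema acts as a contract for the dataset.
val userBehaviourSchema = StructType(Seq(
  StructField("event_type", StringType),
  StructField("object_type", StringType),
  StructField("object", StructType(Seq(
    StructField("location", StructType(Seq(
      StructField("address", StringType)
    )))
  )))
))

val df: DataFrame = spark.read
  .schema(userBehaviourSchema)
  .parquet("s3://<bucket>/user_behaviour/year=2018/month=10/day=03/hour=12/gen=27/")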
Readers
12
Spark Readers
val userBehaviourSchema: StructType = ???
// Path structure - s3://<bucket>/user_behaviour/[year]/[month]/[day]/[hour]/[gen]/
val userBehaviourBasePath = Path("s3://<bucket>/user_behaviour/")
val startDate: ZonedDateTime = ZonedDateTime.now(ZoneOffset.UTC)
val halfDay: Duration = Duration.ofHours(12)
val userBehaviourPaths: Seq[Path] = PathBuilder
.latestGenHourlyPaths(userBehaviourBasePath, startDate, halfDay)
val userBehaviourReader = ReaderBuilder(PARQUET)
.withSchema(userBehaviourSchema)
.withPath(userBehaviourPaths: _*)
.buildReader()
val df: DataFrame = userBehaviourReader.read()
Readers
13
val userBehaviourSchema: StructType = ???
val userBehaviourBasePath = Path("s3://<bucket>/user_behaviour/")
val startDate: ZonedDateTime = ZonedDateTime.now(ZoneOffset.UTC)
val halfDay: Duration = Duration.ofHours(12)
val userBehaviourReader = ReaderBuilder(PARQUET)
.withSchema(userBehaviourSchema)
.withHourlyPathBuilder(userBehaviourBasePath, startDate, halfDay)
.buildReader()
val df: DataFrame = userBehaviourReader.read()
Spark Readers
Readers
14
val df: DataFrame = UserBehaviourReader.read(startDate, halfDay)
Spark Readers
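How the one-liner comes together: a sketch of a dataset-specific reader object wrapping the builder from the previous slides. The API names are the ones shown above; the body itself is an assumption about the internal library, not its actual implementation.

import java.time.{Duration, ZonedDateTime}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Hypothetical wrapper: one object per dataset owns its schema and base path,
// so consumers only pass the time window they care about.
object UserBehaviourReader {
  private val userBehaviourSchema: StructType = ???
  private val userBehaviourBasePath = Path("s3://<bucket>/user_behaviour/")

  def read(startDate: ZonedDateTime, duration: Duration): DataFrame =
    ReaderBuilder(PARQUET)
      .withSchema(userBehaviourSchema)
      .withHourlyPathBuilder(userBehaviourBasePath, startDate, duration)
      .buildReader()
      .read()
}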
Transforms
15
Transformation
def transform[U](t: (Dataset[T]) ⇒ Dataset[U]): Dataset[U]
Dataset[T] ⇒ magic ⇒ Dataset[U]
Transforms
16
Transformation
def withGreeting(df: DataFrame): DataFrame = {
df.withColumn("greeting", lit("hello world"))
}
def extractFromJson(colName: String,
outputColName: String,
jsonSchema: StructType)(df: DataFrame): DataFrame = {
df.withColumn(outputColName, from_json(col(colName), jsonSchema))
}
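Both plug straight into Spark's built-in Dataset.transform; a minimal usage sketch (the payload column and its schema are invented for illustration):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

val payloadSchema: StructType = ???   // hypothetical schema of the JSON column

// df comes from the Readers slides. Zero-argument transformations are passed as-is;
// parameterised ones are partially applied so a DataFrame => DataFrame remains.
val enriched: DataFrame = df
  .transform(withGreeting)
  .transform(extractFromJson("payload", "payload_struct", payloadSchema))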
Transforms
17
Transformation
def onlyClassifiedAds(df: DataFrame): DataFrame = {
df.filter(col("event_type") === "View")
.filter(col("object_type") === "ClassifiedAd")
}
def dropDuplicates(df: DataFrame): DataFrame = {
df.dropDuplicates()
}
def cleanedCity(df: DataFrame): DataFrame = {
df.withColumn("city", getCityUdf(col("object.location.address")))
}
val cleanupTransformations: Seq[DataFrame => DataFrame] = Seq(
dropDuplicates,
cleanedCity,
onlyClassifiedAds
)
val df: DataFrame = UserBehaviourReader.read(startDate, halfDay)
val classifiedAdsDF = df.transforms(cleanupTransformations: _*)
Transforms
18
Transformation
val cleanupTransformations: Seq[DataFrame => DataFrame] = Seq(
dropDuplicates,
cleanedCity,
onlyClassifiedAds
)
val df: DataFrame = UserBehaviourReader.read(startDate, halfDay)
val classifiedAdsDF = df.transforms(cleanupTransformations: _*)
“As a data consumer, I only need to pick which transformations I would like to apply,
instead of coding them from scratch.”
“It’s like cooking: engineers provide manufactured ingredients (transformations) and Data
Scientists use the required ones for a successful recipe.”
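The transforms(...) varargs call is not part of Spark itself; below is a minimal sketch of the kind of implicit extension that could provide it (names are assumptions, not the library's actual API):

import org.apache.spark.sql.DataFrame

object DataFrameSyntax {
  // Fold a list of DataFrame => DataFrame functions over the DataFrame,
  // so pipelines read as a flat list of named steps.
  implicit class TransformsOps(df: DataFrame) {
    def transforms(ts: (DataFrame => DataFrame)*): DataFrame =
      ts.foldLeft(df)((acc, t) => acc.transform(t))
  }
}

With that in scope, df.transforms(cleanupTransformations: _*) is simply a fold of .transform calls.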
Transforms - Links of Interest
“DataFrame transformations can be defined with arguments so they don’t make
assumptions about the schema of the underlying DataFrame.” - by Matthew Powers.
bit.ly/Spark-ChainingTransformations
bit.ly/Spark-SchemaIndependentTransformations
19
Transformation
github.com/MrPowers/spark-daria
“Spark helper methods to maximize developer productivity.”
20
We will learn:
- How we reduced test execution time by skipping unaffected tests --> fewer coffees
- How we simplified our Spark code by modularizing it
- How we increased the test coverage of our Spark code using Holden Karau's spark-testing-base
21
Did you put untested Spark jobs into production?
“Mars Climate Orbiter destroyed
because of a Metric System Mixup (1999)”
Testing with Holden Karau
22
github.com/holdenk/spark-testing-base
“Base classes to use when writing tests with Spark.”
● Share Spark Context between tests from the same suite
● Provide methods to make tests easier
○ Fixture Readers
○ Json to DF converters
○ Extra validators
github.com/MrPowers/spark-fast-tests
“An alternative to spark-testing-base to run tests in parallel without restarting Spark Session
after each test file.”
Testing SharedSparkContext
23
package com.holdenkarau.spark.testing
import java.util.Date
import org.apache.spark._
import org.scalatest.{BeforeAndAfterAll, Suite}
/**
* Shares a local `SparkContext` between all tests in a suite
* and closes it at the end. You can share between suites by enabling
* reuseContextIfPossible.
*/
trait SharedSparkContext extends BeforeAndAfterAll with SparkContextProvider {
self: Suite =>
...
protected implicit def reuseContextIfPossible: Boolean = false
...
}
Testing
24
package com.schibsted.insights.test
trait SparkSuite extends DataFrameSuiteBase {
self: Suite =>
override def reuseContextIfPossible: Boolean = true
protected def createDF(data: Seq[Row], schema: StructType): DataFrame = {
spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
}
protected def jsonFixtureToDF(fileName: String, schema: Option[StructType] = None): DataFrame = {
val fixtureContent = readFixtureContent(fileName)
val fixtureJson = fixtureContentToJson(fixtureContent)
jsonToDF(fixtureJson, schema)
}
protected def checkSchemas(inputSchema: StructType, expectedSchema: StructType): Unit = {
assert(inputSchema.fields.sortBy(_.name).deep == expectedSchema.fields.sortBy(_.name).deep)
}
...
}
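A usage sketch of that trait — the suite name and fixture data are invented, and it assumes the onlyClassifiedAds transformation from the Transforms slides is in scope:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.scalatest.FunSuite

class OnlyClassifiedAdsSuite extends FunSuite with SparkSuite {
  test("keeps only ClassifiedAd view events") {
    val schema = StructType(Seq(
      StructField("event_type", StringType),
      StructField("object_type", StringType)))

    val input = createDF(Seq(
      Row("View", "ClassifiedAd"),
      Row("Click", "ClassifiedAd"),
      Row("View", "Banner")), schema)

    val expected = createDF(Seq(Row("View", "ClassifiedAd")), schema)

    // assertDataFrameEquals comes from spark-testing-base's DataFrameSuiteBase
    assertDataFrameEquals(expected, onlyClassifiedAds(input))
  }
}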
Testing
25
Spark Readers | Transformations | Spark Writers | Task Context
Testing each piece independently makes it easier to test everything together.
Test
26
We will learn:
- How we reduced test execution time by skipping unaffected tests --> fewer coffees
- How we simplified our Spark code by modularizing it
- How we increased the test coverage of our Spark code using Holden Karau's spark-testing-base
27
Are tests taking too long to execute?
Junit4Git by Raquel Pau
28
github.com/rpau/junit4git
“Junit Extensions for Test Impact Analysis.”
“This is a JUnit extension that ignores those tests that are not related with your last
changes in your Git repository.”
Junit4Git - Gradle conf
29
configurations {
agent
}
dependencies {
testCompile("org.walkmod:scalatest4git_2.11:${version}")
agent "org.walkmod:junit4git-agent:${version}"
}
test.doFirst {
jvmArgs "-javaagent:${configurations.agent.singleFile}"
}
@RunWith(classOf[ScalaGitRunner])
Junit4Git - Travis with Git notes
30
before_install:
- echo -e "machine github.com\n login $GITHUB_TOKEN" >> ~/.netrc
after_script:
- git push origin refs/notes/tests:refs/notes/tests
.travis.yml
Junit4Git - Runners
31
import org.scalatest.junit.JUnitRunner
@RunWith(classOf[JUnitRunner])
abstract class UnitSpec extends FunSuite
import org.scalatest.junit.ScalaGitRunner
@RunWith(classOf[ScalaGitRunner])
abstract class IntegrationSpec extends FunSuite
Tests to run always
Tests to run on code changes
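A concrete spec then just extends the base class matching how it should be scheduled (suite names and test bodies are invented for illustration):

// Always executed: cheap, pure logic.
class PathBuilderSpec extends UnitSpec {
  test("builds one path per hour in the requested window") { /* ... */ }
}

// Executed only when the classes it loaded in previous runs have changed,
// according to the git notes written by the agent.
class UserBehaviourReaderSpec extends IntegrationSpec {
  test("reads the latest generation for each hour") { /* ... */ }
}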
Junit4Git
32
[Sequence diagram: Runner · Listener · Agent · Test · Method · Class]
testStarted → starts the report; instrumentClass / loadClass → addTestClass; testFinished → closes the report.
Junit4Git - Notes
33
afranzi:~/Projects/insights-core$ git notes show
[
{
"test": "com.schibsted.insights.core.ContextSuite",
"method": "Context Classes with SptCommonBucketConf should contain a sptCommonBucket Field",
"classes": [
"com.schibsted.insights.core.Context",
"com.schibsted.insights.core.ContextSuite$$anonfun$1$$anon$1",
"com.schibsted.insights.core.Context$",
"com.schibsted.insights.core.SptCommonBucketConf$class",
"com.schibsted.insights.core.Context$$anonfun$getConfigParamWithLogging$1"
]
}
...
]
com.schibsted.insights.core.ContextSuite > Context Classes with SptCommonBucketConf should
contain a sptCommonBucket Field SKIPPED
com.schibsted.insights.core.ContextSuite > Context Classes with InsightsBucketConf should
contain a insightsBucket Field SKIPPED
Junit4Git - Links of Interest
34
bit.ly/SlidesJunit4Git
Real Impact Testing Analysis For JVM
bit.ly/TestImpactAnalysis
The Rise of Test Impact Analysis
Summary
35
Use Transforms as pills
Modularize your code
Don’t duplicate code between Spark Jobs
Don’t be afraid of testing everything
Share all you built as a library, so others can
reuse your code.
36