Programming in Spark
Lessons Learned in OpenAire project
Łukasz Dumiszewski, ICM, University of Warsaw, 10.2016
Duration: 1h, Requirements: knowledge of Apache Spark
Goals of work in OpenAire (IIS)
- Rewriting OpenAire (IIS) from MR/Pig to Spark (several modules, e.g. citation-matching)
- Improvement of project structure
- Enhancement of integration tests
- Creation of new modules: matching of publication affiliations, IIS execution report, etc.
Problems and solutions
- Programming language
- Coding standards
- Data serialization and cache
- Code serialization
- Data storage format
- Accumulators
- Piping to external programs
- Testing
Programming language
Java 8: no problems encountered; the Java Spark API is friendly and the code stays readable.
Coding standards
Standard programming practices (low coupling, high cohesion). Possible use of Spring for dependency injection.
Pros: code readability and reliability, easy development and testing.
See:
AffMatchingService
AffMatchingJob
Data serialization and cache
Kryo serialization: fast and efficient
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
Problem with deserialization of Avro collections (Avro-specific list implementation).
Solved by implementing a custom Kryo registrator:
conf.set("spark.kryo.registrator",
    "eu.dnetlib.iis.wf.affmatching.AffMatchingJob$AvroCompatibleKryoRegistrator");
The registrator can be found at: https://github.com/CeON/spark-utils/
Code serialization
Big misunderstanding in books and on the Internet (?)
"Learning Spark", in the paragraph describing the change of spark.serializer to Kryo, says:
"Whether using Kryo or Java's serializer, you may encounter a NotSerializableException if your code refers to a class that does not extend Java's Serializable interface …"
The setting spark.serializer does not refer to code serialization but to data serialization.
It is spark.closure.serializer that corresponds to code serialization, and it uses Java serialization by default (changing it is not recommended, because the amount of data serialized and sent in this case is small; the setting exists in Spark 1.x, and from Spark 2.0 on closures are always serialized with Java serialization). This is why classes referenced from the code have to implement Serializable (or Externalizable); otherwise we get NotSerializableException.
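The point about code serialization can be reproduced without Spark. The sketch below is plain Java under stated assumptions: `PlainService`, `SerializableService` and `SerFunction` are hypothetical names invented for illustration, and `SerFunction` only mimics the fact that Spark's Java function interfaces extend `Serializable`. It Java-serializes two closures, one capturing a non-serializable object and one capturing a serializable object.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

class ClosureSerializationDemo {

    // Hypothetical service that does NOT implement Serializable.
    static class PlainService {
        String normalize(String s) { return s.trim().toLowerCase(); }
    }

    // The same hypothetical service, made Serializable.
    static class SerializableService implements Serializable {
        private static final long serialVersionUID = 1L;
        String normalize(String s) { return s.trim().toLowerCase(); }
    }

    // A serializable function type (an assumption standing in for Spark's function interfaces).
    interface SerFunction<T, R> extends Function<T, R>, Serializable { }

    static SerFunction<String, String> closureOverPlainService() {
        PlainService service = new PlainService();
        return s -> service.normalize(s); // captures a non-serializable object
    }

    static SerFunction<String, String> closureOverSerializableService() {
        SerializableService service = new SerializableService();
        return s -> service.normalize(s); // captures a serializable object
    }

    // Attempts plain Java serialization of the closure, including everything it captured.
    static boolean isJavaSerializable(Object closure) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(closure);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isJavaSerializable(closureOverPlainService()));
        System.out.println(isJavaSerializable(closureOverSerializableService()));
    }
}
```

The first closure fails because serializing the lambda also serializes the captured `PlainService`. No data serializer is involved at any point, which is the distinction the slide draws.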
Code serialization or mapPartitions?
It does not make sense to write functions that operate on partitions (like mapPartitions) and to create service beans inside these partitions only to avoid serializing the code and sending it between nodes. Serializing and copying the code does not have a big influence on the efficiency of an application. Using mapPartitions complicates the code and makes it difficult to write unit tests.
void execute() {
    // Not recommended: the service is created inside the partition function
    // only so that it never has to be serialized.
    rdd.mapPartitions(iter -> {
        SomeService service = new SomeService();
        service.generate…
        ...
        return someCollection;
    });
}
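For contrast, here is a minimal sketch of the alternative the slide argues for: a serializable service injected into the job once and called from an element-wise function. All names (`ParityService`, `ParityJob`) are hypothetical, and a `List` with a Java stream stands in for the RDD so that the example runs without Spark.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class InjectedServiceDemo {

    // Hypothetical service; Serializable so that it could travel inside a Spark closure.
    static class ParityService implements Serializable {
        private static final long serialVersionUID = 1L;
        String describe(int n) { return n % 2 == 0 ? "even" : "odd"; }
    }

    // Hypothetical job: the service is injected once instead of being re-created per partition.
    static class ParityJob implements Serializable {
        private static final long serialVersionUID = 1L;
        private final ParityService service;

        ParityJob(ParityService service) { this.service = service; }

        // With Spark this would be roughly rdd.map(service::describe);
        // here a List stands in for the RDD.
        List<String> execute(List<Integer> input) {
            return input.stream().map(service::describe).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        ParityJob job = new ParityJob(new ParityService());
        System.out.println(job.execute(Arrays.asList(1, 2, 3)));
    }
}
```

Because the service arrives through the constructor, a unit test can pass in a mock or a stub, which is awkward when the service is created inside a mapPartitions body.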
Data storage format
To read an Avro file you can use the standard Hadoop API:
JavaPairRDD<AvroKey<T>, NullWritable> inputRecords = (JavaPairRDD<AvroKey<T>, NullWritable>)
    sc.newAPIHadoopFile(avroDatastorePath, AvroKeyInputFormat.class,
        avroRecordClass, NullWritable.class, job.getConfiguration());
Data storage format
Problem: when using the standard Hadoop API in Spark, you can come across unpredictable errors, because the Hadoop record reader reuses the same Writable object for all records read.
This is not a problem in MapReduce jobs, where each record is processed separately. In Spark, however, it can sometimes lead to undesired effects. For example, when an RDD is cached, only the last object read may end up in the cache (repeated as many times as there were records). The likely cause is that the cache holds many references to the same reused object.
To avoid this, clone each Avro record right after it has been read.
See: spark-utils/SparkAvroLoader
JavaRDD<DocumentToProject> docProjects = avroLoader.loadJavaRDD(sc, inputPath, DocumentToProject.class);
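The object-reuse effect can be shown in plain Java. The sketch below is a simulation, not Hadoop or Spark code: `Record` and `reusingReader` are hypothetical stand-ins for an Avro record and for a record reader that refills one shared instance. Keeping the references as read yields the last value repeated, while copying each record first preserves the data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class RecordReuseDemo {

    // Hypothetical mutable record, standing in for an Avro record wrapped in a Writable.
    static class Record {
        String value;

        Record copy() {
            Record c = new Record();
            c.value = value;
            return c;
        }
    }

    // Mimics a record reader that refills ONE shared instance for every record.
    static Iterator<Record> reusingReader(List<String> raw) {
        Record shared = new Record();
        Iterator<String> source = raw.iterator();
        return new Iterator<Record>() {
            @Override public boolean hasNext() { return source.hasNext(); }
            @Override public Record next() {
                shared.value = source.next();
                return shared;
            }
        };
    }

    // "Caches" the records as read: every list slot points at the same object.
    static List<String> cacheAsRead(List<String> raw) {
        List<Record> cached = new ArrayList<>();
        reusingReader(raw).forEachRemaining(cached::add);
        return values(cached);
    }

    // Copies each record right after reading it, then caches the copy.
    static List<String> cacheCopies(List<String> raw) {
        List<Record> cached = new ArrayList<>();
        reusingReader(raw).forEachRemaining(r -> cached.add(r.copy()));
        return values(cached);
    }

    private static List<String> values(List<Record> records) {
        List<String> out = new ArrayList<>();
        for (Record r : records) {
            out.add(r.value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("a", "b", "c");
        System.out.println(cacheAsRead(raw));  // [c, c, c]
        System.out.println(cacheCopies(raw));  // [a, b, c]
    }
}
```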
Usage of accumulators
At first the IIS execution report was based on accumulators; later it was switched to plain RDD counts.
Use accumulators wisely (if you have to), and only in actions.
Main disadvantages of accumulators:
- They allow one to store data in custom structures. Naive usage can lead to memory problems (a copy of the accumulator on every node, all of them sent to the driver).
- When used in transformations (map, filter, etc.), repeated tasks (after a node failure or a shortage of memory) can lead to incorrect accumulator values: the value is computed and added as many times as the transformation is repeated.
More:
http://imranrashid.com/posts/Spark-Accumulators/
http://stackoverflow.com/questions/29494452/when-are-accumulators-truly-reliable
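The over-counting in transformations can be illustrated with a plain-Java simulation. This is not Spark code: the `Supplier<Stream<String>>` is a hypothetical stand-in for a lazily evaluated RDD lineage, and the `AtomicLong` plays the role of the accumulator. Each time the lineage is evaluated again (in Spark: a recomputed partition or a second action on an uncached RDD), the side effect runs again.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;
import java.util.stream.Stream;

class AccumulatorRecomputeDemo {

    // Stand-in for a lazy transformation: nothing runs until the stream is consumed,
    // and every consumption re-runs the mapping function, bumping the counter again.
    static Supplier<Stream<String>> upperCased(List<String> input, AtomicLong counter) {
        return () -> input.stream().map(s -> {
            counter.incrementAndGet(); // side effect inside a "transformation"
            return s.toUpperCase();
        });
    }

    // Evaluates the same lineage the given number of times and returns the counter value.
    static long counterAfterEvaluations(List<String> input, int evaluations) {
        AtomicLong counter = new AtomicLong();
        Supplier<Stream<String>> lineage = upperCased(input, counter);
        for (int i = 0; i < evaluations; i++) {
            lineage.get().forEach(s -> { }); // stand-in for an action or a recomputation
        }
        return counter.get();
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c");
        System.out.println(counterAfterEvaluations(records, 1)); // 3
        System.out.println(counterAfterEvaluations(records, 2)); // 6
    }
}
```

With three records the counter reads 3 after one evaluation and 6 after two, although the data never changed.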
Piping to external programs
To use an external script in a Spark job, one must upload it to every node:
SparkContext.addFile(path)
To refer to it, one should use (as the documentation comment of addFile advises):
SparkFiles.get(fileName)
The catch: this only works in local mode! In cluster mode the path is different, because the script files are placed in the working directory of each node.
Experience: many non-repeatable errors (everything was fine when a node ran on the same server as the driver).
For the solution see: DocumentClassificationJob
Unit tests
- Write unit tests as for any other Java code. It is not difficult if the code is written properly (just as it is in the case of non-distributed computing).
- Mocking JavaRDD is not a problem.
- Testing functions (lambda expressions) is a bit tedious.
See:
AffMatchingServiceTest
Testing Spark jobs
Helpful classes that facilitate the testing of Spark jobs:
spark-utils/test
Spark as an action in Oozie workflow
Only one jar can be passed to spark-submit in an Oozie action.
Use the Maven Shade plugin or a similar tool to merge many jars into one.
Integration tests of Oozie workflows
While working on IIS, one can run the Oozie workflow integration tests from an IDE (Eclipse, NetBeans).
Dedicated test code builds the Oozie packages, sends them to a server, polls for the job status and compares the results with the expected ones.
See: AbstractOozieWorkflowTestCase.java
Conclusions
- It is easier to write and test a Spark job than an equivalent chain of MapReduce jobs.
- Efficiency: after the rewrite from MR to Spark, the execution time of the CitationMatching module fell from 28h to 10h (the comparison is far from perfect because each version was run on a different cluster).
- Debugging is difficult.
- Easy integration with Oozie. Oozie workflows are less complex, because a lot of logic has been moved to Spark.
