This document discusses lessons learned from rewriting parts of the OpenAire project to use Apache Spark. It covers choosing Kryo for efficient data serialization, understanding that spark.closure.serializer controls code serialization, using accumulators carefully, and testing Spark jobs, including unit tests and integration with Oozie workflows. The rewrite resulted in faster execution times for some modules, such as CitationMatching.
Programming in Spark - Lessons Learned in OpenAire project
1. Programming in Spark
Lessons Learned in OpenAire project
Łukasz Dumiszewski, ICM, University of Warsaw, 10.2016
Duration: 1h, Requirements: knowledge of Apache Spark
2. Goals of work in OpenAire (IIS)
Rewriting of OpenAire (IIS) from MapReduce/Pig to Spark (several modules, e.g. citation-matching)
Improvement of the project structure
Enhancement of integration tests
Creation of new modules: matching of publication affiliations, IIS execution report, etc.
3. Problems and solutions
Programming language
Coding standards
Data serialization and cache
Code serialization
Data storage format
Accumulators
Piping to external programs
Testing
5. Coding standards
Standard programming practices (low coupling, high cohesion). Possible use of Spring for dependency injection.
Pros: code readability and reliability, easy development and testing
See:
AffMatchingService
AffMatchingJob
6. Data serialization and cache
Kryo serialization – fast and efficient:
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
Problem with deserialization of Avro collections (Avro uses its own list implementation).
Solved by implementing a custom Kryo registrator:
conf.set("spark.kryo.registrator",
    "eu.dnetlib.iis.wf.affmatching.AffMatchingJob$AvroCompatibleKryoRegistrator");
The registrator can be found at: https://github.com/CeON/spark-utils/
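A sketch of what such a registrator can look like (a simplification, not the exact code from spark-utils): Avro's GenericData.Array has no no-arg constructor, so it is registered with a collection serializer that deserializes it into a plain ArrayList:

import java.util.ArrayList;
import java.util.Collection;
import org.apache.avro.generic.GenericData;
import org.apache.spark.serializer.KryoRegistrator;
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.serializers.CollectionSerializer;

public class AvroCompatibleKryoRegistrator implements KryoRegistrator {

    @Override
    public void registerClasses(Kryo kryo) {
        // deserialize Avro's list implementation as a plain ArrayList
        kryo.register(GenericData.Array.class, new CollectionSerializer() {
            @Override
            protected Collection create(Kryo kryo, Input input, Class<Collection> type) {
                return new ArrayList<>();
            }
        });
    }
}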
7. Code serialization
Big misunderstanding in books and on the Internet (?)
„Learning Spark”, in the paragraph describing the change of spark.serializer to Kryo:
"Whether using Kryo or Java's serializer, you may encounter a NotSerializableException if your code refers to a class that does not extend Java's Serializable interface …"
The setting spark.serializer does not refer to code serialization but to data serialization.
It is spark.closure.serializer that corresponds to code serialization, and it uses Java serialization by default (changing it is not recommended, because only a small amount of data is serialized and sent in this case). For this reason, classes referenced in closures have to implement Serializable (or Externalizable); otherwise we get a NotSerializableException.
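A minimal illustration of the consequence (class and variable names are ours, not from IIS): any object referenced inside a closure is serialized with it, so its class must implement Serializable:

import java.io.Serializable;

// referenced inside the map() closure below, so it must be Serializable;
// otherwise Spark throws NotSerializableException while serializing
// the closure with the default Java closure serializer
public class TitleNormalizer implements Serializable {
    public String normalize(String title) {
        return title.trim().toLowerCase();
    }
}

// in the job:
TitleNormalizer normalizer = new TitleNormalizer();
JavaRDD<String> normalized = titles.map(title -> normalizer.normalize(title));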
8. Code serialization or mapPartitions?
It does not make sense to write functions that operate on partitions (like mapPartitions) and to create service beans inside those partitions only to avoid serializing and sending the code between nodes. Serializing and copying the code does not have a significant influence on the efficiency of an application, while using mapPartitions complicates the code and makes it difficult to write unit tests.
void execute() {
    rdd.mapPartitions(iter -> {
        // anti-pattern: the service is instantiated inside every partition
        // only to avoid serializing it with the closure
        SomeService service = new SomeService();
        List<SomeResult> results = new ArrayList<>();
        while (iter.hasNext()) {
            results.add(service.generate(iter.next()));
        }
        return results;   // Spark 1.x: FlatMapFunction returns an Iterable
    });
}
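The simpler equivalent (a sketch; SomeService is the illustrative name from above and, as discussed on the previous slide, must implement Serializable):

void execute() {
    // the service is created once and serialized with the closure;
    // the serialization cost is negligible
    SomeService service = new SomeService();
    JavaRDD<SomeResult> results = rdd.map(record -> service.generate(record));
}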
9. Data storage format
To read an Avro file you can use the standard Hadoop API:
JavaPairRDD<AvroKey<T>, NullWritable> inputRecords =
    (JavaPairRDD<AvroKey<T>, NullWritable>) sc.newAPIHadoopFile(
        avroDatastorePath, AvroKeyInputFormat.class, avroRecordClass,
        NullWritable.class, job.getConfiguration());
10. Data storage format
Problem: when using the standard Hadoop API in Spark, you can come across unpredictable errors, because the Hadoop record reader reuses the same Writable object for all records read.
This is not a problem in the case of MapReduce jobs, where each record is processed separately. In Spark, however, it can sometimes lead to undesired effects. For example, when caching an RDD, only the last object read will be cached (multiple times, as many times as there are records read). This is most likely caused by the RDD holding multiple references to the same reused object.
To eliminate this phenomenon, one should clone each Avro record after it has been read.
See: spark-utils/SparkAvroLoader
JavaRDD<DocumentToProject> docProjects =
    avroLoader.loadJavaRDD(sc, inputPath, DocumentToProject.class);
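Internally, the cloning boils down to a deep copy of each record. A sketch of the idea, reusing the inputRecords pair RDD from the previous slide and assuming T = DocumentToProject (the actual implementation is in SparkAvroLoader):

import org.apache.avro.specific.SpecificData;

// copy each record out of the reused AvroKey before Spark keeps
// references to it (e.g. in the cache); deepCopy creates a fresh instance
JavaRDD<DocumentToProject> docProjects = inputRecords.map(pair ->
        SpecificData.get().deepCopy(DocumentToProject.SCHEMA$, pair._1().datum()));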
11. Usage of accumulators
At first the IIS execution report was based on accumulators; later it was replaced with plain RDD counts.
Use them wisely (if you have to), and only in actions.
Main disadvantages of accumulators:
They allow one to store data in custom structures; naive usage can lead to memory problems (accumulators exist on every node and are sent to the driver).
When used in transformations (map, filter, etc.), repeated tasks (in case of a node failure or a memory deficit) can lead to incorrect accumulator values: they are incremented as many times as a given transformation has been executed. A sketch of this pitfall follows the links below.
More:
http://imranrashid.com/posts/Spark-Accumulators/
http://stackoverflow.com/questions/29494452/when-are-accumulators-truly-reliable
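A minimal sketch of the second pitfall (variable names are illustrative; Spark 1.x API):

Accumulator<Integer> emptyLines = sc.accumulator(0);

JavaRDD<String> nonEmpty = lines.filter(line -> {
    if (line.isEmpty()) {
        emptyLines.add(1);   // unreliable: filter() is a transformation,
    }                        // so re-executed tasks count records twice
    return !line.isEmpty();
});

nonEmpty.count();            // the action that actually runs the tasks

// the safer alternative IIS switched to: derive counts from the RDDs
long emptyCount = lines.count() - nonEmpty.count();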
12. Piping to external programs
To use an external script in a Spark job, one must upload it to every node:
SparkContext.addFile(path)
To refer to it, one should use (as advised in the documentation of addFile):
SparkFiles.get(fileName)
It is just that… it only works in local mode! In cluster mode the path is different: the script files land in the working directory of each node.
Experience: many non-reproducible errors (everything was fine when a node was on the same server as the driver).
For solution see: DocumentClassificationJob
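A sketch of the workaround (the actual solution is in DocumentClassificationJob; the script name is made up): fall back to the bare file name, which resolves against the executor's working directory in cluster mode:

import java.io.File;
import org.apache.spark.SparkFiles;

// upload the script to every node
sc.addFile("hdfs:///path/to/classify.py");

JavaRDD<String> classified = records.pipe("python " + resolveScript("classify.py"));

// returns a path that is valid both in local and in cluster mode
static String resolveScript(String fileName) {
    String path = SparkFiles.get(fileName);            // correct in local mode
    return new File(path).exists() ? path : fileName;  // cluster: working dir
}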
13. Unit tests
Write unit tests as for any other Java code. It is not difficult if the code is written properly (just as in the case of non-distributed computing).
Mocking JavaRDD is not a problem, as the sketch below shows.
Testing functions (lambda expressions) is a bit tedious.
See:
AffMatchingServiceTest
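For illustration, such a test can look roughly like this (SomeService and its process method are made-up names; Mockito 1.x is assumed):

import static org.junit.Assert.assertSame;
import static org.mockito.Mockito.*;

import org.apache.spark.api.java.JavaRDD;
import org.junit.Test;

public class SomeServiceTest {

    @Test
    @SuppressWarnings("unchecked")
    public void process_shouldMapInputRdd() {
        // JavaRDD is an ordinary class and can be mocked like any collaborator
        JavaRDD<String> input = mock(JavaRDD.class);
        JavaRDD<String> mapped = mock(JavaRDD.class);
        doReturn(mapped).when(input).map(any());

        JavaRDD<String> result = new SomeService().process(input);

        assertSame(mapped, result);
    }
}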
15. Spark as an action in Oozie workflow
Only one jar can be passed to 'spark-submit' in an Oozie action.
Use the Maven Shade Plugin or a similar tool to merge many jars into one, e.g. as sketched below.
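A minimal sketch of such a configuration in the job module's pom.xml (version omitted; the details depend on the project):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>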
16. Integration tests of oozie workflows
While working on IIS, one can fire the Oozie workflow integration tests from an IDE (Eclipse, NetBeans).
The test code creates the Oozie packages, sends them to a server, polls the job status and compares the results with the expected ones.
See: AbstractOozieWorkflowTestCase.java
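For illustration, such a test might look roughly like this (a sketch only; the workflow path and the testWorkflow method name are assumptions, not a verified API):

public class AffMatchingWorkflowTest extends AbstractOozieWorkflowTestCase {

    @Test
    public void affMatchingWorkflow() {
        // builds the Oozie package, submits it to the test cluster,
        // polls the job status and verifies the produced datastores
        testWorkflow("eu/dnetlib/iis/wf/affmatching/main");
    }
}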
17. Conclusions
It is easier to write and to test a Spark job than an equivalent chain of MapReduce jobs.
Efficiency: after the rewrite from MapReduce to Spark, the CitationMatching module execution time fell from 28h to 10h (the comparison is far from perfect, because each version was run on a different cluster).
Debugging is difficult.
Easy integration with Oozie: Oozie workflows are less complex, since a lot of logic has been moved to Spark.