GeoTrellis is a geographic data processing engine for high-performance applications. This presentation focuses on how the Spark RDD partitioning scheme can influence the behaviour of the whole Spark application.
7. • RDD (a basic core Spark type from the past? No)
• Manual partitioning control
• DATASET
• Query planning optimizations, more relevant to data that is already well partitioned and structured (a contrast sketch follows)
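To make the contrast concrete, here is a minimal sketch (the setup and data are illustrative, not from the talk): with an RDD the partitioner is chosen by hand, while with a Dataset the Catalyst planner decides how the data moves.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()
import spark.implicits._

// RDD: manual partitioning control, we pick the Partitioner explicitly.
val rdd = spark.sparkContext
  .parallelize(Seq(1 -> "a", 2 -> "b"))
  .partitionBy(new HashPartitioner(8))

// Dataset: the query planner decides the shuffle layout for us.
val ds = Seq(1 -> "a", 2 -> "b").toDS()
val counts = ds.groupByKey { case (k, _) => k }.count()
```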
PARTITIONING SCHEME
SPECIAL BROWN-COLORED FUNCTIONS
• join
• groupByKey
• reduceByKey
• combineByKey
• repartition
• Any function that exposes a preservesPartitioning flag or can accept a Partitioner as an argument (probably plain map as well?); see the sketch below
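A minimal sketch of what this list means in practice (the data is illustrative): each of these functions either takes a Partitioner or exposes a preservesPartitioning flag, and that is exactly where the partitioning scheme is controlled or lost.

```scala
import org.apache.spark.{HashPartitioner, SparkContext}

def sketch(sc: SparkContext): Unit = {
  val partitioner = new HashPartitioner(100)

  val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b")).partitionBy(partitioner)
  val right = sc.parallelize(Seq(1 -> "x", 2 -> "y")).partitionBy(partitioner)

  // Both sides already share the same partitioner, so this join needs no shuffle.
  val joined = left.join(right)

  // reduceByKey / combineByKey / groupByKey all have overloads taking a Partitioner.
  val reduced = left.reduceByKey(partitioner, _ + _)

  // mapPartitions exposes preservesPartitioning: safe to set only when keys are untouched.
  val mapped = left.mapPartitions(
    _.map { case (k, v) => (k, v.toUpperCase) },
    preservesPartitioning = true)
}
```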
13. WAT?!
• Load data into Spark memory according to some partitioning scheme
• Ahead of a shuffle: smaller chunks are better for Spark (as the max shuffle block size is only 2 GB)
• Are we dependent on the input data type? (yes)
• Window reading (what's the desired / perfect window size?)
14. SPARK SHUFFLE BLOCK FEATURE
• ~128 MB per partition (rule of thumb)
• if (partitionsNumber ~ 2000) repartition(> 2000); see the sizing sketch below
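Both rules fold into one small helper; a back-of-the-envelope sketch (the helper name and thresholds are our assumptions, not a GeoTrellis API):

```scala
val targetPartitionBytes = 128L * 1024 * 1024 // ~128 MB per partition

def desiredPartitions(totalBytes: Long): Int = {
  val n = math.max(1, (totalBytes / targetPartitionBytes).toInt)
  // Above 2000 partitions Spark tracks shuffle blocks with the more compact
  // HighlyCompressedMapStatus; if we land near that threshold, it is better
  // to go deliberately over it than to sit just below it.
  if (n >= 1900 && n <= 2000) 2001 else n
}

// e.g. rdd.repartition(desiredPartitions(13L * 1024 * 1024 * 1024)) // 13 GB -> 104 partitions
```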
16. WINDOWED READS
• Essentially a crop function applied by grid bounds to each element: tiff.crop(gridBounds) (this is what the rr.readWindows function does); a sketch follows
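A hedged sketch of the idea (the window-enumeration helper is ours; tiff.crop(gridBounds) is the call quoted on the slide): enumerate fixed-size grid windows over the raster and crop one element per window, instead of materializing the whole tiff at once.

```scala
import geotrellis.raster.GridBounds

// Enumerate fixed-size windows over a totalCols x totalRows grid.
def windows(totalCols: Int, totalRows: Int, windowSize: Int): Seq[GridBounds] =
  for {
    colMin <- 0 until totalCols by windowSize
    rowMin <- 0 until totalRows by windowSize
  } yield GridBounds(
    colMin,
    rowMin,
    math.min(colMin + windowSize - 1, totalCols - 1),
    math.min(rowMin + windowSize - 1, totalRows - 1))

// Each window is cropped independently, so the list can be parallelized
// across an RDD before any pixel data is read:
// windows(tiff.cols, tiff.rows, 512).map(gb => tiff.crop(gb))
```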
17. WINDOWED READS
• 13 GB does not load efficiently into the memory of three AWS m3.xlarge instances.
19. WINDOWED READS
• The solution is to pack segments into the desired windows based on the input format requirements (a packing sketch follows)
• After all, the main idea is to leverage the gains of a good partitioning scheme
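A sketch of the packing idea (the Segment type and the greedy strategy are our assumptions, not the actual GeoTrellis code): group consecutive segments until a window reaches the target byte size, so every window carries a comparable amount of work.

```scala
final case class Segment(id: Int, sizeBytes: Long)

// Greedily pack consecutive segments into windows of at most windowBytes.
def pack(segments: Seq[Segment], windowBytes: Long): Seq[Seq[Segment]] = {
  val (done, last) = segments.foldLeft((List.empty[List[Segment]], List.empty[Segment])) {
    case ((acc, current), seg) =>
      if (current.nonEmpty && current.map(_.sizeBytes).sum + seg.sizeBytes > windowBytes)
        (current.reverse :: acc, List(seg)) // window full: close it, start a new one
      else
        (acc, seg :: current)               // keep filling the current window
  }
  (if (last.nonEmpty) last.reverse :: done else done).reverse
}
```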
20. READ / WRITE
• Space-filling curve (SFC) index and parallelism level control (see the Z-order sketch below)
• Cassandra and range queries example (range queries, compared to the Spark Cassandra connector; query parallelism inside Spark partitions)
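To illustrate the SFC index point, here is a toy Z-order (Morton) sketch, not GeoTrellis's actual index implementation: interleaving the bits of the column and row coordinates produces a one-dimensional key, so a spatial query becomes a handful of key ranges that map directly onto Cassandra range queries, and how many ranges are handed to each Spark partition controls the read parallelism.

```scala
// Interleave the bits of (col, row) into a single Z-order key.
def zIndex(col: Int, row: Int): Long = {
  def spread(v: Int): Long =
    (0 until 32).foldLeft(0L)((acc, i) => acc | (((v.toLong >> i) & 1L) << (2 * i)))
  spread(col) | (spread(row) << 1)
}

// A 2x2 query window yields the contiguous key range 0..3, i.e. a single
// range query against the backend:
// Seq((0, 0), (1, 0), (0, 1), (1, 1)).map((zIndex _).tupled)  // -> 0, 1, 2, 3
```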
22. API & SPARK PROBLEMS
• Spark has its limitations
• It’s not required for a small data amount
(In the real time case even milliseconds are important, otherwise we have to live
somehow with the Spark slow responses)
• The second API in addition to the RDD API is
the answer?
(Collections API; does it make any sense to abstract over RDDs and Collections?)
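One way such an abstraction could look (a hypothetical typeclass sketch, not the actual GeoTrellis Collections API): write the pipeline once against a higher-kinded M[_] and instantiate it for both RDD and plain Scala collections.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

trait Ops[M[_]] {
  def map[A, B: ClassTag](m: M[A])(f: A => B): M[B]
  def filter[A](m: M[A])(p: A => Boolean): M[A]
}

object Ops {
  implicit val rddOps: Ops[RDD] = new Ops[RDD] {
    def map[A, B: ClassTag](m: RDD[A])(f: A => B): RDD[B] = m.map(f)
    def filter[A](m: RDD[A])(p: A => Boolean): RDD[A] = m.filter(p)
  }
  implicit val seqOps: Ops[Seq] = new Ops[Seq] {
    def map[A, B: ClassTag](m: Seq[A])(f: A => B): Seq[B] = m.map(f)
    def filter[A](m: Seq[A])(p: A => Boolean): Seq[A] = m.filter(p)
  }
}

// The same pipeline runs on a cluster (RDD) or locally (Seq), so small,
// latency-sensitive requests can skip Spark entirely.
def positiveDoubled[M[_]](values: M[Int])(implicit ops: Ops[M]): M[Int] =
  ops.map(ops.filter(values)(_ > 0))(_ * 2)
```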