Uber developed an new Spark ingestion system, Marmaray, for data ingestion from various sources. It’s designed to ingest billions of Kafka messages every 30 minutes. The amount of data handled by the pipeline is of the order hundreds of TBs. Omar details how to tackle such scale and insights into the optimizations techniques. Some key highlights are how to understand bottlenecks in Spark applications, to cache or not to cache your Spark DAG to avoid rereading your input data, how to effectively use accumulators to avoid unnecessary Spark actions, how to inspect your heap and nonheap memory usage across hundreds of executors, how you can change the layout of data to save long-term storage cost, how to effectively use serializers and compression to save network and disk traffic, and how to reduce amortize the cost of your application by multiplexing your jobs, different techniques for reducing memory footprint, runtime, and on-disk usage. CGI was able to significantly (~10%–40%) reduce memory footprint, runtime, and disk usage.
Speaker: Omkar Joshi (Uber)
Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.
2. Omkar Joshi
omkar@uber.com
● Software engineer on Hadoop
Data Platform team
● 2.5 years at Uber
● Architected Object store & NFS
solutions at Hedvig
● Hadoop Yarn committer
● Enjoy gardening & hiking in free
time
8. Common Use Cases
● Memory Deep Dive: Heap, RSS, Direct Buffer, Mapped Buffer Pool
○ Right-size Container : Reduced Memory from 7G to 5G for 1000+ Executors
○ Off-Heap/Native Memory Tuning
● Hive On Spark Efficiency
○ ~70% queries
use <80%
allocated memory resource
9. Advanced Use Cases
● Trace arbitrary Java method argument
e.g. org.apache.hadoop.hdfs.protocolPB.
ClientNamenodeProtocolTranslatorPB.addBlock.1
Access Auditing:
Read/Write HDFS Files from individual Spark App
10. Advanced Use Cases
● Trace arbitrary Java method during
runtime
○ For example:
org.apache.hadoop.hdfs.protocolPB.Client
NamenodeProtocolTranslatorPB.getBlock
Locations
● Flamegraph
○ Trace CPU Hotspot
○ Helped finding serialization take
significant time during writing shuffle
data
11. Auto Tune (Work in Progress)
● Problem: data scientist using team level Spark conf template
● Known Daily Applications
○ Use historical run to set Spark configurations (memory, vcore, etc.) *
● Ad-hoc Repeated (Daily) Applications
○ Use Machine Learning to predict resource usage
○ Challenge: Feature Engineering
■ Execution Plan
Project [user_id, product_id, price]
Filter (date = 2019-01-31 and product_id = xyz)
UnresolvedRelation datalake.user_purchase
* Done by our colleague Abhishek Modi
12. Open Sourced in September 2018
https://github.com/uber/marmaray
Blog Post:
https://eng.uber.com/marmaray-hado
op-ingestion-open-source/
13. High-Level Architecture
Chain of converters
Schema Service
Input
Storage
System
Source
Connector
M3 Monitoring System
Work
Unit
Calculator
Metadata Manager
(Checkpoint store)
Converter1 Converter 2
Sink
Connector
Output
Storage
System
Error Tables
Datafeed Config Store
Spark execution framework
16. Effective data layout (parquet)
● Parquet uses columnar compression
● Columnar compression savings outperform gz or snappy compression
savings
● Align records such that adjacent rows have identical column values
○ Eg. For state column California.
● Sort records with increasing value of cardinality.
● Generalization sometimes is not possible; If it is a framework provide custom
sorting.
17. User Id First Name City State Rider score
abc123011101 John New York City New York 4.95
abc123011102 Michael San Francisco California 4.92
abc123011103 Andrea Seattle Washington 4.93
abc123011104 Robert Atlanta City Georgia 4.95
User Id First Name City State Rider score
cas123032203 Sheetal Atlanta City Georgia 4.97
dsc123022320 Nikki Atlanta City Georgia 4.95
ssd012320212 Dhiraj Atlanta City Georgia 4.94
abc123011104 Robert Atlanta City Georgia 4.95
22. ● Why Kryo?
○ Lesser memory footprint than Java serializer.
○ Faster and supports custom serializer
● Bug fix to truly enable kryo serialization
○ Spark kryo config prefix change.
● What all is needed to take advantage of that
○ Set “spark.serializer” to “org.apache.spark.serializer.KryoSerializer”
○ Registering avro schemas to spark conf (sparkConf.registerAvroSchemas())
■ Useful if you are using Avro GenericRecord (Schema + Data)
○ Register all classes which are needed while doing spark transformation
■ Use “spark.kryo.classesToRegister”
■ Use “spark.kryo.registrationRequired” to find missing classes
Kryo serialization
23. Kryo serialization contd..
Data (128)
Data (128)
Data (128)
Data (128)
Data (128)
Avro Schema (4K)
Avro Schema (4K)
Avro Schema (4K)
Avro Schema (4K)
Avro Schema (4K)
Data (128)
Data (128)
Data (128)
Data (128)
Data (128)
Schema identifier(4)
Schema identifier(4)
Schema identifier(4)
Schema identifier(4)
Schema identifier(4)
1 Record = 4228 Bytes 1 Record = 132 Bytes (97% savings)
24. Reduce ser/deser time by restructuring payload
1. @AllArgsConstructor
2. @Getter
3. private class SparkPayload {
4. private final String sortingKey;
5. // Map with 1000+ entries.
6. private final Map<String, GenericRecord>
data;
7. }
8.
1. @Getter
2. private class SparkPayload {
3. private final String sortingKey;
4. // Map with 1000+ entries.
5. private final byte[] serializedData;
6.
7. public SparkPayload(final String sortingKey,
Map<String, GenericRecord> data) {
8. this.sortingKey = sortingKey;
9. this.serializedData =
KryoSerializer.serialize(data);
10. }
11.
12. public Map<String, GenericRecord> getData() {
13. return
KryoSerializer.deserialize(this.serializedData,
Map.class);
14. }
15. }
25. Parallelizing spark’s iterator
jsc.mapPartitions (iterator -> while (i.hasnext) { parquetWriter.write(i.next); })
MapPartitions Stage (~45min) MapPartitions Stage (~25min)
Reading from disk
Fetching record from
Spark’s Iterator
Writing to Parquet
Writer in memory
Parquet columnar
compression
FileSystem write
Reading from disk
Fetching record from
Spark’s Iterator
Writing to Parquet
Writer in memory
Parquet columnar
compression
FileSystem write
Spark’s Thread
New writer
Thread
Memory
buffer
Spark’s Thread