Matthew Powers gave a presentation on optimizing Delta and Parquet data lakes. He discussed the benefits of Delta lakes, such as built-in time travel, compaction, and vacuuming, which Delta provides for free on top of Parquet files plus a transaction log. Powers demonstrated how to create, compact, vacuum, partition, filter, and update Delta lakes in Spark, and showed that partitioning data significantly improves query performance by enabling data skipping and filtering at the partition level.
11. Why does compaction speed up lakes?
• Parquet: files need to be listed before they are read, and listing is expensive in object stores.
• Delta: data is read via the transaction log.
• Easier for Spark to read partitioned lakes into memory partitions (see the compaction sketch below).
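One common way to compact a Delta lake is to read the table back and rewrite it as fewer, larger files. A minimal sketch, assuming a SparkSession named `spark` (as in spark-shell) and a hypothetical table path; the target file count is a tuning choice, not a fixed rule:

```scala
val path = "s3a://some-bucket/some-delta-lake" // hypothetical path
val numFiles = 100                             // target file count; tune per lake

// Rewriting the table as fewer, larger files reduces the number of
// files the transaction log points at, so reads touch less metadata.
spark.read
  .format("delta")
  .load(path)
  .repartition(numFiles)
  .write
  .option("dataChange", "false") // marks the rewrite as compaction, not new data
  .format("delta")
  .mode("overwrite")
  .save(path)
```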
22. Delta Lake Vacuum
• Deletes files marked for removal that are older than the retention period
• Default retention period is 7 days
• Reclaims storage, but not going to improve query performance (see the sketch below)
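A minimal vacuum sketch, assuming a SparkSession named `spark` with Delta Lake configured and a hypothetical table path:

```scala
import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath(spark, "s3a://some-bucket/some-delta-lake")

// Deletes files no longer referenced by the table that are older
// than the default 7 day retention period.
deltaTable.vacuum()

// Or pass an explicit retention period in hours (168 hours = 7 days).
deltaTable.vacuum(168)
```

This reclaims storage but, as the slide notes, doesn't speed up queries.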
28. Optimal number of partitions (parquet)
https://github.com/MrPowers/spark-daria/blob/master/src/main/scala/com/github/mrpowers/spark/daria/utils/DirHelpers.scala
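The linked spark-daria helpers compute directory sizes; the underlying idea is to divide the lake's total size by a target file size to pick a partition count. A rough sketch of that calculation, assuming a SparkSession named `spark` (the helper names here are illustrative, not spark-daria's exact API):

```scala
import org.apache.hadoop.fs.Path

// Total size in bytes of everything under a directory.
def dirSizeInBytes(dirPath: String): Long = {
  val path = new Path(dirPath)
  val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
  fs.getContentSummary(path).getLength
}

// Aim for roughly 1 GB per output file.
def numOneGBPartitions(sizeInBytes: Long): Int =
  math.max(1, (sizeInBytes / (1024L * 1024 * 1024)).toInt)

val numPartitions = numOneGBPartitions(dirSizeInBytes("s3a://some-bucket/some-lake"))
```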
29. Why partition data lakes?
• Data skipping
• Massively improve query performance
• I've seen queries run 50-100 times faster on partitioned lakes (see the sketch below)
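A minimal sketch of building a partitioned lake, with a hypothetical `country` partition column and paths; queries that filter on the partition column can skip whole directories instead of scanning every file:

```scala
val df = spark.read.format("delta").load("s3a://some-bucket/events")

// Each distinct country value gets its own directory on disk.
df.write
  .partitionBy("country") // illustrative partition column
  .format("delta")
  .mode("overwrite")
  .save("s3a://some-bucket/events_partitioned")

// Only the country=US partition is scanned, thanks to partition pruning.
val usOnly = spark.read
  .format("delta")
  .load("s3a://some-bucket/events_partitioned")
  .where("country = 'US'")
```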
37. Directly grabbing the partitions is faster for Parquet lakes…
Directly grabbing partitions was 83 times faster than relying on partition filters for a simple query
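A sketch of the two approaches on a Parquet lake partitioned by a hypothetical `date` column, assuming a SparkSession named `spark`. A partition filter forces Spark to list the whole lake before pruning, and listing is what's slow on object stores; grabbing the partition directory directly lists only that one directory:

```scala
// Partition filter: the entire lake is listed before pruning happens.
val filtered = spark.read
  .parquet("s3a://some-bucket/some-lake")
  .where("date = '2019-01-01'")

// Direct grab: only one directory is listed. Note that `date` won't
// appear as a column, since it now lives in the path, not the schema.
val direct = spark.read
  .parquet("s3a://some-bucket/some-lake/date=2019-01-01")
```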
38. Real partitioned data lake
• Updates every 3 hours
• Has 5 million files
• 15,000 files are being added every day
• Still great for a lot of queries