Apache Spark 2.2 shipped with a state-of-art cost-based optimization framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, NULL values, max/min, avg/max length, etc.) to improve the quality of query execution plans. Skewed data distributions are often inherent in many real world applications. In order to deal with skewed distributions effectively, we added equal-height histograms to Apache Spark 2.3. Leveraging reliable statistics and histogram helps Spark make better decisions in picking the most optimal query plan for real world scenarios.
In this talk, we’ll take a deep dive into how Spark’s Cost-Based Optimizer estimates the cardinality and size of each database operator. Specifically, for skewed distribution workload such as TPC-DS, we will show histogram’s impact on query plan change, hence leading to performance gain.