
202201 AWS Black Belt Online Seminar Apache Spark Performance Tuning for AWS Glue



Latest AWS Black Belt Online Seminar content: https://aws.amazon.com/jp/aws-jp-introduction/#new
List of content from past online seminars: https://aws.amazon.com/jp/aws-jp-introduction/aws-jp-webinar-service-cut/



202201 AWS Black Belt Online Seminar Apache Spark Performance Tuning for AWS Glue

  1. 1. © 2022, Amazon Web Services, Inc. or its Affiliates. Chie Hayashida, Solutions Architect, 2022/01 AWS Glue Spark Performance Tuning
  2. 2. © 2022, Amazon Web Services, Inc. or its Affiliates. Self Introduction Chie Hayashida, Amazon Web Services Japan, Solutions Architect
  3. 3. © 2022, Amazon Web Services, Inc. or its Affiliates. The target audience of this deck o People who have done the AWS Glue tutorial or have equivalent knowledge o People who have written Apache Spark applications o People who would like to improve their existing AWS Glue jobs o The code in these slides is all PySpark, because most AWS Glue users choose PySpark o These slides cover Glue 2.0 (Spark 2.4.3) and Glue 3.0 (Spark 3.1.1)
  4. 4. © 2022, Amazon Web Services, Inc. or its Affiliates. Agenda • Architecture of AWS Glue and Apache Spark • AWS Glue features related to performance • How to proceed with AWS Glue Spark performance tuning • AWS Glue Spark performance tuning patterns
  5. 5. © 2022, Amazon Web Services, Inc. or its Affiliates. Agenda • Architecture of Apache Spark • AWS Glue specific features (performance related) • How to proceed with performance tuning of AWS Glue Spark • Basic strategy for tuning AWS Glue Spark jobs • Tuning Patterns for AWS Glue Spark Jobs
  6. 6. © 2022, Amazon Web Services, Inc. or its Affiliates. AWS Glue and Apache Spark: a managed service for Apache Spark. 1) The crawler crawls the data source 2) The data catalog manages metadata 3) Jobs are triggered manually, by schedule, or by event 4) The serverless engine extracts data from the input data source 5) It runs the transformation job and loads the data into the target data store or other AWS services
  7. 7. © 2022, Amazon Web Services, Inc. or its Affiliates. Architecture of Apache Spark
  8. 8. © 2022, Amazon Web Services, Inc. or its Affiliates. Architecture of Apache Spark (cluster mode) • The driver divides a job into one or more tasks and assigns them to executors • On a single worker node, more than one executor can be started • More than one task can be executed on a single executor (Diagram: Driver Program with SparkContext, Cluster Manager, and Worker Nodes running Executors with Tasks and Cache)
  9. 9. © 2022, Amazon Web Services, Inc. or its Affiliates. Architecture of Spark (cluster mode) 1) The driver requests the resources needed by the application from the cluster manager. 2) The executors required for the job are launched on each worker. 3) The driver divides the processing into tasks and assigns them to the executors. 4) Each executor runs its assigned tasks and informs the driver when a task completes; data is exchanged between tasks as necessary. 5) After steps 3) and 4) have been repeated several times, the result of the processing is returned.
  10. 10. © 2021, Amazon Web Services, Inc. or its Affiliates. How data is processed • The data being processed is defined as a distributed collection called an RDD • An RDD is made up of one or more "partitions" • A partition is processed by a single "task" • Actual Spark code is usually written through an interface called DataFrame, which treats the data as a typed table (Diagram: S3 files mapped to the partitions of successive RDDs and back to S3 files)
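To make this concrete, a quick way to see the file-to-partition mapping from PySpark is to check the partition count of a loaded DataFrame (a minimal sketch; the S3 path is a placeholder):

```python
df = spark.read.parquet('s3://my-bucket/path/')  # hypothetical input path

# One partition is processed by one task, so this is the upper bound
# on the parallelism of the next stage.
print(df.rdd.getNumPartitions())
```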
  11. 11. © 2022, Amazon Web Services, Inc. or its Affiliates. Components of Apache Spark Spark Core (RDD) DataFrame/Catalyst Optimizer Spark ML Structured Streaming GraphX Spark SQL
  12. 12. © 2021, Amazon Web Services, Inc. or its Affiliates. RDD and DataFrame RDD data image: [ [1, Bob, 24], [2, Alice, 48], [3, Ken, 10], … ] DataFrame data image: a table with columns (col1, col2, col3) and rows (1, Bob, 24), (2, Alice, 48), (3, Ken, 10), … With both interfaces, the code is written as if the data were a local list/table, but the actual data is distributed across multiple servers. DataFrame is a high-level API on top of RDDs, and processing described with a DataFrame is internally executed as RDD operations.
  13. 13. © 2021, Amazon Web Services, Inc. or its Affiliates. Lazy evaluation • There are two types of Spark operations: "transformations" and "actions" • When an "action" is executed, all the preceding processing necessary for that action is performed • A series of processes executed by an "action" is called a "job" • Note that a "job" here is different from a Glue job. >>> df1 = spark.read.csv(…) >>> df2 = spark.read.json(…) >>> df3 = df1.filter(…) >>> df4 = spark.read.csv(…) >>> df5 = df2.join(df3, …) >>> df5.count() ← Action. Up to this point, no actual processing has been done; at this point, the preceding processing is executed for the first time. The df4 step is not a dependency of df5.count(), so it is not executed in this action.
  14. 14. © 2022, Amazon Web Services, Inc. or its Affiliates. Examples of transformations and actions Transformations (data generation and processing): • select(): selecting columns • read: loading data • filter(): filtering data • groupBy(): aggregation by group • sort(): sorting data Actions (output processing results): • count(): counting records • write: writing to the file system • collect(): collecting all data on the driver node • show(): viewing a sample of the data • describe(): viewing data statistics
  15. 15. © 2022, Amazon Web Services, Inc. or its Affiliates. Spark Applications o A Spark application consists of multiple jobs. glueContext = GlueContext(SparkContext.getOrCreate(conf)) spark = glueContext.spark_session df1 = spark.read.json(...) df1.show() # job1 df1.filter(df1.col1='a').write.parquet(...) # job2 df1.filter(df1.col2='b').write.parquet(...) # job3 An application is a set of processes executed in a single GlueContext (or SparkContext).
  16. 16. © 2020, Amazon Web Services, Inc. or its Affiliates. Shuffle and Stage • Stages are divided at shuffle boundaries • Multiple tasks are processed concurrently within one stage df2 = df1.filter("price > 500").groupBy("item").sum().withColumn("bargain", F.col("sum(price)") * 0.8) (Diagram: Stage 1 partitions are shuffled into Stage 2 partitions; each partition is processed by one task)
  17. 17. © 2021, Amazon Web Services, Inc. or its Affiliates. Processing with and without shuffling (exchange of data between tasks) Example without shuffling: df2 = df1.filter("price > 500"); each row can be filtered within its own partition. Example with shuffling: df2 = df1.groupBy('item').sum(); rows with the same key must be gathered from different partitions before they can be aggregated.
  18. 18. © 2022, Amazon Web Services, Inc. or its Affiliates. Processing with and without shuffling (exchange of data between tasks) Processing without shuffling • read • filter() • withColumn() • UDF • coalesce() Processing with shuffling • join() • groupBy() • orderBy() • repartition()
  19. 19. © 2022, Amazon Web Services, Inc. or its Affiliates. Optimization with the Catalyst query optimizer • Processing described with DataFrames is converted into optimized RDDs by the optimizer before it is executed. df1 = spark.read.csv('s3://path/to/data1') df2 = spark.read.parquet('s3://path/to/data2') df3 = df1.join(df2, df1.col1 == df2.col1) df4 = df3.filter(df3.col2 == 'meat').select(df3.col3, df3.col5) df4.show() (Diagram: the logical plan joins data1 (10GB) and data2 (50GB) into 60GB and then filters down to 20GB; the optimized physical plan filters before joining; with storage-side optimization, predicate pushdown and column pruning shrink the scans themselves.)
  20. 20. © 2022, Amazon Web Services, Inc. or its Affiliates. Architecture of PySpark • Processing written with the PySpark DataFrame API is converted via Py4J and executed on the JVM. • UDF code written in Python is executed as Python by a Python worker for each task. (Diagram: the driver's Python process talks to the SparkContext through Py4J; each executor's tasks call out to Python workers.)
  21. 21. © 2022, Amazon Web Services, Inc. or its Affiliates. Introduction to AWS Glue specific features
  22. 22. © 2021, Amazon Web Services, Inc. or its Affiliates. Data Catalog • The Data Catalog holds the metadata (table names, column names, S3 paths, and so on) necessary to access data sources such as S3 and databases from Glue, Athena, Redshift Spectrum, etc. • There are three ways to create metadata in the data catalog: the crawler, the Glue API, and DDL (Athena/EMR/Redshift Spectrum). • Amazon DynamoDB, Amazon S3, Amazon Redshift, Amazon RDS, JDBC-connectable databases, Kinesis, HDFS, etc. can be specified as data sources. • No need to manage a metastore database. (Diagram: the crawler saves metadata to the data catalog; connected services such as Glue ETL, Athena, Redshift Spectrum, EMR, and Hive-compatible applications 1) access the metadata and 2) access the data.)
  23. 23. © 2022, Amazon Web Services, Inc. or its Affiliates. DynamicFrame An AWS Glue abstraction that absorbs the inherent complexity of ETL on raw data • As a component, it sits at the same layer as DataFrame; the two can be converted into each other (with the fromDF and toDF functions). • Leaves columns that may have multiple types to be resolved later (the Choice type). • A DynamicFrame refers to the entire dataset; a DynamicRecord refers to a single row of data. (Diagram: AWS Glue DynamicFrame sits alongside Spark DataFrame on top of Spark Core (RDDs); the data structure resembles a semi-structured table of records.)
  24. 24. © 2022, Amazon Web Services, Inc. or its Affiliates. Choice type A DynamicFrame can hold both types when multiple types are found in a column • The resolveChoice method can be used to resolve the types. Example: a deviceid column of choice type holds both long and string data (e.g. the long 1234 is mixed with the string "1234"). resolveChoice options: cast (cast to a single type), project (keep one type and discard values of the other), make_cols (keep each type in a separate column, e.g. deviceid_long and deviceid_string), make_struct (convert to a struct type holding both). With DataFrame, if more than one type is present, processing may be interrupted and have to be redone.
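For example, a choice-type column can be resolved with DynamicFrame's resolveChoice transform; a minimal sketch following the deviceid example above (the variable name dyf is assumed):

```python
# Cast every value in the choice column to a single type
resolved = dyf.resolveChoice(specs=[('deviceid', 'cast:long')])

# Or keep each type in its own column: deviceid_long / deviceid_string
split = dyf.resolveChoice(specs=[('deviceid', 'make_cols')])
```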
  25. 25. © 2022, Amazon Web Services, Inc. or its Affiliates. ETL processing that takes advantage of the characteristics of DynamicFrame and DataFrame • DynamicFrame is suited to ETL-style input/output; DataFrame is suited to table operations. • Perform data input/output (and the associated ETL processing) with DynamicFrame, and table manipulation such as JOIN with DataFrame. Using DynamicFrame for loading enables optimized loads via the AWS Glue catalog, differential (bookmark) loads, and semi-structured data handling with the Choice type. Use the toDF and fromDF functions for mutual conversion between DynamicFrame and DataFrame (no data copying is done; the conversion cost is within a few milliseconds).
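A minimal sketch of this round trip (database, table, and filter are illustrative placeholders):

```python
from awsglue.dynamicframe import DynamicFrame

# Load with DynamicFrame to benefit from catalog/bookmark features
dyf = glueContext.create_dynamic_frame.from_catalog(
    database='my_database', table_name='my_table')

df = dyf.toDF()                    # DynamicFrame -> DataFrame
df = df.filter(df.price > 500)     # table operations on the DataFrame

# DataFrame -> DynamicFrame for writing
out = DynamicFrame.fromDF(df, glueContext, 'out')
```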
  26. 26. © 2022, Amazon Web Services, Inc. or its Affiliates. Bookmark function A function to process only the delta data when performing steady-state ETL • Uses file timestamps to process only the data that was not processed by previous job runs, preventing duplicate processing and duplicate data. (Diagram: under s3://path/to/data, already-processed files are skipped and only unprocessed files are loaded.)
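Bookmarks track state per transformation_ctx and are committed at the end of the job; a minimal sketch of the required boilerplate, assuming the job bookmark option is enabled on the job (database and table names are placeholders):

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)   # resumes from the last bookmark state

# transformation_ctx is the key the bookmark state is stored under
dyf = glueContext.create_dynamic_frame.from_catalog(
    database='my_database', table_name='my_table',
    transformation_ctx='datasource0')

job.commit()  # advances the bookmark only when the job succeeds
```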
  27. 27. © 2022, Amazon Web Services, Inc. or its Affiliates. How to proceed with performance tuning of AWS Glue ETL
  28. 28. © 2022, Amazon Web Services, Inc. or its Affiliates. The Performance Tuning Cycle 1. Determine performance goals. 2. Measure metrics. 3. Identify bottlenecks. 4. Reduce the impact of the bottlenecks. 5. Repeat steps 2 to 4 until the goals are achieved. 6. Performance goals achieved.
  29. 29. © 2022, Amazon Web Services, Inc. or its Affiliates. Understand the characteristics of AWS Glue/Apache Spark • Distributed processing • There are tuning patterns, such as shuffling and data skew, that do not arise in single-process applications. • Lazy evaluation • Because Spark evaluates lazily, an error may be caused not by the API executed just before the error occurred, but by an API written earlier in the script. • Impact of the optimizer • Since Spark processing is optimized internally, it can be difficult to map what you see in the Spark UI back to the part of the script that produced it. You need to check multiple metrics to find the cause.
  30. 30. © 2022, Amazon Web Services, Inc. or its Affiliates. Spark parameters in AWS Glue • Spark can in principle be tuned via parameters at job execution time, but because AWS Glue is a serverless service, first tune according to AWS Glue's best practices before reaching for Spark's parameters. • Test thoroughly when changing the value of any Spark parameter.
  31. 31. © 2022, Amazon Web Services, Inc. or its Affiliates. Metrics to check • Spark UI (Spark History Server) • Shows the details of Spark processing • Executor and driver logs • Show the stdout/stderr logs of the executors and the driver • AWS Glue job metrics • Show the CPU, memory, and other resource usage of each executor and the driver node • Statistics obtained from the Spark API • Show samples and statistics of the intermediate data
  32. 32. © 2022, Amazon Web Services, Inc. or its Affiliates. Setting up a job to do the tuning In order to get the logs needed for tuning, you need to check the box to use Monitoring Options in the Add job screen.
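The same monitoring setup can also be applied as job parameters outside the console; a hedged sketch using boto3 (job name, role, and S3 paths are placeholders; --enable-spark-ui and --spark-event-logs-path are the Spark UI parameters, and the exact value expected by --enable-metrics may vary):

```python
import boto3

glue = boto3.client('glue')
glue.update_job(
    JobName='my-etl-job',  # hypothetical existing job
    JobUpdate={
        'Role': 'MyGlueServiceRole',
        'Command': {'Name': 'glueetl',
                    'ScriptLocation': 's3://my-bucket/scripts/job.py'},
        'DefaultArguments': {
            # write Spark event logs for the Spark History Server
            '--enable-spark-ui': 'true',
            '--spark-event-logs-path': 's3://my-bucket/sparkHistoryLogs/',
            # emit job metrics to CloudWatch
            '--enable-metrics': 'true',
        },
    })
```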
  33. 33. © 2022, Amazon Web Services, Inc. or its Affiliates. Spark UI
  34. 34. © 2022, Amazon Web Services, Inc. or its Affiliates. Spark History Server Spark event logs can be visualized by running the Spark History Server. There are several ways to launch it: • Using CloudFormation • Using Docker on a local PC • Downloading Apache Spark to your local PC or EC2 and starting the Spark History Server • Using an EMR cluster https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html
  35. 35. © 2022, Amazon Web Services, Inc. or its Affiliates. Example of launching with Docker o Once the Docker container has started, access http://localhost:18080 in your browser $ docker build -t glue/sparkui:latest . $ docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_eventlog -Dspark.hadoop.fs.s3a.access.key=AWS_ACCESS_KEY_ID -Dspark.hadoop.fs.s3a.secret.key=AWS_SECRET_ACCESS_KEY" -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer" (Specify the S3 path of the Spark history logs in spark.history.fs.logDirectory.)
  36. 36. © 2022, Amazon Web Services, Inc. or its Affiliates. Spark History Server Check the duration of each application, and click an application to check its details.
  37. 37. © 2022, Amazon Web Services, Inc. or its Affiliates. List of jobs executed by the application (Completed Jobs / Failed Jobs). Check for jobs that take a long time.
  38. 38. © 2022, Amazon Web Services, Inc. or its Affiliates. Checking the contents of a job Identify stages that are failing or taking a long time.
  39. 39. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the contents of the stage. If the bar lengths in the event timeline differ between tasks, skew is occurring and the work isn't sufficiently distributed. Also check the data size processed in the stage.
  40. 40. © 2022, Amazon Web Services, Inc. or its Affiliates. Checking the contents of the stage (continued) Check the details of what is taking so long, including the task time for each executor. If there is spill to disk, choosing a worker type with more memory can solve it. In addition to the Event Timeline, data skew can be seen in the Summary Metrics and Tasks sections.
  41. 41. © 2022, Amazon Web Services, Inc. or its Affiliates. View details of failing tasks Click on details to learn more, and view the log of the executor where the failure occurred. For AWS Glue ETL, note the executor ID and look it up in the CloudWatch Log groups.
  42. 42. © 2022, Amazon Web Services, Inc. or its Affiliates. Environmental settings for Spark application runtime o Spark options, dependencies, and so on.
  43. 43. © 2022, Amazon Web Services, Inc. or its Affiliates. List of Driver and Executor nodes If the cluster is running and accessible from the History Server, the logs of each Driver/Executor can be seen.
  44. 44. © 2022, Amazon Web Services, Inc. or its Affiliates. Checking the Spark SQL Query Execution Plan You can see the actual execution plan, which is more accurate than the explain API.
  45. 45. © 2022, Amazon Web Services, Inc. or its Affiliates. Executor and Driver logs
  46. 46. © 2022, Amazon Web Services, Inc. or its Affiliates. Check Log groups in CloudWatch Executor Logs Driver Log
  47. 47. © 2022, Amazon Web Services, Inc. or its Affiliates. AWS Glue Job Metrics
  48. 48. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the resource usage of Executor and Driver You can also create a Dashboard for CloudWatch and add other metrics.
  49. 49. © 2022, Amazon Web Services, Inc. or its Affiliates. Spark API
  50. 50. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of the data during processing with commands o Use the following commands to inspect the intermediate data during processing and inform your tuning strategy. o Note that the ones that trigger actions will slow the job down if used too many times. • count() • printSchema() • show() • describe([cols*]).show() • explain() • df.agg(approx_count_distinct(df.col))
  51. 51. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of data during processing with APIs df.count() o Check the number of records. o Use df.groupBy('col_name').count() to check for skew.
  52. 52. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of data during processing with APIs df.printSchema() o Check the schema information of the DataFrame.
  53. 53. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of data during processing with APIs df.show() o The number of records to be displayed can be specified by using df.show(5).
  54. 54. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of data during processing with APIs df.describe([cols*]).show() o The statistics for each column can be seen.
  55. 55. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of the data during processing with APIs df.explain() o Check the execution plan that the optimizer created.
  56. 56. © 2022, Amazon Web Services, Inc. or its Affiliates. Check the trend of the data during processing with APIs df.agg(approx_count_distinct(df.col)) o Check the cardinality of a column o Fast because HyperLogLog is used
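For example (the column name is illustrative):

```python
from pyspark.sql.functions import approx_count_distinct

# HyperLogLog-based estimate; far cheaper than an exact countDistinct
df.agg(approx_count_distinct(df.device_id).alias('approx_devices')).show()
```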
  57. 57. © 2022, Amazon Web Services, Inc. or its Affiliates. AWS Glue ETL Performance Tuning Pattern
  58. 58. © 2022, Amazon Web Services, Inc. or its Affiliates. Basic strategy for tuning AWS Glue ETL jobs • Use the new version • Reduce the data I/O load. • Minimize shuffling. • Speed up per-task processing. • Parallelize
  59. 59. © 2022, Amazon Web Services, Inc. or its Affiliates. Use the new version
  60. 60. © 2022, Amazon Web Services, Inc. or its Affiliates. Use the new version o When the Spark application is slow, simply replacing the job execution environment with the latest version may speed up the process. o Both AWS Glue and Apache Spark are evolving in every way, not just in performance. Use the newest version possible. https://aws.amazon.com/jp/blogs/big-data/introducing-aws-glue-3-0-with-optimized-apache-spark-3-1-runtime-for-faster-data-integration/
  61. 61. © 2022, Amazon Web Services, Inc. or its Affiliates. Using AWS Glue 2.0 and 3.0 to reduce startup time Dramatically reduced the time required to launch AWS Glue ETL jobs • Cold start used to take about 8.5 minutes in AWS Glue 0.9/1.0. • In AWS Glue 2.0, it takes less than 1 minute. (Diagram: start-up time vs. execution time for AWS Glue 1.0 and 2.0)
  62. 62. © 2021, Amazon Web Services, Inc. or its Affiliates. AWS Glue 2.0+ integrated scheduling and provisioning 1. Run AWS Glue job 2. Submit job to virtual cluster 3. Spark schedules tasks to executors 4. Dynamically grow virtual clusters 5. Spark utilizes new executors (across AZ1/AZ2) • The job starts when the first executor is ready: reduced start time and reduced start variance • Jobs can run on reduced capacity: graceful degradation
  63. 63. © 2022, Amazon Web Services, Inc. or its Affiliates. Minimize data I/O
  64. 64. © 2022, Amazon Web Services, Inc. or its Affiliates. How to minimize the data I/O load • Read only the data you need. • Control the amount of data read in one task. • Choose the right compression format.
  65. 65. © 2022, Amazon Web Services, Inc. or its Affiliates. Using Apache Parquet • A column-oriented format whose data arrangement suits analytical workloads • Data types are preserved • Compresses effectively • Skips unnecessary data during aggregation by using metadata • Integration is in place so the Spark engine can use Apache Parquet efficiently https://parquet.apache.org/documentation/latest/
  66. 66. © 2022, Amazon Web Services, Inc. or its Affiliates. Partition Filtering and Filter Pushdown Reduce the amount of data to be read • Partition Filtering • The ability to read only the files in the partitions matched by the filter or where clause. • Available in Text/CSV/JSON/ORC/Parquet • Filter Pushdown • The ability to read only the blocks that can satisfy the filter or where clause, for columns that are not partition columns. • AWS Glue applies this automatically when Parquet is used.
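On the DataFrame side, both optimizations are triggered simply by filtering on the relevant columns before any action; a minimal sketch (paths and column roles are assumptions):

```python
df = spark.read.parquet('s3://my-bucket/sales/')

# Partition filtering: if 'year' is a partition column, only
# s3://my-bucket/sales/year=2022/ is listed and read.
# Filter pushdown: 'price' is a regular column, so Parquet row groups
# whose min/max statistics rule out price > 500 are skipped.
df_filtered = df.filter((df.year == '2022') & (df.price > 500))
```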
  67. 67. © 2022, Amazon Web Services, Inc. or its Affiliates. Partition Filtering and Filter Pushdown (Diagram: partition filtering prunes whole partition directories; filter pushdown skips blocks within the files that are read.)
  68. 68. © 2022, Amazon Web Services, Inc. or its Affiliates. Partition Filtering and Filter Pushdown Partition filtering can be used once a partition directory structure exists; for DataFrame and DynamicFrame writes, partition directories can be created with the partitionBy option as follows: df.write.parquet(path=path, partitionBy='col_name') Placing the columns used most frequently in filter clauses at higher levels of the partition hierarchy may be more efficient.
  69. 69. © 2022, Amazon Web Services, Inc. or its Affiliates. Using push_down_predicate in DynamicFrame Read only the files included in the partition where the data specified in the filter or where clause is stored when reading a DynamicFrame from the AWS Glue data catalog. partitionPredicate ="(product_category == 'Video')" datasource = glue_context.create_dynamic_frame.from_catalog( database = "githubarchive_month", table_name = "data", push_down_predicate = partitionPredicate)
  70. 70. © 2022, Amazon Web Services, Inc. or its Affiliates. Choose a compression codec based on your workload • The compression codec can be selected when writing data, e.g. df.write.csv("path/to/csv", compression="gzip") • There is a trade-off between compression ratio and compression/decompression speed • gzip: extension .gz, high compression, medium speed, medium CPU usage, not splittable(*) • bzip2: .bz2, highest compression, slow, high CPU usage, splittable • lzo: .lzo, average compression, fast, low CPU usage, splittable if indexed • snappy: .snappy, average compression, fast, low CPU usage, not splittable as a standalone file (but effectively splittable inside container formats such as Parquet) • Uncompressed files need no compression/decompression time, but data transfer cost may become the bottleneck • If processing speed matters most, choose snappy or lzo (*) Parquet files compressed with gzip can still be split.
  71. 71. © 2022, Amazon Web Services, Inc. or its Affiliates. Store data in appropriate file sizes • A read/write task is basically tied to a single file (if the file is splittable, one file can be split across multiple tasks) • The recommended file size for AWS Glue is 128MB-512MB • When files are too small: overhead from a large number of small tasks • When one file holds large unsplittable data: the whole file must fit in memory on a single node and there is no distributed processing, leaving the other executors idle
  72. 72. © 2022, Amazon Web Services, Inc. or its Affiliates. Using Bounded Execution with DynamicFrame When there is a lot of unprocessed data, Bounded Execution can be combined with job bookmarks to split the work instead of reading all of the unprocessed data at once. glueContext.create_dynamic_frame.from_catalog( database = "database", table_name = "table_name", redshift_tmp_dir = "", transformation_ctx = "datasource0", additional_options = { "boundedFiles": "500", # must be a string # "boundedSize": "1000000000" # unit is bytes })
  73. 73. © 2022, Amazon Web Services, Inc. or its Affiliates. Using DynamicFrame's groupFiles and groupSize • Eliminate overhead by reading small files together in a single task. • Useful for processing data that is output every few minutes by Kinesis Data Firehose. • Use the groupFiles option to group the files within an S3 partition, and the groupSize option to specify the size of each group to be read. df = glueContext.create_dynamic_frame_from_options( 's3', {'paths': ['s3://s3path/'], 'recurse': True, 'groupFiles': 'inPartition', 'groupSize': '1048576'}, format='json') note: groupFiles is supported for DynamicFrames created from the following data formats: csv, ion, grokLog, json, and xml. This option is not supported for avro, parquet, and orc.
  74. 74. © 2022, Amazon Web Services, Inc. or its Affiliates. Number of files and processing time for DataFrame and DynamicFrame
  75. 75. © 2022, Amazon Web Services, Inc. or its Affiliates. Using DynamicFrame's S3ListImplementation • If there are many small files, the large number of tasks can cause a driver OOM. • When useS3ListImplementation is True, S3 list results are read and processed in batches of 1,000 at a time, which prevents the driver memory from being overloaded by S3 listing. datasource = glue_context.create_dynamic_frame.from_catalog( database = "my_database", table_name = "my_table", push_down_predicate = partitionPredicate, additional_options = {"useS3ListImplementation": True})
  76. 76. © 2022, Amazon Web Services, Inc. or its Affiliates. Set a Partition Index When reading from the AWS Glue catalog against a data source with many partitions built from multiple partition keys, setting a Partition Index reduces the time needed to fetch the matching partitions when there are filter or where clauses on those partition columns. https://aws.amazon.com/jp/blogs/big-data/improve-query-performance-using-aws-glue-partition-indexes/
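A partition index can be added to an existing catalog table through the Glue API; a hedged boto3 sketch (names are placeholders, and Keys must be a subset of the table's partition keys):

```python
import boto3

glue = boto3.client('glue')
glue.create_partition_index(
    DatabaseName='my_database',
    TableName='my_table',
    PartitionIndex={
        'Keys': ['year', 'month'],   # must be existing partition keys
        'IndexName': 'year_month_idx',
    })
```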
  77. 77. © 2022, Amazon Web Services, Inc. or its Affiliates. The difference in query planning time with and without a Partition Index https://aws.amazon.com/jp/blogs/big-data/improve-query-performance-using-aws-glue-partition-indexes/
  78. 78. © 2022, Amazon Web Services, Inc. or its Affiliates. Parallel data reading over DataFrame JDBC connections • spark.read.jdbc() only uses one executor to access the target database by default. • For parallel reading, the partition column, lowerBound, upperBound, and numPartitions must be specified. The partition column must be numeric, date, or timestamp. df = spark.read.jdbc( url=jdbcUrl, table="sample", column="col1", # partition column lowerBound=1, upperBound=100000, numPartitions=100, properties={"fetchsize": "1000", **connectionProperties})
  79. 79. © 2022, Amazon Web Services, Inc. or its Affiliates. Parallel data reading over DynamicFrame JDBC connections o To read data in parallel from a JDBC connection as a DynamicFrame, specify hashfield/hashexpression. o With hashfield, strings and other non-numeric columns can also be used as the partition column. glueContext.create_dynamic_frame.from_catalog( database = "my_database", table_name = "my_table_name", transformation_ctx = "my_transformation_context", additional_options = { 'hashfield': 'customer_name', 'hashpartitions': '5' }) https://docs.aws.amazon.com/glue/latest/dg/run-jdbc-parallel-read-job.html
  80. 80. © 2022, Amazon Web Services, Inc. or its Affiliates. Minimize shuffling
  81. 81. © 2022, Amazon Web Services, Inc. or its Affiliates. Minimize shuffling • Make good use of cache • Perform filter processing in the first stage as much as possible. • Devise the order of joins to keep the data small. • Optimize join strategy • Remove data skew • Use Window processing instead of data processing by self join
  82. 82. © 2022, Amazon Web Services, Inc. or its Affiliates. Minimize shuffling Processing described with DataFrames is optimized by the Catalyst optimizer, but the optimizer is not perfect: • If there is a cache() in the middle, optimization across the cache boundary is not applied. • Spark 2.4.3, used in AWS Glue 1.0 and 2.0, disables the cost-based optimizer by default. Shuffling can therefore be reduced by manually adjusting the order and strategy of filters and joins.
  83. 83. © 2022, Amazon Web Services, Inc. or its Affiliates. Make good use of cache • When branching the processing of a single DataFrame to perform multiple outputs, you can prevent recomputation by inserting cache() just before the branch. • Note that it may sometimes be faster not to cache, and heavy use of cache consumes memory and local disk space. (Diagram: without cache(), the process that builds df2 is executed twice, once per branch; with cache(), it is executed only once.)
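A minimal sketch of the branch pattern described above (filters and output paths are illustrative):

```python
df2 = df1.filter(df1.status == 'valid').cache()  # materialized on first action

df2.filter(df2.col1 == 'a').write.parquet('s3://my-bucket/out_a/')  # computes and caches df2
df2.filter(df2.col2 == 'b').write.parquet('s3://my-bucket/out_b/')  # reuses the cached df2

df2.unpersist()  # release memory/disk once both outputs are written
```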
  84. 84. © 2022, Amazon Web Services, Inc. or its Affiliates. Make good use of cache • By default, cached data is stored in the memory allocated for caching, and whatever does not fit in memory is stored on local disk. • Users can instead choose memory-only or disk-only storage. Example of caching in memory only: df.persist(StorageLevel.MEMORY_ONLY) (note that cache() takes no arguments; persist() accepts a storage level)
  85. 85. © 2022, Amazon Web Services, Inc. or its Affiliates. Delete cache that is no longer in use • A cached Dataframe will continue to occupy memory and local disk space. • Save memory and disk space by deleting the Dataframe cache when it is no longer needed. df.unpersist()
  86. 86. © 2022, Amazon Web Services, Inc. or its Affiliates. Perform filter processing in the first stage as much as possible The filter process can be placed before cache() to reduce the amount of data during processing.
  87. 87. © 2022, Amazon Web Services, Inc. or its Affiliates. Work out the order of joins • The end result is the same, but the size of the intermediate DataFrames differs. • Example: joining df1 (4000 rows) with df2 (1000 rows) first produces a 4000-row intermediate result before joining df3 (10 rows); joining df1 with df3 first produces only a 10-row intermediate result before joining df2. • In Glue 3.0 (Spark 3.1.1), the cost-based optimizer takes the amount of data into account and optimizes the join order.
  88. 88. © 2022, Amazon Web Services, Inc. or its Affiliates. Using join in different ways Sort Merge Join • Distribute the two tables to be joined by their respective keys, sort them, and then join them. • Suitable for joining large tables together. Broadcast Join • Transfer one table to all Executors, and distribute the other table to all Executors and join them. • Suitable for when one table is smaller than the other. Shuffle Hash Join • Distribute the two tables to be joined and join them without sorting. • Suitable for joins between tables that are not so large.
  89. 89. © 2022, Amazon Web Services, Inc. or its Affiliates. Using join in different ways • By default, if a table's size is at most the value of spark.sql.autoBroadcastJoinThreshold (default 10MB), a broadcast join is used. • The join strategy in use can be seen in the Spark UI or with explain(). • If join performance is the bottleneck, changing the join strategy manually may improve it. df1.join(broadcast(df2), df1['col1'].eqNullSafe(df2['col1'])).explain() == Physical Plan == BroadcastHashJoin [coalesce(col1#6, )], [coalesce(col1#21, )], Inner, BuildRight, (col1#6 <=> col1#21) :- LocalTableScan [first_name#5, col1#6] +- BroadcastExchange HashedRelationBroadcastMode(List(coalesce(input[0, string, true], ))) +- LocalTableScan [col1#21, col2#22, population#23] https://spark.apache.org/docs/2.4.0/sql-performance-tuning.html#broadcast-hint-for-sql-queries
  90. 90. © 2021, Amazon Web Services, Inc. or its Affiliates. coalesce Partitions may become fragmented during processing, for example when: • loading a large number of small files • performing groupBy on columns with high cardinality In such cases it is better to merge partitions before the next step to reduce its overhead. Since repartition involves a shuffle, coalesce may be preferable; however, because coalesce simply merges neighboring partitions, the resulting data may be unbalanced. Compare df.repartition(2) with df.coalesce(2). Glue 3.0 has a new feature called Adaptive Query Execution that automatically optimizes the number of partitions by coalescing.
  91. 91. © 2021, Amazon Web Services, Inc. or its Affiliates. Use window processing instead of self-join for data aggregation o When aggregate data is created from a single log dataset and then joined back to it, the cost of the join can be eliminated by using window processing instead. df_agg = df.groupBy('gender', 'age').agg( F.mean('height').alias('avg_height'), F.mean('weight').alias('avg_weight')) df = df.join(df_agg, on=['gender', 'age']) versus w = Window.partitionBy('gender', 'age') df = df.withColumn('avg_height', F.mean(F.col('height')).over(w)).withColumn('avg_weight', F.mean(F.col('weight')).over(w))
  92. 92. © 2022, Amazon Web Services, Inc. or its Affiliates. Speed up per-task processing
  93. 93. © 2022, Amazon Web Services, Inc. or its Affiliates. Using Scala Most DataFrame operations can be written in PySpark; they are converted internally and run on the JVM. However, the following are slower from Python, and if they are the bottleneck, Scala will speed things up: • Code written directly against RDDs • Processing written on RDDs is not optimized by the optimizer. • Code that uses UDFs • Discussed on the next slides.
  94. 94. © 2022, Amazon Web Services, Inc. or its Affiliates. Avoid UDFs in PySpark Performance issues: • Serialization and piping of each batch of rows to the Python process occurs for every invocation • The memory of the Python process is not managed by the JVM (Diagram: the JVM physical operator serializes row batches, pipes them to a Python worker that invokes the UDF, and deserializes the results.)
  95. 95. © 2022, Amazon Web Services, Inc. or its Affiliates. Using Pandas UDFs over Python UDFs Python UDF • Serialization/deserialization is done by pickling • Data is fetched block by block, but UDF processing is performed row by row Pandas UDF • Serialization/deserialization is done by Apache Arrow • Both data fetch and UDF processing are performed on multiple rows at once
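A minimal Pandas UDF sketch in the Spark 3 type-hint style available on Glue 3.0; on Glue 2.0 (Spark 2.4) the older PandasUDFType.SCALAR form is needed (column names are illustrative):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('double')
def add_tax(price: pd.Series) -> pd.Series:
    # runs on whole Arrow batches rather than row by row
    return price * 1.1

df = df.withColumn('price_with_tax', add_tax('price'))
```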
  96. 96. © 2022, Amazon Web Services, Inc. or its Affiliates. Performance differences between Python UDF, Pandas UDF, and the Spark API in AWS Glue ETL (Chart: execution time in seconds for each approach.)
  97. 97. © 2022, Amazon Web Services, Inc. or its Affiliates. Parallelize
  98. 98. © 2022, Amazon Web Services, Inc. or its Affiliates. Dealing with skewness in data If data volume varies between partitions, the load concentrates on the few tasks that process the large partitions, causing processing delays. When does it happen? • When the sizes of the files to be read are uneven • When joining data whose record counts differ greatly per join key • When the number of records per key varies in df.groupBy()
  99. 99. © 2022, Amazon Web Services, Inc. or its Affiliates. Addressing data skew • Ensure that file sizes are uniform when creating input files • Repartition • Broadcast join • Salting
  100. 100. © 2022, Amazon Web Services, Inc. or its Affiliates. Dealing with skewness: repartition If the subsequent processing is not key-by-key (e.g. partitioning and storing data by date, or window processing by key), repartitioning resolves the skew. df.repartition(200) (Diagram: 3 skewed partitions redistributed into 200 even partitions.)
  101. 101. © 2021, Amazon Web Services, Inc. or its Affiliates. Dealing with skewness: broadcast join If one DataFrame is small enough for all of its data to fit on a single executor, and the other is huge with a skewed join-key column, using broadcast join as the join strategy can improve processing performance. (Diagram: with sort merge join, the 1000-line skewed key partition must be shuffled; with broadcast join, the small table is sent to every executor and the skewed partitions stay in place.)
  102. 102. © 2022, Amazon Web Services, Inc. or its Affiliates. Dealing with skewness: salting For a join between two sufficiently large datasets where one side is skewed, "salt" can be added to the keys to eliminate the load imbalance. (Diagram: the Table B partition corresponding to Table A's skewed partition is cloned so the skewed key can be split across tasks.)
  103. 103. © 2022, Amazon Web Services, Inc. or its Affiliates. Dealing with skewness: salting o In Glue 2.0 (Spark 2.4.3), users need to write the salting code manually. o In Glue 3.0 (Spark 3.1.1), a new feature called Adaptive Query Execution can dynamically handle skew joins. # Salt the skewed column df_big = df_big.withColumn('shop_salted', F.concat(df_big['shop'], F.lit('_'), (F.floor(F.rand() * numPartition) + 1).cast('string'))) # Explode the matching column on the other side df_medium = df_medium.withColumn('salt', F.explode(F.array([F.lit(i) for i in range(1, numPartition + 1)]))) df_medium = df_medium.withColumn('shop_exploded', F.concat(df_medium['shop'], F.lit('_'), df_medium['salt'].cast('string'))) # Join on the salted keys df_join = df_big.join(df_medium, df_big['shop_salted'] == df_medium['shop_exploded']) https://spark.apache.org/docs/latest/sql-performance-tuning.html#optimizing-skew-join
  104. 104. © 2022, Amazon Web Services, Inc. or its Affiliates. Assigning incremental IDs with performance in mind • If the window function row_number() is used to assign contiguous incremental IDs to all records, processing is slow because all records must be aggregated. • monotonically_increasing_id() assigns increasing IDs without aggregation by allowing them to be non-contiguous across partitions. (Example: row_number gives 1, 2, 3, 4, 5, …; monotonically_increasing_id may give 1, 2, 3, 6, 7, 8, …) df = df.withColumn('id', F.row_number().over(Window.orderBy('col2'))) df = df.withColumn('id', F.monotonically_increasing_id())
  105. 105. © 2022, Amazon Web Services, Inc. or its Affiliates. Selecting a worker type for AWS Glue • The processing power allocated at job execution time is measured in DPUs (Data Processing Units) • 1 DPU = 4 vCPU, 16GB memory • Each worker type has a different resource capacity and configuration: Standard = 1 DPU per worker, 2 executors per worker, 16GB memory per worker, 50GB disk per worker; G.1X = 1 DPU, 1 executor, 16GB memory, 64GB disk; G.2X = 2 DPUs, 1 executor, 32GB memory, 128GB disk.
  106. 106. © 2022, Amazon Web Services, Inc. or its Affiliates. Ideal resource usage • Ideally, resources are used evenly and without waste by all executors; if not, there is likely room for tuning. • Select the initial worker type based on a prediction of resource usage from the processing content. • Examples: CPU usage is likely to be high when there are complex UDFs and similar operations; memory usage is likely to be high when the shuffle size becomes large, such as when joining large amounts of data.
  107. 107. © 2022, Amazon Web Services, Inc. or its Affiliates. Trade-off between the number of workers and job execution time As long as there is enough parallelism for resources to be used effectively, job execution time can be reduced without increasing cost by increasing the number of workers. (Charts: AWS Glue ETL job execution time versus number of workers.)
  108. 108. © 2022, Amazon Web Services, Inc. or its Affiliates. Summary • Introduced tuning patterns for AWS Glue Spark ETL jobs. • AWS Glue can process large amounts of data with high performance as-is, but it can be tuned for even higher performance and scalability. • Tuning requires checking metrics to identify bottlenecks and eliminating their causes.
