Query your data in S3 with SQL and optimize for cost and performance

Streaming services allow you to ingest and analyze events continuously in real time. One of Big Data's principles is to store raw data as long as possible - to be able to answer future questions. If the data is permanently stored in Amazon Simple Storage Service (S3), it can be queried at any time with Amazon Athena without spinning up a database.
This session shows step by step how to structure the data so that both cost and response time drop when querying with Athena. It compares the details and effects of compression, partitioning, and columnar storage formats. Finally, AWS Glue is used as a fully managed Extract, Transform, Load (ETL) service to derive optimized views from the raw data for frequently issued queries.

Published in: Technology

  1. Query your data in S3 with SQL and optimize for cost and performance
     Steffen Grunwald, AWS Solutions Architect, @steffeng
     AWS Pop-up Loft Berlin, 17 October 2018
     © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. What you will learn from this Session
     • Benefits of raw Data in Amazon Simple Storage Service
     • Query on S3 with Amazon Athena
     • Optimize your Data Structure
         • Compression
         • Partitioning
         • Columnar Formats
     • Derive Views from raw Data for frequent Queries
  3. Example Application: New York Taxi Data Ingestion
     Diagram components: Amazon Kinesis Streams, Amazon Kinesis Analytics, Amazon Kinesis Streams, AWS Lambda, Amazon CloudWatch, Amazon Kinesis Firehose, Amazon QuickSight, AWS Glue, Amazon S3, Amazon Athena, Instance
  4. Benefits of raw Data in Amazon Simple Storage Service (S3)
     • Highly durable and cost-effective object store
     • Limitlessly scalable
     • Pay for what you use, in GB per month
     • Decouple storage from compute
     • Widely supported API by many consumers
     • Well integrated into other AWS services
     Use S3 as long-term storage to answer the yet unknown questions of tomorrow.
  5. Ingest Data with Amazon Kinesis Firehose
     • Stores a stream of records as files in a bucket
     • Path: <Optional Prefix> + "YYYY/MM/DD/HH" (Ingestion Time, UTC)
     • Optionally compress (GZIP, ZIP, Snappy)
     • Optionally store as columnar format (ORC, Parquet)
     • Optionally transform records with AWS Lambda
     Diagram: Amazon Kinesis Firehose, Amazon S3 Bucket
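The path rule on slide 5 can be sketched in a few lines of Python. This is only an illustration of the naming scheme described on the slide, not Firehose's own code; the function name, the `raw/` prefix, and the trailing slash are assumptions made for the example.

```python
from datetime import datetime, timezone


def firehose_prefix(ingested_at: datetime, optional_prefix: str = "") -> str:
    """Return '<optional prefix>YYYY/MM/DD/HH/' for an ingestion time.

    The timestamp must be timezone-aware; it is converted to UTC first,
    because the slide states that the path uses ingestion time in UTC.
    """
    utc = ingested_at.astimezone(timezone.utc)
    return f"{optional_prefix}{utc:%Y/%m/%d/%H}/"


# Hypothetical prefix "raw/" and an arbitrary ingestion time.
example = datetime(2018, 9, 2, 10, 15, tzinfo=timezone.utc)
print(firehose_prefix(example, "raw/"))
```

Note that the path reflects when a record was ingested, not the event time inside the record, which matters later when filtering by partition.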
  6. Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using Standard SQL
  7. Query Data Directly from Amazon S3
     • No loading of data
     • Query data in its raw format
     • No Extract, Transform, and Load (ETL) required
     • Stream data directly from Amazon S3
  8. Presto SQL
     • ANSI SQL compliant
     • Complex joins, nested queries & window functions
     • Complex data types (arrays, structs, maps)
     • Partitioning of data by any key
         • date, time, custom keys
     • Presto built-in functions
  9. Amazon Athena Supports Multiple Data Formats
     • Text files, e.g., CSV, raw logs
     • Apache Web Logs, TSV files
     • JSON (simple, nested)
     • Compressed files
     • Columnar formats such as Parquet & ORC
     • AVRO support
  10. Amazon Athena is Cost Effective
     • Pay per query
     • $5 per TB scanned from S3
     • DDL Queries and failed queries are free
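To make the pricing on slide 10 concrete, the following sketch estimates the cost of a query from the amount of data scanned. It uses the $5 per TB figure from the slide and the scan volumes reported for Example Queries III, IV, and V later in the deck. Two things are assumptions: a terabyte is treated as 1024^4 bytes, and any rounding or per-query minimum that the actual pricing may apply is ignored.

```python
PRICE_PER_TB_USD = 5.0      # price quoted on the slide
BYTES_PER_TB = 1024 ** 4    # assumption: binary terabyte
GB = 1024 ** 3
MB = 1024 ** 2


def query_cost_usd(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Estimate the cost of one query, ignoring rounding and minimum charges."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


# Scan volumes as reported on the example query slides.
scans = {
    "Query III (gzip, no partition filter)": 76.5 * GB,
    "Query IV (gzip, partition filter)": 1.77 * GB,
    "Query V (parquet, partition filter)": 300.7 * MB,
}
for name, scanned in scans.items():
    print(f"{name}: ${query_cost_usd(scanned):.4f}")
```

Because cost is proportional to bytes scanned, every technique in the deck that reduces scanned data also reduces the bill.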
  11. Demo: Query files from Amazon Kinesis Firehose with Amazon Athena and AWS Glue
  12. The Example Data
     • NYC Taxi & Limousine Commission rides
     • Data is generated by kinesis-taxi-stream-producer, available at [1]:
       java -jar kinesis-taxi-stream-producer.jar -speedup 400 -statisticsFrequency 10000 -stream nyctlc-ingestion -noWatermark -region eu-central-1 -adaptTime ingestion
     • ~2GB/h of raw data, 11 days, 487 GB total
     [1] https://github.com/aws-samples/flink-stream-processing-refarch
  13. Test Setup: Ingesting Data with different Settings
     Diagram: Instance, Amazon Kinesis Streams, Firehose (raw), Firehose (gzip), Firehose (orc), Firehose (parquet), Amazon S3
     (max Amazon Kinesis Firehose buffering hints: 128MB & 900s)
  14. Photo by Glen Noble on Unsplash
  15. Example Query I
     Show some rides on 2nd September 10-11h:
       SELECT *
       FROM "128mb"
       WHERE pickup_datetime BETWEEN '2018-09-02T10' AND '2018-09-02T11'
       LIMIT 10
     Run time: 3.53 seconds, Data scanned: 4.62GB
  16. Example Query II (gzip)
     Show some rides on 2nd September 10-11h:
       SELECT *
       FROM "128mbgz"
       WHERE pickup_datetime BETWEEN '2018-09-02T10' AND '2018-09-02T11'
       LIMIT 10
     Run time: 3.53 seconds, Data scanned: 4.62GB
     Run time: 2.45 seconds, Data scanned: 303.04KB
     gzip reduces 487GB to 76GB.
  17. Example Query III (without LIMIT 10)
     What was the distribution of passenger load on 2nd September 10-11h?
       SELECT passenger_count, count(*) count
       FROM "128mbgz"
       WHERE pickup_datetime BETWEEN '2018-09-02T10' AND '2018-09-02T11'
       GROUP BY passenger_count
     Run time: 50.36 seconds, Data scanned: 76.5GB
  18. Photo by Tang Junwen on Unsplash
  19. Partitions to the Rescue
     AWS Glue crawler adds partitions based on file prefixes/dirs
  20. Example Query IV
     What was the distribution of passenger load on 2nd September 10-11h?
       SELECT passenger_count, count(*) count
       FROM "128mbgz"
       WHERE pickup_datetime BETWEEN '2018-09-02T10' AND '2018-09-02T11'
         AND partition_0 || partition_1 || partition_2 || partition_3 BETWEEN '2018090210' AND '2018090215'
       GROUP BY passenger_count
     Run time: 27.6 seconds, Data scanned: 25.5GB
     Run time: 5.59 seconds, Data scanned: 1.77GB
  21. Crawl Partitions with AWS Glue
     Diagram: Log, S3, Glue, Data Catalog, Athena; steps: Crawl data, Create table partitions, Schema Lookup, Query data
     Why? Just schedule the crawler, no need to code! Deals with schema evolution.
  22. Use Hive-style File Format in S3
     Move/copy: YYYY/MM/DD/HH/file to year=YYYY/month=MM/day=DD/hours=HH/file
     Make Athena reload partitions by: msck repair table
     Why? Format easy to create on write, easy to move.
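The key rewrite on slide 22 can be sketched as a pure string transformation. The sketch below only computes the new object key; actually copying or moving objects in S3 is out of scope here. The function name and the `raw/` prefix in the example are assumptions for illustration.

```python
import re

# Matches '<prefix>YYYY/MM/DD/HH/<file>' as written by Firehose.
_KEY = re.compile(
    r"^(?P<prefix>.*?)(?P<y>\d{4})/(?P<m>\d{2})/(?P<d>\d{2})/(?P<h>\d{2})/(?P<file>[^/]+)$"
)


def to_hive_style(key: str) -> str:
    """Rewrite 'YYYY/MM/DD/HH/file' into 'year=.../month=.../day=.../hours=.../file'."""
    m = _KEY.match(key)
    if m is None:
        raise ValueError(f"key does not follow the YYYY/MM/DD/HH/file layout: {key}")
    return (
        f"{m['prefix']}year={m['y']}/month={m['m']}/"
        f"day={m['d']}/hours={m['h']}/{m['file']}"
    )


print(to_hive_style("raw/2018/09/02/10/part-0001.gz"))
```

With keys in this form, the partition columns get meaningful names (year, month, day, hours) instead of partition_0 to partition_3.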
  23. Creating Partitions with AWS Lambda
     Diagram: Log, S3, Lambda, Data Catalog, Athena; steps: New File Trigger, Add table partition, Schema Lookup, Query data
     Why? Add partitions instantly, just AWS Lambda cost.
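A minimal sketch of the Lambda approach on slide 23, assuming the function is triggered by S3 "new file" events and that the table is partitioned by year, month, day, and hour. The table name, the partition column names, and the `run_query` callback are hypothetical; in a real function the callback would submit the statement to Athena, which is deliberately left out so the sketch stays self-contained.

```python
import re
from urllib.parse import unquote_plus

TABLE = "mytable"  # hypothetical table name
_HOUR_PATH = re.compile(r"^(.*?)(\d{4})/(\d{2})/(\d{2})/(\d{2})/[^/]+$")


def partition_statement(bucket, key, table=TABLE):
    """Build an ADD PARTITION statement for the hour folder of a new object."""
    m = _HOUR_PATH.match(key)
    if m is None:
        return None  # object is not inside a YYYY/MM/DD/HH folder
    prefix, year, month, day, hour = m.groups()
    location = f"s3://{bucket}/{prefix}{year}/{month}/{day}/{hour}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}',month='{month}',day='{day}',hour='{hour}') "
        f"LOCATION '{location}'"
    )


def handler(event, context, run_query=print):
    """Lambda-style entry point; run_query stands in for an Athena call."""
    statements = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # S3 event keys are URL-encoded
        statement = partition_statement(bucket, key)
        if statement is not None:
            run_query(statement)
            statements.append(statement)
    return statements
```

IF NOT EXISTS keeps the statement harmless when several files arrive in the same hour folder.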
  24. Populate Partitions if paths are known
     Issue Statements with Amazon Athena:
       ALTER TABLE mytable ADD PARTITION (year='2015',month='01',day='01') LOCATION 's3://[...]/2015/01/01/'
     Why? Easy for predictable paths. Can be prepopulated.
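Because the paths are predictable, the statements from slide 24 can be generated ahead of time. The sketch below produces one statement per day in the same form as the slide; the bucket and path in the example (`s3://my-bucket/logs`) are hypothetical placeholders, since the slide elides the real location.

```python
from datetime import date, timedelta


def daily_partition_statements(table, base_location, first, last):
    """Yield one ALTER TABLE ... ADD PARTITION statement per day, inclusive."""
    day = first
    while day <= last:
        yield (
            f"ALTER TABLE {table} ADD PARTITION "
            f"(year='{day:%Y}',month='{day:%m}',day='{day:%d}') "
            f"LOCATION '{base_location}/{day:%Y/%m/%d}/'"
        )
        day += timedelta(days=1)


# Hypothetical location; prepopulate the first three days of 2015.
for statement in daily_partition_statements(
    "mytable", "s3://my-bucket/logs", date(2015, 1, 1), date(2015, 1, 3)
):
    print(statement)
```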
  25. Columnar Formats
  26. Flat File Sample Layout
       First_Name   Last_Name    Age   Gender
       Tootsie      Label        34    Fem
       Miriam       Le Fleming   25    Fem
       Blakeley     Lisciandro   45    Fem
       Ernst        Minghi       63    Mal
       Brew         Jime         22    Mal
  27. Columnar Formats Layout (Parquet & ORC)
       First_Name      Last_Name     Age       Gender
       Tootsie         Label         34        Fem
       Miriam          Le Fleming    25        Fem
       Blakeley        Lisciandro    45        Fem
       Ernst           Minghi        63        Mal
       Brew            Jime          22        Mal
       MIN: Blakeley   MIN: Jime     MIN: 22   MIN: Fem
       MAX: Tootsie    MAX: Minghi   MAX: 63   MAX: Mal
  28. Benefit 1: Predicate Pushdown
     (same sample table and MIN/MAX column statistics as slide 27)
       SELECT * FROM ... WHERE Age > 30
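The idea behind predicate pushdown can be sketched as follows: each chunk of a column carries its MIN and MAX, so a reader can skip chunks that cannot contain a match. This is a simplified model, not the Parquet or ORC reader itself. The first row group uses the Age values from the slide; the second row group is hypothetical data added to show a skip.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    values: List[int]
    minimum: int  # statistics stored alongside the data at write time
    maximum: int


def write_row_group(values: List[int]) -> RowGroup:
    """Store the values together with their min/max statistics."""
    return RowGroup(list(values), min(values), max(values))


def scan_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return matching values and how many row groups had to be read."""
    matches, groups_read = [], 0
    for group in groups:
        if group.maximum <= threshold:
            continue  # no value can match, skip without reading the data
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


groups = [
    write_row_group([34, 25, 45, 63, 22]),  # Age column from the slide
    write_row_group([18, 21, 29]),          # hypothetical second row group
]
print(scan_greater_than(groups, 30))  # WHERE Age > 30
```

Only the first row group is read here, because the second one has a maximum of 29.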
  29. Benefit 2: Projection Pushdown/Column Pruning
     (same sample table and MIN/MAX column statistics as slide 27)
       SELECT First_Name FROM ... WHERE Age > 30
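Column pruning can be illustrated with the sample table stored column by column: the query on the slide only needs First_Name and Age, so the other two columns are never touched. This is a toy model of the layout, and the row alignment of the sample values is assumed from the order in which the slide lists them.

```python
# Sample table from the slide, stored as one list per column.
COLUMNS = {
    "First_Name": ["Tootsie", "Miriam", "Blakeley", "Ernst", "Brew"],
    "Last_Name": ["Label", "Le Fleming", "Lisciandro", "Minghi", "Jime"],
    "Age": [34, 25, 45, 63, 22],
    "Gender": ["Fem", "Fem", "Fem", "Mal", "Mal"],
}


def select(columns, project, where_column, predicate):
    """Evaluate 'SELECT project WHERE predicate(where_column)'.

    Returns the projected values and the sorted names of the columns read.
    """
    touched = {project, where_column}
    needed = {name: columns[name] for name in touched}  # only these are loaded
    rows = [
        value
        for value, probe in zip(needed[project], needed[where_column])
        if predicate(probe)
    ]
    return rows, sorted(touched)


# SELECT First_Name FROM ... WHERE Age > 30
print(select(COLUMNS, "First_Name", "Age", lambda age: age > 30))
```

In a row-oriented file, the same query would have to read every field of every row.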
  30. Benefit 3: Compression & Encoding
     • RLE (& Bit Packing) for numbers
     • Dictionary for string repetitions (+RLE)
     • Delta encoding for increasing numbers
     • Delta Strings (for strings with an identical prefix)
     • Plain encoding for varied strings
     https://github.com/apache/parquet-format/blob/master/Encodings.md
  31. More on Dictionary Encoding
     • Builds a list of unique strings and assigns a numeric ID to each
     • If the dictionary size exceeds 1MB (configurable) or the number of distinct values is too high, it falls back to Plain encoding.
     • The data itself is then represented as numbers and further encoded using RLE
     https://github.com/apache/parquet-format/blob/master/Encodings.md
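The two steps described on slide 31 can be sketched in plain Python: first replace each string by a numeric ID from a dictionary, then run-length encode the IDs. This shows the principle only; it does not reproduce Parquet's actual byte layout or its RLE/bit-packing hybrid. The example input is the Gender column from the sample table.

```python
def dictionary_encode(values):
    """Return (dictionary, ids): unique values in first-seen order and one ID per value."""
    dictionary, ids = {}, []
    for value in values:
        ids.append(dictionary.setdefault(value, len(dictionary)))
    return list(dictionary), ids


def run_length_encode(ids):
    """Collapse consecutive repeats into (id, run_length) pairs."""
    runs = []
    for current in ids:
        if runs and runs[-1][0] == current:
            runs[-1][1] += 1
        else:
            runs.append([current, 1])
    return [tuple(run) for run in runs]


def decode(dictionary, runs):
    """Invert both steps to recover the original values."""
    return [dictionary[i] for i, length in runs for _ in range(length)]


gender = ["Fem", "Fem", "Fem", "Mal", "Mal"]
dictionary, ids = dictionary_encode(gender)
runs = run_length_encode(ids)
print(dictionary, ids, runs)
```

Five strings shrink to a two-entry dictionary plus two runs, which is why low-cardinality columns compress so well.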
  32. Demo: Parquet/ORC with Amazon Kinesis Firehose (new!)
  33. Example Query V (parquet)
     What was the distribution of passenger load on 2nd September 10-11h?
       SELECT passenger_count, count(*) count
       FROM "128mbparquet"
       WHERE pickup_datetime BETWEEN '2018-09-02T10' AND '2018-09-02T11'
         AND partition_0 || partition_1 || partition_2 || partition_3 BETWEEN '2018090210' AND '2018090215'
       GROUP BY passenger_count
     Run time: 5.59 seconds, Data scanned: 1.77GB
     Run time: 3.21 seconds, Data scanned: 300.7MB
  34. Analyzing Parquet File
     • parquet-tools
         • head: view data in file
         • meta: get metadata summary
         • dump -d -n: get detailed metadata down to page level, stats included
  35. Download and build [1].
       $ java -jar parquet-tools.jar meta <parquetfile>
     Output annotations: Schema Information, Row Count, Total Byte Size, Size in Bytes, Value Count, Encoding
     [1] https://github.com/apache/parquet-mr/
  36. parquet-tools dump: Encoding & Statistics
       total_amount: - DOUBLE SNAPPY DO:0 FPO:4155231 SZ:329324/338501/1.03 [more]... ST:[min: -76.8, max: 1121.3, num_nulls: 0]
       dropoff_datetime: - BINARY SNAPPY DO:0 FPO:3315979 SZ:839131/5540639/6.60 [more]... ST:[no stats for this column]
     Use Unix epoch values or partition by timestamp for time series data.
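The dump on slide 36 shows that the numeric total_amount column carries min/max statistics while the dropoff_datetime column, stored as BINARY text, has none. One way to follow the slide's advice is to convert timestamps to Unix epoch milliseconds before writing. The sketch below assumes timestamps in the `YYYY-MM-DDTHH:MM:SS.fffZ` form that appears in the ORC dump on slide 38; other input formats would need a different pattern.

```python
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(text: str) -> int:
    """Convert 'YYYY-MM-DDTHH:MM:SS.fffZ' (UTC) to integer epoch milliseconds."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating point rounding.
    return (parsed - _EPOCH) // timedelta(milliseconds=1)


first = to_epoch_millis("2018-08-30T00:13:48.573Z")
last = to_epoch_millis("2018-08-30T00:28:49.564Z")
print(first, last, last - first)
```

Stored as integers, the values can be range-checked against per-column min/max statistics in the same way as total_amount.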
  37. Example Query VI (ORC)
     What was the distribution of passenger load on 2nd September 10-11h?
       SELECT passenger_count, count(*) count
       FROM "128mborc"
       WHERE pickup_datetime BETWEEN '2018-09-02T10' AND '2018-09-02T11'
         AND partition_0 || partition_1 || partition_2 || partition_3 BETWEEN '2018090210' AND '2018090215'
       GROUP BY passenger_count
     Run time: 3.21 seconds, Data scanned: 300.7MB
     Run time: 3.61 seconds, Data scanned: 303.38MB
  38. Analyzing ORC: orcfiledump
     Spin up a single node/master EMR Cluster and use the hive command:
       hive --orcfiledump file://<absolutepath>/file.orc
       […]
       Column 7: count: 210141 hasNull: false min: -76.96324157714844 max: 0.0 sum: -1.5329986951126099E7
       Column 8: count: 210141 hasNull: false min: 2018-08-30T00:13:48.573Z max: 2018-08-30T00:28:49.564Z sum: 5043384
       […]
  39. ETL with AWS Glue For Frequent Queries
     Diagram: Log, S3, Glue, Data Catalog, Athena; steps: Read/Write, Write table partitions, Schema Lookup, Query data
  40. Demo: ETL with AWS Glue
  41. Example Zeppelin/AWS Glue Notebook
     https://gist.github.com/steffeng/5b841a99230ba8377f161f55453d49d0
  42. Example Query VII (repartitioned)
     What was the distribution of passenger load on 2nd September 10-11h?
       SELECT passenger_count, count(*) count
       FROM "partitioned_by_hour"
       WHERE year = 2018 AND month = 9 AND day = 2 AND hour = 10
       GROUP BY passenger_count
     Run time: 3.21 seconds, Data scanned: 300.7MB
     Run time: 2.42 seconds, Data scanned: 2.06MB
  43. Example Query VIII (aggregated)
     What was the distribution of passenger load on 2nd September 10-11h?
       SELECT passenger_count, trip_count
       FROM "aggregates_by_hour"
       WHERE year = 2018 AND month = 9 AND day = 2 AND hour = 10
     Run time: 2.42 seconds, Data scanned: 2.06MB
     Run time: 1.85 seconds, Data scanned: 0.37KB
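The aggregated table queried on slide 43 holds one row per hour and passenger count. The sketch below shows that aggregation in plain Python on a handful of made-up rides; it is not the Glue job from the demo. The column names follow the queries in the deck, and the ISO-style text format of pickup_datetime is an assumption based on the string comparisons in the earlier queries.

```python
from collections import Counter


def aggregate_by_hour(rides):
    """Count trips per (pickup hour, passenger_count), like 'aggregates_by_hour'."""
    counts = Counter(
        (ride["pickup_datetime"][:13], ride["passenger_count"])  # 'YYYY-MM-DDTHH'
        for ride in rides
    )
    rows = []
    for (hour_key, passengers), trips in sorted(counts.items()):
        rows.append({
            "year": int(hour_key[0:4]),
            "month": int(hour_key[5:7]),
            "day": int(hour_key[8:10]),
            "hour": int(hour_key[11:13]),
            "passenger_count": passengers,
            "trip_count": trips,
        })
    return rows


# Hypothetical sample rides.
rides = [
    {"pickup_datetime": "2018-09-02T10:05:00", "passenger_count": 1},
    {"pickup_datetime": "2018-09-02T10:40:00", "passenger_count": 1},
    {"pickup_datetime": "2018-09-02T10:59:59", "passenger_count": 2},
    {"pickup_datetime": "2018-09-02T11:00:00", "passenger_count": 1},
]
for row in aggregate_by_hour(rides):
    print(row)
```

The frequent query then reads a few precomputed rows instead of scanning and grouping the rides each time, which matches the drop in scanned data between Query VII and Query VIII.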
  44. Recently announced and relevant...
  45. I applied these simple tricks when storing data for Amazon Athena and you won't believe what happened next...
     Photo by Benjamin Davies on Unsplash
  46. Measure. Then optimize. There's no silver bullet.
     Photo by Cesar Carlevarino Aragon on Unsplash
  47. Optimize for Cost and Performance 1/2
     • Use Athena in the region of your buckets.
     • Compress your data for less storage & query cost.
     • Use LIMIT in queries for faster results.
     • Partition your data based on data access patterns.
     • Use partitions in your queries.
     • Add partitions by crawling or S3 triggers.
  48. Optimize for Cost and Performance 2/2
     • Columnar formats such as ORC & Parquet reduce scanned data: faster, less cost
     • Pick format depending on data, access patterns, clients
     • Inspect/verify the resulting files
     • Create aggregates for frequent queries
     • Shorten turnaround times for Glue job development:
         • Use a provisioned development endpoint
         • Use a small subset of your data (think KB!)
  49. The AWS Free Tier allows you to get hands-on experience with AWS Glue and S3. Try it today!
  50. Questions? Ask the Architect downstairs!
