
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks

Learning Objectives:
- Discover dark data that you are currently not analyzing.
- Analyze dark data without moving it into your data warehouse.
- Visualize the results of your dark data analytics.


  1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Prajakta Damle, Sr Product Manager – AWS Glue; Ben Snively, Specialist SA – Data and Analytics. September 14, 2017. Tackle Your Dark Data Challenge with AWS Glue
  2. Agenda • What is Dark Data? • Automatically discovering your Dark Data • Understanding your Dark Data • Analyzing, processing, and transforming your Dark Data • Demonstration • Conclusion
  3. What is Dark Data? “Dark data” is data that an organization collects and stores but does not use in processes or analytics. • Therefore, dark data currently provides very little value. Organizations, however, believe their dark data can provide value, so they want to: • Discover the dark data that they have • Query / analyze it to drive additional insights that move the business forward
  4. AWS Glue • Discover: automatically discovers and categorizes your dark data to make it immediately searchable and queryable • Develop: generates code to clean, enrich, and reliably move data between data stores; you can also use your favorite tools to build ETL jobs • Deploy: runs your jobs on a serverless, fully managed, scale-out environment without needing to provision or manage compute resources
  5. AWS Glue: Components Data Catalog  Apache Hive Metastore compatible with enhanced functionality  Crawlers automatically extract metadata and create tables  Integrated with Amazon Athena and Amazon Redshift Spectrum Job Execution  Runs jobs on a serverless Apache Spark environment  Provides flexible scheduling  Handles dependency resolution, monitoring, and alerting Job Authoring  Auto-generates ETL code  Built on open frameworks – Python and Apache Spark  Developer-centric – editing, debugging, sharing
  6. AWS Glue Data Catalog Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single categorized list that is searchable
  7. Glue Data Catalog Data Catalog is automatically populated through Crawlers (it can also be populated using Apache Hive DDL or a bulk import script). Manage table metadata through an Apache Hive metastore API or Apache Hive SQL (supported by tools like Apache Hive, Presto, Apache Spark, etc.). We added a few extensions:  Search over metadata for data discovery  Connection info – JDBC URLs, credentials  Classification for identifying and parsing files  Versioning of table metadata as schemas evolve and other metadata is updated
  8. Glue Data Catalog: Crawlers Crawlers automatically build your Data Catalog and keep it in sync.  Automatically discover new data and extract schema definitions • Detect schema changes and version tables • Detect Apache Hive style partitions on Amazon S3  Built-in classifiers for popular data types • Custom classifiers using Grok expressions  Run ad hoc or on a schedule; serverless – only pay when the crawler runs
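As a rough sketch of how a crawler over an S3 path might be set up programmatically via the AWS SDK for Python (boto3): the crawler name, role ARN, database name, bucket path, and schedule below are all hypothetical placeholders, not values from this talk.

```python
# Illustrative sketch: parameters for a Glue crawler over an S3 path.
# Every name, ARN, and path here is a hypothetical placeholder.
crawler_params = {
    "Name": "dark-data-crawler",                        # hypothetical
    "Role": "arn:aws:iam::123456789012:role/GlueRole",  # hypothetical
    "DatabaseName": "dark_data_catalog",                # hypothetical
    "Targets": {"S3Targets": [{"Path": "s3://my-bucket/raw-logs/"}]},
    # Run on a schedule instead of ad hoc (cron syntax, daily at 02:00 UTC):
    "Schedule": "cron(0 2 * * ? *)",
}

# With AWS credentials configured, the actual calls would be:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**crawler_params)
#   glue.start_crawler(Name=crawler_params["Name"])
print(crawler_params["Targets"]["S3Targets"][0]["Path"])
```

Because crawlers are serverless, you pay only while `start_crawler` work is actually running, whether triggered ad hoc or by the schedule above.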
  9. Crawlers: Classifiers The crawler (under an IAM Role) connects to data lakes, data warehouses, and databases – Amazon RDS, Amazon Redshift, Amazon S3 – via JDBC and object connections. Built-in classifiers: MySQL, MariaDB, PostgreSQL, Aurora, Redshift; Avro, Parquet, ORC, JSON & BSON; Logs (Apache, Linux, MS, Ruby, Redis, and many others); Delimited (comma, pipe, tab, semicolon); Compressed formats (ZIP, BZIP, GZIP, LZ4, Snappy). Create additional custom classifiers with Grok!
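A Grok pattern is essentially a named regular expression. As a pure-Python illustration of what a custom classifier does, here is a toy parser for a made-up log format (the pattern and field names are invented for the example, not a real Glue classifier):

```python
import re

# A Grok expression like "%{IP:client} %{WORD:method} %{NUMBER:bytes}"
# compiles down to a regex with named groups, roughly:
LOG_PATTERN = re.compile(
    r"(?P<client>\d{1,3}(?:\.\d{1,3}){3})\s+"  # %{IP:client}
    r"(?P<method>[A-Z]+)\s+"                   # %{WORD:method}
    r"(?P<bytes>\d+)"                          # %{NUMBER:bytes}
)

def classify(line: str):
    """Return parsed fields if the line matches this format, else None."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

print(classify("10.0.0.1 GET 512"))
# {'client': '10.0.0.1', 'method': 'GET', 'bytes': '512'}
print(classify("not a log line"))  # None -> data is not this format
```

A classifier that returns no match simply means "not my format", and the crawler moves on to the next classifier, which is why built-in and custom classifiers can coexist.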
  10. Crawler: Detecting partitions Given an S3 bucket hierarchy such as month=Nov / date=10 … date=15 / file 1 … file N, the crawler estimates schema similarity among files at each level (e.g. sim=.99, sim=.95, sim=.93) to handle semi-structured logs and schema evolution, producing a single table definition with partition columns (month: str, date: str) plus data columns (col 1: int, col 2: float).
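The similarity estimate above can be pictured as comparing the column sets of files at each level of the hierarchy. A toy sketch follows; the Jaccard measure and the 0.7 threshold are assumptions for illustration, not the actual (undocumented here) Glue heuristic:

```python
def schema_similarity(cols_a, cols_b):
    """Jaccard similarity between two files' column sets (illustrative only)."""
    a, b = set(cols_a), set(cols_b)
    return len(a & b) / len(a | b)

file1 = ["col1", "col2", "ts"]
file2 = ["col1", "col2", "ts", "extra"]  # schema drifted slightly
sim = schema_similarity(file1, file2)
print(round(sim, 2))  # 0.75

# A crawler-like rule: treat files as one table when similarity is high,
# so evolving logs still land in a single partitioned table.
SAME_TABLE = sim >= 0.7  # hypothetical threshold
```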
  11. Glue Data Catalog: Table details Table schema, table properties, data statistics, nested fields
  12. Glue Data Catalog: Version control List of table versions; compare schema versions
  13. Understand your dark data
  14. Analyzing and processing your dark data
  15. Transforming your dark data
  16. Job authoring in AWS Glue You have choices on how to get started:  Python code generated by AWS Glue  Connect a notebook or IDE to AWS Glue  Existing code brought into AWS Glue
  17. Job authoring: Automatic code generation 1. Customize the mappings 2. Glue generates a transformation graph and Python code 3. Connect your notebook to development endpoints to customize your code
  18. Job authoring: ETL code  Human-readable, editable, and portable PySpark code  Flexible: Glue’s ETL library simplifies manipulating complex, semi-structured data  Customizable: use native PySpark, import custom libraries, and/or leverage Glue’s libraries  Collaborative: share code snippets via GitHub, reuse code across jobs
  19. Job Authoring: Glue DynamicFrames Like Apache Spark’s DataFrames, but better for: • Cleaning and (re)structuring semi-structured data sets, e.g. JSON, Avro, Apache logs ... No upfront schema needed: • Infers schema on-the-fly, enabling transformations in a single pass Easy to handle the unexpected: • Tracks new fields and inconsistent, changing data types with choices, e.g. integer or string • Automatically marks and separates error records
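The "choice" idea – tracking inconsistent types per field instead of failing – can be sketched in plain Python. This is a simplification for intuition, not how DynamicFrames are actually implemented:

```python
from collections import defaultdict

def infer_field_types(records):
    """Collect the set of type names seen for each field across records."""
    seen = defaultdict(set)
    for rec in records:
        for field, value in rec.items():
            seen[field].add(type(value).__name__)
    return dict(seen)

records = [
    {"id": 1, "zip": 98101},    # zip arrives as an integer
    {"id": 2, "zip": "02134"},  # zip arrives as a string (leading zero!)
]
types = infer_field_types(records)
print(types["zip"])  # a "choice" of both: {'int', 'str'}
```

Where a conventional schema-on-write load would reject or silently corrupt the second record, the choice type keeps both interpretations so a later transform (like ResolveChoice, next slide) can decide.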
  20. Job Authoring: Glue transforms ResolveChoice(): resolve a choice type by projecting to a single type, casting, or separating the alternatives into columns. ApplyMapping(): remap source columns and types to target columns and types. Adaptive and flexible.
  21. Job authoring: Relationalize() transform Converts a semi-structured schema (top-level fields plus nested arrays and choice fields) into a relational schema of linked tables, with primary key / foreign key columns and value / offset columns for array elements. • Transforms and adds new columns, types, and tables on-the-fly • Tracks keys and foreign keys across runs • SQL on the relational schema is orders of magnitude faster than JSON processing
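The unnesting step can be shown with a minimal pure-Python sketch: one parent table for the flat fields, one child table with a row per array element, linked by a generated key. The column naming (`pk`, `fk`, `offset`, `value`) mirrors the slide but the key scheme is a simplified assumption:

```python
def relationalize(records, array_field):
    """Split records into a parent table (array replaced by a key) and a
    child table holding one row per array element, linked by fk -> pk."""
    parent, child = [], []
    for pk, rec in enumerate(records):
        flat = {k: v for k, v in rec.items() if k != array_field}
        flat["pk"] = pk
        parent.append(flat)
        for offset, value in enumerate(rec.get(array_field, [])):
            child.append({"fk": pk, "offset": offset, "value": value})
    return parent, child

orders = [{"id": "o1", "items": ["apple", "pear"]},
          {"id": "o2", "items": ["fig"]}]
parent, child = relationalize(orders, "items")
print(len(parent), len(child))  # 2 3
```

Once the data is in this shape, a join on `pk = fk` answers "items per order" with ordinary SQL instead of repeated JSON parsing, which is where the speedup on the slide comes from.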
  22. Job authoring: Glue transforms  Prebuilt transformations: click and add to your job with simple configuration  Spigot writes sample data from a DynamicFrame to S3 in JSON format  Expanding… more transformations to come
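Spigot's behavior – tap off a small JSON sample of a frame for inspection – can be mimicked in a few lines. The "first n records" sampling strategy here is an assumption for the sketch; it writes to a string rather than S3:

```python
import json

def spigot_sample(records, n=2):
    """Serialize the first n records as JSON lines, like a debug tap
    on the data flowing through a job (illustrative simplification)."""
    return "\n".join(json.dumps(r) for r in records[:n])

data = [{"id": i, "ok": i % 2 == 0} for i in range(5)]
print(spigot_sample(data))
```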
  23. Job authoring: Write your own scripts Import custom libraries required by your code. Convert to an Apache Spark DataFrame for complex SQL-based ETL. Convert back to a Glue DynamicFrame for semi-structured processing and AWS Glue connectors.
  24. Job authoring: Developer endpoints  Environment to iteratively develop and test ETL code  Connect your IDE or notebook (e.g. Zeppelin) to a Glue development endpoint via a remote interpreter / interpreter server running in the Glue Apache Spark environment  When you are satisfied with the results, you can create an ETL job that runs your code
  25. Demonstration
  26. Conclusion Data Catalog  Apache Hive Metastore compatible with enhanced functionality  Crawlers automatically extract metadata and create tables  Integrated with Amazon Athena and Amazon Redshift Spectrum Job Execution  Runs jobs on a serverless Apache Spark environment  Provides flexible scheduling  Handles dependency resolution, monitoring, and alerting Job Authoring  Auto-generates ETL code  Built on open frameworks – Python and Apache Spark  Developer-centric – editing, debugging, sharing
  27. Thank you! https://aws.amazon.com/glue/developer-resources/
