© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Prajakta Damle, Sr Product Manager – AWS Glue
Ben Snively, Specialist SA – Data and Analytics
September 14, 2017
Tackle Your Dark Data Challenge with AWS Glue
Agenda
• What is Dark Data?
• Automatically discovering your Dark Data
• Understanding the Dark Data
• Analyzing, processing and transforming your Dark Data
• Demonstration
• Conclusion
What is Dark Data?
“Dark data” is data that is collected and stored by an organization, but it is not
used by processes or analytics.
• Therefore, dark data is currently providing very little value.
Organizations, however, believe that their dark data can provide value, so
they want to:
• Discover the dark data that they have
• Query / analyze it to drive additional insights to move the business forward
AWS Glue
Automatically discovers and categorizes your dark data to make it
immediately searchable and queryable
Generates code to clean, enrich, and reliably move data between data
stores; you can also use your favorite tools to build ETL jobs
Runs your jobs on a serverless, fully managed, scale-out environment
without needing to provision or manage compute resources
Discover
Develop
Deploy
AWS Glue: Components
Data Catalog
 Apache Hive Metastore compatible with enhanced functionality
 Crawlers automatically extract metadata and create tables
 Integrated with Amazon Athena, Amazon Redshift Spectrum
Job Execution
 Runs jobs on a serverless Apache Spark environment
 Provides flexible scheduling
 Handles dependency resolution, monitoring, and alerting
Job Authoring
 Auto-generates ETL code
 Built on open frameworks – Python and Apache Spark
 Developer-centric – editing, debugging, sharing
AWS Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single
categorized list that is searchable
Glue Data Catalog
Data Catalog automatically populated through Crawlers
(can also populate using Apache Hive DDL or bulk import script)
Manage table metadata through an Apache Hive metastore API or Apache
Hive SQL
(supported by tools like Apache Hive, Presto, Apache Spark etc.)
We added a few extensions:
 Search over metadata for data discovery
 Connection info – JDBC URLs, credentials
 Classification for identifying and parsing files
 Versioning of table metadata as schemas evolve and other metadata are updated
Glue Data Catalog: Crawlers
 Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Apache Hive style partitions on Amazon S3
 Built-in classifiers for popular data types
• Custom classifiers using Grok expressions
 Run ad hoc or on a schedule; serverless – only pay when crawler runs
Crawlers automatically build your Data Catalog and keep it in sync
Crawlers: Classifiers
IAM Role
Glue Crawler
Data Lakes
Data Warehouse
Databases
Amazon RDS
Amazon Redshift
Amazon S3
JDBC Connection
Object Connection
Built-In Classifiers
MySQL
MariaDB
PostgreSQL
Aurora
Redshift
Avro
Parquet
ORC
JSON & BSON
Logs
(Apache, Linux, MS, Ruby, Redis, and many others)
Delimited
(comma, pipe, tab, semicolon)
Compressed Formats
(ZIP, BZIP, GZIP, LZ4, Snappy)
Create additional Custom
Classifiers with Grok!
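One way to picture a Grok-based custom classifier is as a template that expands into a regular expression with named groups. The sketch below is plain Python, not Glue's classifier implementation. The pattern table and the `grok_to_regex` and `classify` helpers are hypothetical names chosen for illustration.

```python
import re

# Tiny illustrative pattern library; real Grok libraries ship many more.
GROK_PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "NUMBER": r"-?\d+(?:\.\d+)?",
    "NOTSPACE": r"\S+",
}

_GROK_TOKEN = re.compile(r"%\{(\w+):(\w+)\}")


def grok_to_regex(expression):
    """Expand %{PATTERN:field} tokens into named regex groups."""
    def expand(match):
        pattern_name, field = match.group(1), match.group(2)
        return "(?P<{}>{})".format(field, GROK_PATTERNS[pattern_name])
    return re.compile(_GROK_TOKEN.sub(expand, expression))


def classify(line, expression):
    """Return the extracted fields, or None if the line does not match."""
    match = grok_to_regex(expression).fullmatch(line)
    return match.groupdict() if match else None


EXPRESSION = "%{IP:client} %{WORD:method} %{NOTSPACE:path} %{NUMBER:status}"
fields = classify("10.0.0.7 GET /index.html 200", EXPRESSION)
```

A line that matches yields a dictionary of named fields, which is the raw material for a table schema. A line that does not match yields `None`, so another classifier can be tried.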
Crawler: Detecting partitions
[Figure: S3 bucket hierarchy mapped to a table definition. The hierarchy runs month=Nov, then date=10, date=15, …, then file 1 … file N under each date. Similarity scores shown at the levels: sim=.99, sim=.95, sim=.93.]
Estimate schema similarity among files at each level to
handle semi-structured logs, schema evolution…
Table definition:
Column   Type
month    str
date     str
col 1    int
col 2    float
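The partition-detection idea can be sketched in a few lines of plain Python. This is an illustration only. The slide does not say how the crawler measures similarity, so the Jaccard measure, the 0.6 threshold, the object keys, and the function names below are all assumptions.

```python
def partition_columns(key):
    """Extract Hive-style name=value segments from an object key."""
    parts = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep:
            parts[name] = value
    return parts


def schema_similarity(schema_a, schema_b):
    """Jaccard similarity over (column, type) pairs; 1.0 means identical."""
    a, b = set(schema_a.items()), set(schema_b.items())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical object keys laid out like the hierarchy on the slide.
keys = [
    "logs/month=Nov/date=10/file1.json",
    "logs/month=Nov/date=15/file1.json",
]
partitions = [partition_columns(k) for k in keys]

schema_nov_10 = {"col1": "int", "col2": "float"}
schema_nov_15 = {"col1": "int", "col2": "float", "col3": "str"}
similarity = schema_similarity(schema_nov_10, schema_nov_15)
same_table = similarity >= 0.6  # hypothetical threshold
```

When the files under sibling prefixes are similar enough, the prefixes can be treated as partitions of one table, with `month` and `date` becoming partition columns.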
Glue Data Catalog: Table details
Table schema
Table properties
Data statistics
Nested fields
Glue Data Catalog: Version control
List of table versions
Compare schema versions
Understand your dark data
Analyzing and Processing your dark data
Transforming your dark data
Job authoring in AWS Glue
 Python code generated by AWS Glue
 Connect a notebook or IDE to AWS Glue
 Existing code brought into AWS Glue
You have choices on
how to get started
1. Customize the mappings
2. Glue generates transformation graph and Python code
3. Connect your notebook to development endpoints to customize your code
Job authoring: Automatic code generation
 Human-readable, editable, and portable PySpark code
 Flexible: Glue’s ETL library simplifies manipulating complex, semi-structured data
 Customizable: Use native PySpark, import custom libraries, and/or leverage Glue’s libraries
 Collaborative: share code snippets via GitHub, reuse code across jobs
Job authoring: ETL code
Job Authoring: Glue Dynamic Frames
Dynamic frame schema
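[Figure: schema tree with top-level fields A, C, and an array D [ ]; nested fields X and Y; and a field B shown with two variants, B1 and B2]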
Like Apache Spark’s Data Frames, but better for:
• Cleaning and (re)-structuring semi-structured
data sets, e.g. JSON, Avro, Apache logs ...
No upfront schema needed:
• Infers schema on-the-fly, enabling transformations
in a single pass
Easy to handle the unexpected:
• Tracks new fields, and inconsistent changing data
types with choices, e.g. integer or string
• Automatically mark and separate error records
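The "no upfront schema" and "choices" ideas can be illustrated without Glue at all. The sketch below is plain Python, not the DynamicFrame implementation. It records every type seen for each field in a single pass, so a field seen as both integer and string shows up as a choice. The sample records and function names are hypothetical.

```python
def infer_schema(records):
    """Single pass: collect every type observed for each field."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def choice_fields(schema):
    """Fields that were seen with more than one type."""
    return sorted(f for f, types in schema.items() if len(types) > 1)


records = [
    {"id": 1, "price": 10},
    {"id": 2, "price": "12.50"},                # same field, different type
    {"id": 3, "price": 7, "coupon": "SPRING"},  # new field appears mid-stream
]
schema = infer_schema(records)
choices = choice_fields(schema)
```

Here `price` ends up with two observed types and is reported as a choice, while `coupon` is simply added to the schema when it first appears.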
Job Authoring: Glue transforms
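ResolveChoice(): resolves a field B that has conflicting types by one of:
• project (yields a single B)
• cast (yields a single B)
• separate into cols (yields two B columns)
Apply Mapping(): remaps fields such as A, X, Y
Adaptive and flexible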
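The three resolution strategies and the mapping step can be imitated in plain Python to show what each one does to the data. This sketch does not call Glue's ETL library. The function names and sample records are hypothetical, and reading "project" as "keep only values that already have the chosen type" is an assumption.

```python
def resolve_project(records, field, keep_type):
    """Keep the field only where the value already has the chosen type."""
    return [
        {k: v for k, v in r.items() if k != field or isinstance(v, keep_type)}
        for r in records
    ]


def resolve_cast(records, field, to_type):
    """Convert every value of the field to one type."""
    return [
        {k: (to_type(v) if k == field else v) for k, v in r.items()}
        for r in records
    ]


def resolve_make_cols(records, field):
    """Split the field into one column per observed type, e.g. price_int."""
    out = []
    for r in records:
        row = {k: v for k, v in r.items() if k != field}
        if field in r:
            row["{}_{}".format(field, type(r[field]).__name__)] = r[field]
        out.append(row)
    return out


def apply_mapping(records, mappings):
    """Rename and convert fields; mappings are (source, target, type) tuples."""
    return [
        {target: to_type(r[source])
         for source, target, to_type in mappings if source in r}
        for r in records
    ]


records = [{"id": 1, "price": 10}, {"id": 2, "price": "12"}]

projected = resolve_project(records, "price", int)
cast = resolve_cast(records, "price", int)
split = resolve_make_cols(records, "price")
mapped = apply_mapping(cast, [("id", "order_id", str), ("price", "amount", float)])
```

Projecting loses the string value, casting keeps every row in one column, and splitting keeps both variants side by side. Which one fits depends on whether the conflicting values are noise or real data.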
Job authoring: Relationalize() transform
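[Figure: a semi-structured schema (fields A, B, a struct C with nested X and Y, and an array D [ ]) is converted into a relational schema: a main table with columns A, B, B, C.X, C.Y and a foreign key (FK), plus a child table with columns PK, Offset, Value]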
• Transforms and adds new columns, types, and tables on-the-fly
• Tracks keys and foreign keys across runs
• SQL on the relational schema is orders of magnitude faster than JSON processing
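The shape of the output can be sketched in plain Python. This is an illustration of the idea, not Glue's Relationalize() transform. Nested structs become dotted columns, and each array moves into a child table whose rows carry a key, an offset, and a value. The table and column names are hypothetical, arrays of structs are not flattened further, and tracking keys across runs is not modelled.

```python
def relationalize(records, root="root"):
    """Flatten structs into dotted columns; move arrays into child tables."""
    tables = {root: []}
    next_key = 0

    def flatten(value, prefix, row):
        nonlocal next_key
        for name, item in value.items():
            column = prefix + name
            if isinstance(item, dict):
                flatten(item, column + ".", row)
            elif isinstance(item, list):
                next_key += 1
                row[column] = next_key  # foreign key into the child table
                child = tables.setdefault(root + "_" + column, [])
                for offset, element in enumerate(item):
                    child.append(
                        {"pk": next_key, "offset": offset, "value": element}
                    )
            else:
                row[column] = item

    for record in records:
        row = {}
        flatten(record, "", row)
        tables[root].append(row)
    return tables


tables = relationalize([{"A": 1, "C": {"X": "x", "Y": "y"}, "D": [10, 20]}])
```

The root row keeps `A`, `C.X`, and `C.Y` as ordinary columns, while `D` holds a key that joins to the rows of the `root_D` child table.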
Job authoring: Glue transforms
 Prebuilt transformation: Click and
add to your job with simple
configuration
 Spigot writes sample data from
DynamicFrame to S3 in JSON format
 Expanding… more transformations
to come
Job authoring: Write your own scripts
Import custom libraries required by your code
Convert to Apache Spark Data Frame
for complex SQL-based ETL
Convert back to Glue Dynamic Frame
for semi-structured processing and
AWS Glue connectors
Job authoring: Developer endpoints
 Environment to iteratively develop and test ETL code.
 Connect your IDE or notebook (e.g. Zeppelin) to a Glue development endpoint.
 When you are satisfied with the results you can create an ETL job that runs your code.
[Figure: a remote interpreter connects to an interpreter server running in the Glue Apache Spark environment]
Demonstration
Conclusion
Data Catalog
 Apache Hive Metastore compatible with enhanced functionality
 Crawlers automatically extract metadata and create tables
 Integrated with Amazon Athena, Amazon Redshift Spectrum
Job Execution
 Runs jobs on a serverless Apache Spark environment
 Provides flexible scheduling
 Handles dependency resolution, monitoring, and alerting
Job Authoring
 Auto-generates ETL code
 Built on open frameworks – Python and Apache Spark
 Developer-centric – editing, debugging, sharing
Thank you!
https://aws.amazon.com/glue/developer-resources/

Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks

  • 1. Tackle Your Dark Data Challenge with AWS Glue
      Prajakta Damle, Sr Product Manager – AWS Glue
      Ben Snively, Specialist SA – Data and Analytics
      September 14, 2017
      © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 2. Agenda
      • What is Dark Data?
      • Automatically discovering your Dark Data
      • Understanding the Dark Data
      • Analyzing, processing and transforming your Dark Data
      • Demonstration
      • Conclusion
  • 3. What is Dark Data?
      “Dark data” is data that an organization collects and stores but does not use in processes or analytics.
      • Therefore, dark data currently provides very little value.
      Organizations, however, believe that their dark data can provide value, so they want to:
      • Discover the dark data that they have
      • Query and analyze it to drive additional insights that move the business forward
  • 4. AWS Glue
      • Discover: Automatically discovers and categorizes your dark data to make it immediately searchable and queryable
      • Develop: Generates code to clean, enrich, and reliably move data between data stores; you can also use your favorite tools to build ETL jobs
      • Deploy: Runs your jobs on a serverless, fully managed, scale-out environment without needing to provision or manage compute resources
  • 5. AWS Glue: Components
      Data Catalog
      • Apache Hive Metastore compatible with enhanced functionality
      • Crawlers automatically extract metadata and create tables
      • Integrated with Amazon Athena, Amazon Redshift Spectrum
      Job Execution
      • Runs jobs on a serverless Apache Spark environment
      • Provides flexible scheduling
      • Handles dependency resolution, monitoring, and alerting
      Job Authoring
      • Auto-generates ETL code
      • Built on open frameworks: Python and Apache Spark
      • Developer-centric: editing, debugging, sharing
  • 6. AWS Glue Data Catalog
      Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single categorized list that is searchable.
  • 7. Glue Data Catalog
      • Data Catalog automatically populated through Crawlers (can also populate using Apache Hive DDL or a bulk import script)
      • Manage table metadata through an Apache Hive metastore API or Apache Hive SQL (supported by tools like Apache Hive, Presto, Apache Spark, etc.)
      We added a few extensions:
      • Search over metadata for data discovery
      • Connection info: JDBC URLs, credentials
      • Classification for identifying and parsing files
      • Versioning of table metadata as schemas evolve and other metadata are updated
  • 8. Glue Data Catalog: Crawlers
      Crawlers automatically build your Data Catalog and keep it in sync.
      • Automatically discover new data and extract schema definitions
          • Detect schema changes and version tables
          • Detect Apache Hive style partitions on Amazon S3
      • Built-in classifiers for popular data types
          • Custom classifiers using Grok expressions
      • Run ad hoc or on a schedule; serverless, so you only pay when the crawler runs
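To make "Apache Hive style partitions" concrete, here is a minimal sketch in plain Python of how `name=value` directory segments in an S3 key can be read as partition columns. The function name `parse_hive_partitions` and the sample keys are hypothetical, and this is not how the Glue crawler is implemented; it only illustrates the naming convention the crawler looks for.

```python
from typing import Dict, Tuple


def parse_hive_partitions(s3_key: str) -> Tuple[Dict[str, str], str]:
    """Split an object key into Hive-style partition values and the file name."""
    *dirs, filename = s3_key.split("/")
    partitions: Dict[str, str] = {}
    for segment in dirs:
        # A Hive-style partition directory is written as "name=value".
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions, filename


# Hypothetical keys laid out like the month/date hierarchy in the talk.
keys = [
    "logs/month=Nov/date=10/file1.json",
    "logs/month=Nov/date=15/file2.json",
]
parsed = [parse_hive_partitions(key) for key in keys]

# Every partition name seen across the keys becomes a candidate table column.
partition_columns = sorted({name for parts, _ in parsed for name in parts})
```

Directories without an `=` sign (such as `logs` here) are ignored, so only the `month` and `date` segments end up as partition columns.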
  • 9. Crawlers: Classifiers
      [Diagram labels: IAM Role; Glue Crawler; Data Lakes, Data Warehouse, Databases; Amazon S3, Amazon Redshift, Amazon RDS; JDBC Connection, Object Connection]
      Built-in classifiers:
      • MySQL, MariaDB, PostgreSQL, Aurora, Redshift
      • Avro, Parquet, ORC, JSON & BSON
      • Logs (Apache, Linux, MS, Ruby, Redis, and many others)
      • Delimited (comma, pipe, tab, semicolon)
      • Compressed formats (ZIP, BZIP, GZIP, LZ4, Snappy)
      Create additional custom classifiers with Grok!
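A Grok expression is essentially a regular expression assembled from named, reusable patterns written as `%{PATTERN:field}`. The sketch below, in plain Python, expands such tokens into named regex groups and applies the result to one log line. The four-entry pattern table, the `ACCESS_LOG` expression, and the helper names are all hypothetical; real Grok libraries ship far larger pattern sets, and this is not the Glue classifier itself.

```python
import re
from typing import Dict, Optional

# A tiny, hypothetical pattern library for illustration only.
GROK_PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "INT": r"[+-]?\d+",
    "URIPATH": r"/\S*",
}

_GROK_TOKEN = re.compile(r"%\{(\w+):(\w+)\}")


def grok_to_regex(expression: str) -> re.Pattern:
    """Expand each %{PATTERN:field} token into a named regex group."""
    def expand(match: re.Match) -> str:
        pattern_name, field = match.group(1), match.group(2)
        return f"(?P<{field}>{GROK_PATTERNS[pattern_name]})"

    return re.compile(_GROK_TOKEN.sub(expand, expression))


def classify(line: str, expression: str) -> Optional[Dict[str, str]]:
    """Return the extracted fields, or None when the line does not match."""
    match = grok_to_regex(expression).fullmatch(line)
    return match.groupdict() if match else None


# Hypothetical custom classifier for a simplified access-log line.
ACCESS_LOG = "%{IP:client} %{WORD:method} %{URIPATH:path} %{INT:status}"
record = classify("10.0.0.7 GET /index.html 200", ACCESS_LOG)
```

A line that does not fit the expression returns `None`, which is the signal a classifier would use to fall through to the next candidate format.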
  • 10. Crawler: Detecting partitions
      Estimate schema similarity among files at each level to handle semi-structured logs, schema evolution…
      [Figure: S3 bucket hierarchy (month=Nov; date=10, date=15…; file 1 … file N) annotated with similarity scores sim=.99, sim=.95, sim=.93, mapped to a table definition]
      Table definition (Column: Type): month: str, date: str, col 1: int, col 2: float
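The slide does not say how schema similarity is computed. As one possible reading, the sketch below scores two file schemas by the Jaccard overlap of their (column, type) pairs and treats files as one table when every score clears a threshold. The Jaccard measure, the 0.6 threshold, and the function names are assumptions made for illustration, not the crawler's actual method.

```python
from typing import Dict, List

Schema = Dict[str, str]  # column name -> type name


def schema_similarity(a: Schema, b: Schema) -> float:
    """Jaccard overlap of (column, type) pairs; 1.0 means identical schemas."""
    fields_a, fields_b = set(a.items()), set(b.items())
    union = fields_a | fields_b
    if not union:
        return 1.0
    return len(fields_a & fields_b) / len(union)


def same_table(schemas: List[Schema], threshold: float = 0.6) -> bool:
    """Treat files as one table if each is similar enough to the first file."""
    first = schemas[0]
    return all(schema_similarity(first, s) >= threshold for s in schemas[1:])


# Hypothetical schemas: the second file has gained one column over time.
file_1 = {"col 1": "int", "col 2": "float"}
file_n = {"col 1": "int", "col 2": "float", "col 3": "str"}

score = schema_similarity(file_1, file_n)   # 2 shared pairs out of 3 in total
grouped = same_table([file_1, file_n])
```

A tolerant threshold is what lets evolving schemas and semi-structured logs land in a single table instead of one table per file.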
  • 11. Glue Data Catalog: Table details
      • Table schema
      • Table properties
      • Data statistics
      • Nested fields
  • 12. Glue Data Catalog: Version control
      • List of table versions
      • Compare schema versions
  • 14. Analyzing and Processing your dark data
  • 16. Job authoring in AWS Glue
      You have choices on how to get started:
      • Python code generated by AWS Glue
      • Connect a notebook or IDE to AWS Glue
      • Existing code brought into AWS Glue
  • 17. Job authoring: Automatic code generation
      1. Customize the mappings
      2. Glue generates transformation graph and Python code
      3. Connect your notebook to development endpoints to customize your code
  • 18. Job authoring: ETL code
      • Human-readable, editable, and portable PySpark code
      • Flexible: Glue’s ETL library simplifies manipulating complex, semi-structured data
      • Customizable: Use native PySpark, import custom libraries, and/or leverage Glue’s libraries
      • Collaborative: Share code snippets via GitHub, reuse code across jobs
  • 19. Job Authoring: Glue Dynamic Frames
      [Diagram: dynamic frame schema with fields A, B1, B2, C, D [ ], X, Y]
      Like Apache Spark’s Data Frames, but better for:
      • Cleaning and (re)structuring semi-structured data sets, e.g. JSON, Avro, Apache logs ...
      No upfront schema needed:
      • Infers schema on the fly, enabling transformations in a single pass
      Easy to handle the unexpected:
      • Tracks new fields and inconsistent, changing data types with choices, e.g. integer or string
      • Automatically marks and separates error records
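The idea of inferring a schema in one pass while tracking "choices" can be sketched in plain Python: walk the records once, record every type seen per field, and flag fields with more than one type. The sample records and the function name `infer_schema` are hypothetical, and the sketch does not use or reproduce the Glue DynamicFrame implementation.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Set


def infer_schema(records: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Single pass over the data: collect every type observed for each field."""
    schema: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        for field, value in record.items():
            schema[field].add(type(value).__name__)
    return dict(schema)


# Hypothetical records: "zip" arrives as both int and str, "note" appears late.
records = [
    {"id": 1, "zip": 98101},
    {"id": 2, "zip": "98101-1234"},
    {"id": 3, "zip": 10001, "note": "new field"},
]

schema = infer_schema(records)

# A field with more than one observed type is a "choice" to resolve later.
choices = {field for field, types in schema.items() if len(types) > 1}
```

Here `zip` ends up as a choice between `int` and `str`, and the late-arriving `note` field is simply added to the schema rather than causing a failure.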
  • 20. Job Authoring: Glue transforms
      Adaptive and flexible
      [Diagram: ResolveChoice() applied to a choice column B, with the options project, cast, and separate into cols; ApplyMapping() applied to fields A, X, Y, and C]
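The three options named on the slide (project, cast, separate into cols) can be illustrated on plain Python dictionaries. In this sketch, `cast:int` converts every value and yields `None` when conversion fails, `project:int` keeps only values that are already integers, and `make_cols` moves each value into a column named after its type. The function `resolve_choice`, the `None` fallback, and the `B_int` / `B_str` column names are assumptions of the sketch, not the behavior of Glue's own ResolveChoice transform.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def _cast_to_int(value: Any) -> Optional[int]:
    """Try to convert a value to int; None marks a value that cannot be cast."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def resolve_choice(records: List[Record], field: str, action: str) -> List[Record]:
    """Resolve a mixed-type field using one of three illustrative strategies."""
    resolved = []
    for record in records:
        row = {k: v for k, v in record.items() if k != field}
        value = record.get(field)
        if action == "cast:int":
            row[field] = _cast_to_int(value)
        elif action == "project:int":
            row[field] = value if isinstance(value, int) else None
        elif action == "make_cols":
            # One column per observed type, e.g. B_int and B_str.
            row[f"{field}_{type(value).__name__}"] = value
        else:
            raise ValueError(f"unknown action: {action}")
        resolved.append(row)
    return resolved


# Hypothetical data: column B mixes integers and strings.
records = [{"A": 1, "B": 7}, {"A": 2, "B": "8"}, {"A": 3, "B": "n/a"}]

cast_rows = resolve_choice(records, "B", "cast:int")
projected_rows = resolve_choice(records, "B", "project:int")
split_rows = resolve_choice(records, "B", "make_cols")
```

Casting recovers the string `"8"` as a number, projecting discards it, and splitting keeps everything at the cost of a wider table, which is the trade-off to weigh when picking a strategy.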
  • 21. Job authoring: Relationalize() transform
      [Diagram: semi-structured schema (A, B, B, C, D [ ], X, Y) mapped to a relational schema (FK, A, B, B, C.X, C.Y; PK, Value, Offset)]
      • Transforms and adds new columns, types, and tables on the fly
      • Tracks keys and foreign keys across runs
      • SQL on the relational schema is orders of magnitude faster than JSON processing
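A rough sketch of what relationalizing means, in plain Python: nested structs are flattened into dotted column names such as `C.X`, and each array is moved into its own child table whose rows carry a key, an offset, and a value. The key layout (the root column stores an id that the child rows repeat as `PK`), the `root_B` table naming, and the function names are assumptions of this sketch; only the column labels are borrowed from the slide, and arrays of nested structs are not handled.

```python
from typing import Any, Dict, List

Table = List[Dict[str, Any]]


def relationalize(records: List[Dict[str, Any]], root: str = "root") -> Dict[str, Table]:
    """Flatten nested records into a root table plus one child table per array."""
    tables: Dict[str, Table] = {root: []}
    for record_id, record in enumerate(records):
        row: Dict[str, Any] = {}
        _flatten(record, "", row, record_id, tables, root)
        tables[root].append(row)
    return tables


def _flatten(node, prefix, row, record_id, tables, root) -> None:
    for key, value in node.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Struct: recurse and build dotted column names like "C.X".
            _flatten(value, f"{name}.", row, record_id, tables, root)
        elif isinstance(value, list):
            # Array: the root keeps a foreign key, the items go to a child table.
            row[name] = record_id
            child = tables.setdefault(f"{root}_{name}", [])
            for offset, item in enumerate(value):
                child.append({"PK": record_id, "Offset": offset, "Value": item})
        else:
            row[name] = value


# Hypothetical semi-structured record with one array and one struct.
records = [{"A": 1, "B": [10, 20], "C": {"X": "x1", "Y": "y1"}}]
tables = relationalize(records)
```

After this step both tables are flat, so the array items can be joined back to their parent row with an ordinary SQL join on the key column.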
  • 22. Job authoring: Glue transforms
      • Prebuilt transformations: click and add to your job with simple configuration
      • Spigot writes sample data from a DynamicFrame to S3 in JSON format
      • Expanding… more transformations to come
  • 23. Job authoring: Write your own scripts
      • Import custom libraries required by your code
      • Convert to an Apache Spark Data Frame for complex SQL-based ETL
      • Convert back to a Glue Dynamic Frame for semi-structured processing and AWS Glue connectors
  • 24. Job authoring: Developer endpoints
      • Environment to iteratively develop and test ETL code.
      • Connect your IDE or notebook (e.g. Zeppelin) to a Glue development endpoint.
      • When you are satisfied with the results, you can create an ETL job that runs your code.
      [Diagram labels: Glue Apache Spark environment; Remote interpreter; Interpreter server]
  • 26. Conclusion
      Data Catalog
      • Apache Hive Metastore compatible with enhanced functionality
      • Crawlers automatically extract metadata and create tables
      • Integrated with Amazon Athena, Amazon Redshift Spectrum
      Job Execution
      • Runs jobs on a serverless Apache Spark environment
      • Provides flexible scheduling
      • Handles dependency resolution, monitoring, and alerting
      Job Authoring
      • Auto-generates ETL code
      • Built on open frameworks: Python and Apache Spark
      • Developer-centric: editing, debugging, sharing