Learning Objectives:
- Discover dark data that you are currently not analyzing.
- Analyze dark data without moving it into your data warehouse.
- Visualize the results of your dark data analytics.
2. Agenda
• What is Dark Data?
• Automatically discovering your Dark Data
• Understanding the Dark Data
• Analyzing, processing and transforming your Dark Data
• Demonstration
• Conclusion
3. What is Dark Data?
“Dark data” is data that an organization collects and stores but does not
use in any process or analysis.
• Therefore, dark data currently provides very little value.
Organizations, however, believe that their dark data can provide value, so
they want to:
• Discover the dark data that they have
• Query / analyze it to drive additional insights to move the business forward
4. AWS Glue
Automatically discovers and categorizes your dark data to make it
immediately searchable and queryable
Generates code to clean, enrich, and reliably move data between data
stores; you can also use your favorite tools to build ETL jobs
Runs your jobs on a serverless, fully managed, scale-out environment
without needing to provision or manage compute resources
Discover
Develop
Deploy
5. AWS Glue: Components
Data Catalog
Apache Hive Metastore compatible with enhanced functionality
Crawlers automatically extract metadata and create tables
Integrated with Amazon Athena, Amazon Redshift Spectrum
Job Execution
Runs jobs on a serverless Apache Spark environment
Provides flexible scheduling
Handles dependency resolution, monitoring, and alerting
Job Authoring
Auto-generates ETL code
Built on open frameworks – Python and Apache Spark
Developer-centric – editing, debugging, sharing
6. AWS Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single
categorized list that is searchable
7. Glue Data Catalog
Data Catalog automatically populated through Crawlers
(can also populate using Apache Hive DDL or bulk import script)
Manage table metadata through an Apache Hive metastore API or Apache
Hive SQL
(supported by tools like Apache Hive, Presto, Apache Spark etc.)
We added a few extensions:
Search over metadata for data discovery
Connection info – JDBC URLs, credentials
Classification for identifying and parsing files
Versioning of table metadata as schemas evolve and other metadata are updated
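The metadata-search extension above is exposed through the Glue SearchTables API. A minimal sketch, assuming a recent boto3; the keyword and result limit are illustrative:

```python
# Sketch: searching the Glue Data Catalog for tables whose metadata
# matches a keyword. The keyword "clickstream" is a hypothetical example.
def build_search_request(keyword, max_results=10):
    """Build kwargs for glue.search_tables (boto3)."""
    return {
        "SearchText": keyword,     # matched against table names, descriptions, etc.
        "MaxResults": max_results,
    }

request = build_search_request("clickstream")

# With AWS credentials configured, the call would look like:
# import boto3
# glue = boto3.client("glue")
# for table in glue.search_tables(**request)["TableList"]:
#     print(table["DatabaseName"], table["Name"])
```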
8. Glue Data Catalog: Crawlers
Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Apache Hive style partitions on Amazon S3
Built-in classifiers for popular data types
• Custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless – only pay when crawler runs
Crawlers automatically build your Data Catalog and keep it in sync
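A crawler like the one described above can be defined through the boto3 Glue API. A minimal sketch; the role ARN, database name, S3 path, and schedule are placeholders:

```python
# Sketch: defining a scheduled Glue crawler over an S3 prefix.
# All names, ARNs, and paths below are hypothetical.
def build_crawler_config(name, role_arn, database, s3_path):
    return {
        "Name": name,
        "Role": role_arn,                        # IAM role the crawler assumes
        "DatabaseName": database,                # catalog database to populate
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": "cron(0 2 * * ? *)",         # daily at 02:00 UTC; omit to run ad hoc
    }

config = build_crawler_config(
    "dark-data-crawler",
    "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "dark_data_db",
    "s3://example-bucket/raw/",
)

# With AWS credentials configured:
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**config)
# glue.start_crawler(Name=config["Name"])
```

Since the crawler is serverless, you pay only while it runs; dropping `Schedule` leaves it as an on-demand crawler.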
9. Crawlers: Classifiers
A Glue Crawler assumes an IAM role and connects to your data stores:
• Databases and data warehouses (Amazon RDS, Amazon Redshift) via a JDBC connection
• Data lakes (Amazon S3) via an object connection
Built-in classifiers:
• Databases: MySQL, MariaDB, PostgreSQL, Aurora, Redshift
• File formats: Avro, Parquet, ORC, JSON & BSON
• Logs: Apache, Linux, MS, Ruby, Redis, and many others
• Delimited: comma, pipe, tab, semicolon
• Compressed formats: ZIP, BZIP, GZIP, LZ4, Snappy
Create additional custom classifiers with Grok!
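A custom Grok classifier is registered through the boto3 `create_classifier` call. A minimal sketch for a log format the built-in classifiers do not recognize; the classifier name, classification label, and pattern are illustrative:

```python
# Sketch: a custom Grok classifier for an application log line such as
# "2017-11-01T02:15:00 ERROR something broke". Names are hypothetical.
def build_grok_classifier(name, classification, grok_pattern):
    """Build the argument for glue.create_classifier (boto3)."""
    return {
        "GrokClassifier": {
            "Name": name,
            "Classification": classification,   # label applied to matched data
            "GrokPattern": grok_pattern,
        }
    }

classifier = build_grok_classifier(
    "app-log-classifier",
    "app_logs",
    "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
)

# With AWS credentials configured:
# import boto3
# boto3.client("glue").create_classifier(**classifier)
```

Crawlers try custom classifiers before the built-in ones, so this pattern takes precedence for files it matches.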
10. Crawler: Detecting partitions
[Diagram: an S3 bucket hierarchy such as month=Nov/date=10/file 1 … file N mapped to a table definition with partition columns month and date plus data columns col 1, col 2, each with an inferred type (str, str, int, float).]
The crawler estimates schema similarity among files at each level of the hierarchy (e.g. sim=.99, .95, .93) to decide whether a level is a partition or part of the data, handling semi-structured logs, schema evolution, and so on.
11. Glue Data Catalog: Table details
Table schema
Table properties
Data statistics
Nested fields
12. Glue Data Catalog: Version control
List of table versions
Compare schema versions
16. Job authoring in AWS Glue
You have choices on how to get started:
• Python code generated by AWS Glue
• Connect a notebook or IDE to AWS Glue
• Existing code brought into AWS Glue
17. Job authoring: Automatic code generation
1. Customize the mappings
2. Glue generates the transformation graph and Python code
3. Connect your notebook to development endpoints to customize your code
18. Job authoring: ETL code
Human-readable, editable, and portable PySpark code
Flexible: Glue’s ETL library simplifies manipulating complex, semi-structured data
Customizable: Use native PySpark, import custom libraries, and/or leverage Glue’s libraries
Collaborative: Share code snippets via GitHub; reuse code across jobs
19. Job Authoring: Glue Dynamic Frames
[Diagram: a dynamic frame schema with nested fields, an array, and a field B that can be one of two types (B1 or B2) — a “choice” type.]
Like Apache Spark’s Data Frames, but better for:
• Cleaning and (re)structuring semi-structured data sets, e.g. JSON, Avro, Apache logs ...
No upfront schema needed:
• Infers schema on-the-fly, enabling transformations in a single pass
Easy to handle the unexpected:
• Tracks new fields and inconsistent, changing data types with choices, e.g. integer or string
• Automatically marks and separates error records
20. Job Authoring: Glue transforms
ResolveChoice(): resolves a field with a choice type (e.g. int or string) by projecting to one type, casting to a single type, or separating the choice into distinct columns
ApplyMapping(): maps source fields (e.g. A, X, Y) to target fields, reshaping the schema
Adaptive and flexible
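The semantics of ResolveChoice’s cast action can be illustrated in plain Python (this mimics the behavior, not the Glue API; the column and values are invented):

```python
# Plain-Python illustration of what ResolveChoice's "cast" action does to a
# column that holds a mix of ints and strings (a Glue "choice" type).
records = [
    {"id": 1, "zip": 98101},      # integer
    {"id": 2, "zip": "02134"},    # string with a leading zero
]

def resolve_choice_cast(rows, column, target):
    """Force every value in `column` to the target type ('string' here)."""
    out = []
    for row in rows:
        row = dict(row)
        if target == "string":
            row[column] = str(row[column])
        out.append(row)
    return out

resolved = resolve_choice_cast(records, "zip", "string")
# Every zip is now a string. In a Glue job the equivalent call would be
# dynamic_frame.resolveChoice(specs=[("zip", "cast:string")]).
```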
21. Job authoring: Relationalize() transform
[Diagram: a semi-structured schema with nested fields and an array flattened into a relational schema — nested fields become columns (e.g. C.X, C.Y) and the array becomes a child table with PK, FK, value, and offset columns.]
• Transforms and adds new columns, types, and tables on-the-fly
• Tracks keys and foreign keys across runs
• SQL on the relational schema is orders of magnitude faster than JSON processing
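The flattening idea behind Relationalize() can be sketched in plain Python (this mimics the behavior, not the Glue API; the field names and sample data are invented):

```python
# Plain-Python illustration of Relationalize(): pivot an array field out into
# a child table linked back to the root table by a generated key.
def relationalize(rows, array_field):
    root, child = [], []
    for pk, row in enumerate(rows):
        flat = {k: v for k, v in row.items() if k != array_field}
        flat[array_field] = pk                       # foreign key into child table
        root.append(flat)
        for offset, value in enumerate(row.get(array_field, [])):
            child.append({"id": pk, "index": offset, "val": value})
    return root, child

orders = [
    {"order_id": "A1", "items": ["book", "pen"]},
    {"order_id": "A2", "items": ["mug"]},
]
root, items = relationalize(orders, "items")
# root: one row per order; items: one row per array element, keyed back to root.
```

Once flattened like this, the child table can be joined and aggregated with ordinary SQL instead of repeatedly re-parsing nested JSON.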
22. Job authoring: Glue transforms
Prebuilt transformations: click and add to your job with simple configuration
Spigot writes sample data from a DynamicFrame to S3 in JSON format
Expanding… more transformations to come
23. Job authoring: Write your own scripts
Import custom libraries required by your code
Convert to an Apache Spark Data Frame for complex SQL-based ETL
Convert back to a Glue Dynamic Frame for semi-structured processing and AWS Glue connectors
24. Job authoring: Developer endpoints
Environment to iteratively develop and test ETL code.
Connect your IDE or notebook (e.g. Zeppelin) to a Glue development endpoint.
When you are satisfied with the results you can create an ETL job that runs your code.
[Diagram: your IDE or notebook connects through a remote interpreter to an interpreter server running in the Glue Apache Spark environment.]