© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Architecting a Serverless Data Lake on
AWS
What is a Data Lake?
A Data Lake allows you to store all your structured and
unstructured data, in one centralized repository, and at
any scale. With a Data Lake, you can store your data as-
is, without having to first structure the data, based on
potential questions you may have in the future. Data Lakes
also allow you to run different types of analytics on your
data like SQL queries, big data analytics, full text search,
real-time analytics, and machine learning to guide better
decisions.
Characteristics of a Data Lake
Future Proof
Flexible Access
Dive in Anywhere
Collect Anything
What is Serverless computing?
No Server Management
High Availability
No Idle Capacity
Flexible Scaling
Building a Data Lake on AWS
Analytics | Machine Learning
Real-time Data Movement
On-premises Data Movement
Data Lake on AWS
Storage | Archival Storage | Data Catalog
Data Movement From on-premises Datacenters
AWS Snowball, Snowball Edge, and Snowmobile
Petabyte- and exabyte-scale data transport solutions that use secure appliances to transfer large amounts of data into and out of the AWS cloud
AWS Direct Connect
Establishes a dedicated network connection from your premises to AWS; reduces network costs, increases bandwidth throughput, and provides a more consistent network experience than Internet-based connections
AWS Storage Gateway
Lets your on-premises applications use AWS for storage; includes a highly optimized data transfer mechanism and bandwidth management, along with a local cache
AWS Database Migration Service
Migrates databases from the most widely used commercial and open-source offerings to AWS quickly and securely, with minimal downtime to applications
Data Movement From Real-time Sources
Amazon Kinesis Video Streams
Securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing
Amazon Kinesis Data Firehose
Capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools
Amazon Kinesis Data Streams
Build custom, real-time applications that process data streams using popular stream processing frameworks
AWS IoT Core
Supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely
Amazon S3—Object Storage
Secure, highly scalable, durable object storage with millisecond latency for data access
Store any type of data: web sites, mobile apps, corporate applications, and IoT sensors
Security and Compliance
Three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie
Flexible Management
Classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention
Durability, Availability & Scalability
Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
Query in Place
Run analytics & ML on the data lake without data movement; S3 Select can retrieve a subset of data, improving analytics performance by up to 400%
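The lifecycle policies mentioned under Flexible Management can be sketched as a plain configuration document. This is a minimal sketch, not the deck's own material: the `raw/` prefix, the 90- and 365-day thresholds, and the bucket name in the comment are assumptions chosen for illustration.

```python
def lifecycle_configuration(prefix, archive_after_days, expire_after_days):
    """Build a lifecycle configuration that tiers objects to Glacier, then expires them."""
    if archive_after_days >= expire_after_days:
        raise ValueError("objects must be archived before they expire")
    return {
        "Rules": [
            {
                "ID": "tier-and-retain-" + prefix.strip("/"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Tiering: move objects to the archival storage class.
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                # Retention: delete objects once they are no longer needed.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


# Hypothetical values: archive raw data after 90 days, delete after one year.
config = lifecycle_configuration("raw/", archive_after_days=90, expire_after_days=365)

# With AWS credentials configured, the document could be applied with boto3:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake-bucket", LifecycleConfiguration=config)
```

Building the document in a function keeps the rule testable without touching AWS; only the commented-out call needs credentials.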
Amazon Glacier—Backup and Archive
Secure, durable, and extremely low-cost storage for data archiving and long-term backup
Store data at $0.004/GB/month
Durability, Availability & Scalability
Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
Secure
Log and monitor with CloudTrail; Vault Lock enables WORM storage capabilities, helping satisfy compliance requirements
Retrieves data in minutes
Three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes
Inexpensive
Lowest-cost AWS object storage class, allowing you to archive large amounts of data at a very low cost
Storing is Not Enough, Data Needs to Be Discoverable
Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).
Gartner IT Glossary, 2018
https://www.gartner.com/it-glossary/dark-data
Traditional enterprise data | Big data | Dark data
CRM, ERP, Data warehouse, Mainframe data, Web, Social, Log files, Machine data, Semi-structured, Unstructured
AWS Glue—Data Catalog
• Automatically discovers data and stores schema
• Catalog makes data searchable, and available for ETL
• Catalog contains table and job definitions
• Computes statistics to make queries efficient
AWS Glue—ETL Service
• Automatically generates ETL code
• Code is customizable with Python and Spark
• Endpoints provided to edit, debug, test code
• Jobs are scheduled or event-based
• Serverless
Amazon Redshift—Data Warehousing
Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost
Massively parallel, scale from gigabytes to petabytes
Fast at scale
Columnar storage technology to improve I/O efficiency and scale query performance
Secure
Audit everything; encrypt data end-to-end; extensive certification and compliance
Open file formats
Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3
Inexpensive
As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour
Amazon Redshift Spectrum
Redshift data | Redshift Spectrum query engine | S3 data lake
• Run Redshift SQL queries against exabytes of data in S3
• Join data across Redshift and S3
• Scale compute and storage separately
• Stable query performance and unlimited concurrency
• CSV, ORC, Grok, Avro, & Parquet data formats
• Pay only for the amount of data scanned
Amazon EMR—Big Data Processing
Analytics and ML at scale
19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
Enterprise-grade security
Low cost
Flexible billing with per-second billing, EC2 Spot, Reserved Instances, and auto-scaling to reduce costs 50–80%
Easy
Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, or cluster tuning
Latest versions
Updated with the latest open source frameworks within 30 days of release
Use S3 storage
Process data directly in the S3 data lake securely with high performance using the EMRFS connector
Amazon Elasticsearch Service
Easy to deploy, secure, operate, and scale Elasticsearch
Customers use Elasticsearch for log analytics, full-text search & application monitoring
Easy to Use
Fully managed; deploy production-ready clusters in minutes
Secure
Secure access with VPC to keep all traffic within the AWS network
Open
Direct access to Elasticsearch open-source APIs; supports Logstash and Kibana
Available
Zone awareness replicates data between two AZs; automatically monitors & replaces failed nodes
Amazon Kinesis Data Analytics
Amazon Athena—Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Query Instantly
Zero setup cost; just point to S3 and start querying
Open
ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
Easy
Serverless: zero infrastructure, zero administration; integrated with QuickSight
Pay per query
Pay only for queries run; save 30–90% on per-query costs through compression
Amazon QuickSight
Fast, easy to use, serverless analytics at 1/10 the cost of traditional BI
Empower everyone
Seamless connectivity
Fast analysis
Serverless
Machine Learning on AWS
APPLICATION SERVICES: Rekognition, Transcribe, Translate, Polly, Comprehend, Lex
PLATFORM SERVICES: Amazon SageMaker, AWS DeepLens
FRAMEWORKS & INTERFACES: AWS Deep Learning AMIs (Caffe2, CNTK, Apache MXNet, PyTorch, TensorFlow, Torch, Keras, Gluon)
INFRASTRUCTURE: CPU, GPU (P3), IoT & Edge, Mobile
Demo
Are we all ready to build a
Data Lake?
Let's Do That
Right Here... Right Now!
What are we building?
Ingest: Kinesis Data Generator (Transactions) → Kinesis Data Firehose Delivery Stream
Kinesis Data Firehose – How it Works
Ingest → Transform → Deliver
Sources: AWS IoT, Amazon Kinesis Agent, Amazon Kinesis Streams, Amazon CloudWatch Logs, Amazon CloudWatch Events, Apache Kafka
Destinations: Amazon S3, Amazon Redshift, Amazon Elasticsearch Service
Key Features
Data durability:
• Data backup to S3 upon delivery or transformation failure
• 3X data replication in delivery stream for high data durability
Up to 24 hours of data retention in the delivery stream to absorb backpressure from destinations
Serverless Data Transformation
Kinesis Data Firehose → AWS Lambda
Pre-Built Data Transformation Blueprints
• General Processing
• Apache Log to JSON
• Apache Log to CSV
• Syslog to JSON
• Syslog to CSV
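A transformation in the spirit of the "Apache Log to JSON" blueprint can be sketched as a Lambda handler. This is a hand-written illustration, not the blueprint's actual code: the regular expression assumes the Apache Common Log Format, and the field names are chosen for the example. The event and response shapes (`records`, `recordId`, `result`, base64-encoded `data`) follow the Firehose data transformation contract.

```python
import base64
import json
import re

# Apache Common Log Format: host ident user [time] "request" status size
APACHE_LOG = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)$'
)


def lambda_handler(event, context):
    """Convert each Apache log line to JSON; flag lines that do not parse."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        match = APACHE_LOG.match(line)
        if match is None:
            # Return the original data unchanged and mark the record as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue
        # One JSON document per line keeps the delivered objects easy to query.
        payload = json.dumps(match.groupdict()) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```

Every incoming `recordId` is returned exactly once, with a result of `Ok` or `ProcessingFailed`, so the delivery stream can account for each record.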
What are we building?
Data Lake
Ingest: Kinesis Data Generator (Transactions) → Kinesis Data Firehose Delivery Stream
Store & Catalog: Amazon S3 (Raw) holding Transactions and Reference data, cataloged in AWS Glue (Data Catalog)
AWS Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single categorized list that is searchable
• Unified Metadata Repository across Data Stores
• Schema Versioning
• Shared across AWS Glue, Amazon Athena, Amazon Redshift Spectrum and Amazon EMR
What are Crawlers?
Crawlers automatically build your Data Catalog and keep it in sync.
• Scan your data stored in various data stores, extract metadata and data statistics, and add table definitions to your Data Catalog
• Classify data using built-in and custom classifiers
• You can write your own using Grok expressions
• Discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
• Run ad hoc or on a schedule; serverless – only pay when the crawler runs
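Hive-style partitions encode column values in the object key as `name=value` folders. The sketch below shows how such a key maps to partition columns; it is an illustration of the naming convention only, not the crawler's implementation, and the example key is hypothetical.

```python
def parse_hive_partitions(key):
    """Return {partition column: value} for each 'name=value' folder in an S3 key."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # the last segment is the object name
        name, separator, value = segment.partition("=")
        if separator and name and value:
            partitions[name] = value
    return partitions


# Hypothetical key written by a delivery stream with a date-based prefix.
key = "transactions/year=2018/month=05/day=30/part-0000.json"
partitions = parse_hive_partitions(key)
print(partitions)
```

Folders without an `=` (here `transactions`) are treated as the table location rather than as partition columns.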
Custom Classifiers
You can write a custom classifier by providing a Grok pattern and a classification string for the matched schema.
A Grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time.
Example:
%{TIMESTAMP_ISO8601:timestamp} \[%{MESSAGEPREFIX:message_prefix}\] %{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}
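To make the line-at-a-time matching concrete, the sketch below expands a Grok pattern like the example into an ordinary regular expression with named groups. It is a toy expander, not Glue's classifier: the sub-pattern definitions in `PATTERNS` are simplified assumptions, the literal brackets are written escaped, and the log line is invented.

```python
import re

# Simplified, assumed definitions of the named sub-patterns (not Glue's built-ins).
PATTERNS = {
    "TIMESTAMP_ISO8601": r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?",
    "MESSAGEPREFIX": r"[^\]]+",
    "CRAWLERLOGLEVEL": r"BENCHMARK|ERROR|WARN|INFO|TRACE",
    "GREEDYDATA": r".*",
}

GROK_FIELD = re.compile(r"%\{(\w+):(\w+)\}")


def grok_to_regex(grok):
    """Replace each %{NAME:field} with a named capture group and anchor the line."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return "(?P<%s>%s)" % (field, PATTERNS[name])
    return re.compile("^" + GROK_FIELD.sub(expand, grok) + "$")


grok = (r"%{TIMESTAMP_ISO8601:timestamp} \[%{MESSAGEPREFIX:message_prefix}\] "
        r"%{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}")
line = "2018-05-30T12:34:56Z [crawler-1] INFO : Crawl started"
fields = grok_to_regex(grok).match(line).groupdict()
print(fields)
```

Each named field in the pattern becomes a column candidate, which is how a classifier can turn free-form log lines into a table schema.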
What are we building?
Data Lake
Ingest: Kinesis Data Generator (Transactions) → Kinesis Data Firehose Delivery Stream
Store & Catalog: Amazon S3 (Raw) holding Transactions and Reference data, cataloged in AWS Glue (Data Catalog)
Process: AWS Glue (Transform) enriches the raw data and writes Enriched data to Amazon S3 (Processed)
Job Authoring: Automatic Code Generation
1. Customize the mappings
2. Glue generates a transformation graph and Python or Scala code
3. Customize the code based on your requirements
Job Authoring: Developer Endpoints
• Environment to iteratively explore data with Apache Spark SQL
• Develop and test ETL code.
• Connect your IDE or notebook (e.g. Zeppelin) to a Glue development endpoint.
• When you are satisfied with the results you can create an ETL job that runs your code.
Remote interpreter → Interpreter server (Glue's Apache Spark environment)
DynamicFrame Transforms
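ResolveChoice(): resolve a column (B) that holds more than one type by projecting, casting, or separating into columns
ApplyMapping(): map source columns to target columns (diagram columns: A, X, Y, C)
15+ transforms out-of-the-box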
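The behavior of the cast and separate-into-columns strategies, and of a simple column mapping, can be sketched on plain Python dictionaries. This is an illustration of the ideas only and does not use the Glue library: the function names, the `column_type` naming of split columns, and the three-element mapping tuples are assumptions made for the example.

```python
def resolve_choice_cast(records, column, to_type):
    """'cast' strategy: convert every value in the column to one type."""
    return [{**record, column: to_type(record[column])} for record in records]


def resolve_choice_make_cols(records, column):
    """'separate into cols' strategy: one output column per observed type."""
    type_names = sorted({type(record[column]).__name__ for record in records})
    resolved = []
    for record in records:
        row = {k: v for k, v in record.items() if k != column}
        for type_name in type_names:
            matches = type(record[column]).__name__ == type_name
            row["%s_%s" % (column, type_name)] = record[column] if matches else None
        resolved.append(row)
    return resolved


def apply_mapping(records, mappings):
    """Rename and cast columns; mappings are (source, target, cast function)."""
    return [
        {target: cast(record[source]) for source, target, cast in mappings}
        for record in records
    ]


# Column B arrives as an int in one record and as a string in another.
rows = [{"id": 1, "B": 10}, {"id": 2, "B": "20"}]
as_int = resolve_choice_cast(rows, "B", int)
as_cols = resolve_choice_make_cols(rows, "B")
renamed = apply_mapping(as_int, [("id", "record_id", str), ("B", "amount", float)])
```

Casting keeps one column and loses the original representation, while separating into columns keeps both at the cost of a wider schema; which is preferable depends on how the ambiguity arose.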
Relationalize() Transform
Semi-structured schema: A, B (two possible types), C {X, Y}, D [ ]
Relational schema: main table with columns A, B, B, C.X, C.Y and a foreign key (FK); child table with PK, Offset, Value
• Transforms and adds new columns, types, and tables on-the-fly
• Tracks keys and foreign keys across runs
• SQL on the relational schema is orders of magnitude faster than JSON processing
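The flattening idea can be sketched in plain Python: nested structs become dotted columns, and arrays move into a child table keyed back to the parent row. This is a simplified illustration, not the Glue transform itself; the table names, the `id`/`offset`/`value` column names, and the sample record are assumptions, and arrays of structs are left unflattened.

```python
def relationalize(records, root="root"):
    """Flatten structs into dotted columns and move arrays into child tables."""
    tables = {root: []}

    def flatten(prefix, value, row, row_id):
        if isinstance(value, dict):
            for key, child in value.items():
                flatten(prefix + "." + key if prefix else key, child, row, row_id)
        elif isinstance(value, list):
            child_table = tables.setdefault(root + "_" + prefix, [])
            for offset, item in enumerate(value):
                child_table.append({"id": row_id, "offset": offset, "value": item})
            row[prefix] = row_id  # foreign key into the child table
        else:
            row[prefix] = value

    for row_id, record in enumerate(records, start=1):
        row = {}
        flatten("", record, row, row_id)
        tables[root].append(row)
    return tables


# One semi-structured record with a struct (C) and an array (D).
tables = relationalize([{"A": 1, "C": {"X": 2, "Y": 3}, "D": [7, 8]}])
```

The result holds a flat `root` table plus a `root_D` table whose `id` column joins back to the parent, which is what makes plain SQL joins possible on the output.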
Job Bookmarks
Suppose you want to periodically run a job and:
• avoid reprocessing previous input
• avoid generating duplicate output
Examples:
• Process githubarchive files daily
• Process firehose files hourly
• Track timestamps or primary keys in DBs
• Track generated foreign keys for normalization
Bookmarks are per-job checkpoints that track persisted state from previous runs. They track the state of sources, transforms, and sinks.
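The checkpoint idea can be sketched with a small state file that records which inputs earlier runs already handled. This is a toy model of a bookmark, not how Glue stores its state: the JSON file, the file names, and the `run_job` helper are all hypothetical.

```python
import json
import os
import tempfile


def run_job(input_files, bookmark_path):
    """Process only files not seen in earlier runs, then persist the new state."""
    processed = set()
    if os.path.exists(bookmark_path):
        with open(bookmark_path) as handle:
            processed = set(json.load(handle))
    new_files = [name for name in input_files if name not in processed]
    # ... transform and write the contents of new_files here ...
    with open(bookmark_path, "w") as handle:
        json.dump(sorted(processed | set(new_files)), handle)
    return new_files


bookmark = os.path.join(tempfile.mkdtemp(), "bookmark.json")
run_1 = run_job(["2018-05-29.json"], bookmark)
run_2 = run_job(["2018-05-29.json", "2018-05-30.json"], bookmark)
run_3 = run_job(["2018-05-29.json", "2018-05-30.json"], bookmark)
```

The second run picks up only the new file and the third run finds nothing to do, which is the "avoid reprocessing previous input" property described above.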
Job Execution: Scheduling and Monitoring
Compose jobs globally with event-based dependencies
• Easy to reuse and leverage work across organization boundaries
Multiple triggering mechanisms
• Schedule-based: e.g., time of day
• Event-based: e.g., job completion, job failure, job stopping events
• On-demand: e.g., AWS Lambda
• …More coming soon!
Logs and alerts are available in Amazon CloudWatch
Example jobs: Marketing: Ad-spend by customer segment (event-based, Lambda trigger); Sales: Revenue by customer segment (schedule, weekly sales); Central: ROI by customer segment (data-based)
Job Execution: Serverless
• Auto-configure VPC and role-based access
• Customers can specify the capacity that gets allocated to each job
• You pay only for the resources you consume while consuming them
There is no need to provision, configure, or manage servers
Diagram labels: Customer VPC, Compute instances
What are we building?
Data Lake
Ingest: Kinesis Data Generator (Transactions) → Kinesis Data Firehose Delivery Stream
Store & Catalog: Amazon S3 (Raw) holding Transactions and Reference data, cataloged in AWS Glue (Data Catalog)
Process: AWS Glue (Transform) enriches the raw data and writes Enriched data to Amazon S3 (Processed)
Consume: Amazon Athena (Explore)
Amazon Athena: Data Directly from Amazon S3
• No loading of data
• Query data in its raw format
• Athena supports multiple data formats
• Text, CSV, TSV, JSON, weblogs, AWS service logs
• Or convert to an optimized form like ORC or Parquet for the best performance and lowest cost
• No ETL required
• Stream data directly from Amazon S3
• Take advantage of Amazon S3 durability and availability
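Querying data in place can be sketched with the boto3 Athena client, which exposes `start_query_execution` and `get_query_execution`. This is a minimal sketch under assumptions: the helper name `run_athena_query`, the table and database names, and the results bucket are placeholders, and the client is passed in so the polling logic can be exercised without AWS credentials.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Start a query, wait for a terminal state, and return (query_id, state)."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)


# Usage with real credentials (all names below are placeholders):
#   import boto3
#   run_athena_query(boto3.client("athena"),
#                    "SELECT * FROM transactions LIMIT 10",
#                    "datalake", "s3://my-athena-results/")
```

Athena writes query results to the S3 output location, so the caller reads them from there (or through the client's result API) once the state is `SUCCEEDED`.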
Use ANSI SQL
• Start writing ANSI SQL
• Support for complex joins, nested queries & window functions
• Support for complex data types (arrays, structs)
• Support for partitioning of data by any key
• (date, time, custom keys)
• e.g., Year, Month, Day, Hour or Customer Key, Date
Familiar Technologies Under the Covers
Presto: used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
What are we building?
Data Lake
Ingest: Kinesis Data Generator (Transactions) → Kinesis Data Firehose Delivery Stream
Store & Catalog: Amazon S3 (Raw) holding Transactions and Reference data, cataloged in AWS Glue (Data Catalog)
Process: AWS Glue (Transform) enriches the raw data and writes Enriched data to Amazon S3 (Processed)
Consume: Amazon Athena (Explore) and Amazon QuickSight
QuickSight: Connect to data wherever it is
QuickSight is natively integrated with AWS data sources, as well as on-premises and hosted databases and third-party business applications
On-premises
Securely connect to on-premises databases and flat files like Excel and CSV
• Excel
• CSV
• Teradata
• MySQL
• SQL Server
• PostgreSQL
In the cloud
Connect to hosted databases, big data formats, and secure VPCs
• Redshift
• RDS
• S3
• Athena
• Aurora
• Teradata
• MySQL
• Presto
• Spark
• SQL Server
• PostgreSQL
• MariaDB
• Snowflake
Applications
Connect directly to third-party business applications
• Salesforce
• Square
• Adobe Analytics
• Jira
• ServiceNow
• Twitter
• GitHub
SPICE
QuickSight is powered by SPICE, a super-fast calculation engine that delivers
performance and scale, regardless of how many users are active.
Your Data Source → SPICE
Data Governance
Create managed datasets that give power users and authors the flexibility to
perform self-serve analytics on data that you control.
Create datasets that:
• Can be shared with any user
• Automatically refresh
• Have row level security
• Users cannot modify
• Dynamically update with changes
User Management and AD Integration
QuickSight Enterprise Edition can integrate with your Active Directory to
dynamically manage users and groups.
Thank you

Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
 
AWS reInvent 2018 recap edition
AWS reInvent 2018 recap editionAWS reInvent 2018 recap edition
AWS reInvent 2018 recap edition
 
Big data journey to the cloud rohit pujari 5.30.18
Big data journey to the cloud   rohit pujari 5.30.18Big data journey to the cloud   rohit pujari 5.30.18
Big data journey to the cloud rohit pujari 5.30.18
 
Backup & Recovery - Optimize Your Backup and Restore Architectures in the Cloud
Backup & Recovery - Optimize Your Backup and Restore Architectures in the CloudBackup & Recovery - Optimize Your Backup and Restore Architectures in the Cloud
Backup & Recovery - Optimize Your Backup and Restore Architectures in the Cloud
 
Building with Purpose - Built Databases: Match Your Workloads to the Right Da...
Building with Purpose - Built Databases: Match Your Workloads to the Right Da...Building with Purpose - Built Databases: Match Your Workloads to the Right Da...
Building with Purpose - Built Databases: Match Your Workloads to the Right Da...
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best PracticesBuild Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best PracticesBuild Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
An Overview of AWS Services for Data Storage and Migration - SRV205 - Atlanta...
An Overview of AWS Services for Data Storage and Migration - SRV205 - Atlanta...An Overview of AWS Services for Data Storage and Migration - SRV205 - Atlanta...
An Overview of AWS Services for Data Storage and Migration - SRV205 - Atlanta...
 
Building Data Lakes and Analytics on AWS. IPExpo Manchester.
Building Data Lakes and Analytics on AWS. IPExpo Manchester.Building Data Lakes and Analytics on AWS. IPExpo Manchester.
Building Data Lakes and Analytics on AWS. IPExpo Manchester.
 

Mehr von Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mehr von Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Architecting a Serverless Data Lake on AWS

  • 1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Architecting a Serverless Data Lake on AWS
  • 2. What is a Data Lake? A Data Lake allows you to store all your structured and unstructured data in one centralized repository, at any scale. With a Data Lake, you can store your data as-is, without first having to structure it around questions you may have in the future. Data Lakes also let you run different types of analytics on your data, such as SQL queries, big data analytics, full-text search, real-time analytics, and machine learning, to guide better decisions.
  • 3. Characteristics of a Data Lake: Future Proof; Flexible Access; Dive in Anywhere; Collect Anything
  • 4. What is Serverless computing? No Server Management; High Availability; No Idle Capacity; Flexible Scaling
  • 5. Building a Data Lake on AWS: Analytics; Machine Learning; Real-time Data Movement; On-premises Data Movement; Data Lake on AWS (Storage | Archival Storage | Data Catalog)
  • 6. Building a Data Lake on AWS: Analytics; Machine Learning; Real-time Data Movement; On-premises Data Movement; Data Lake on AWS (Storage | Archival Storage | Data Catalog)
  • 7. Data Movement From On-premises Datacenters. AWS Snowball, Snowball Edge and Snowmobile: petabyte- and exabyte-scale data transport solutions that use secure appliances to transfer large amounts of data into and out of the AWS cloud. AWS Direct Connect: establish a dedicated network connection from your premises to AWS; reduces your network costs, increases bandwidth throughput, and provides a more consistent network experience than Internet-based connections. AWS Storage Gateway: lets your on-premises applications use AWS for storage; includes a highly optimized data transfer mechanism, bandwidth management, and a local cache. AWS Database Migration Service: migrate databases from the most widely used commercial and open-source offerings to AWS quickly and securely, with minimal downtime to applications.
  • 8. Data Movement From Real-time Sources. Amazon Kinesis Video Streams: securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing. Amazon Kinesis Data Firehose: capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools. Amazon Kinesis Data Streams: build custom, real-time applications that process data streams using popular stream processing frameworks. AWS IoT Core: supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely.
  • 9. Building a Data Lake on AWS: Analytics; Machine Learning; Real-time Data Movement; On-premises Data Movement; Data Lake on AWS (Storage | Archival Storage | Data Catalog)
  • 10. Amazon S3: Object Storage. Secure, highly scalable, durable object storage with millisecond latency for data access. Store any type of data: web sites, mobile apps, corporate applications, and IoT sensors. Security and Compliance: three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie. Flexible Management: classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention. Durability, Availability & Scalability: built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region. Query in Place: run analytics and ML on the data lake without data movement; S3 Select can retrieve a subset of data, improving analytics performance by up to 400%.
  • 11. Amazon Glacier: Backup and Archive. Secure, durable, and extremely low-cost storage for data archiving and long-term backup. Store data at $0.004/GB/month. Durability, Availability & Scalability: built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region. Secure: log and monitor with CloudTrail; Vault Lock enables WORM storage capabilities, helping satisfy compliance requirements. Retrieves data in minutes: three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes. Inexpensive: lowest-cost AWS object storage class, allowing you to archive large amounts of data at a very low cost.
  • 12. Storing is Not Enough, Data Needs to Be Discoverable. Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing). Source: Gartner IT Glossary, 2018, https://www.gartner.com/it-glossary/dark-data. Diagram labels: Traditional enterprise data; Big data; Dark data; CRM; ERP; Data warehouse; Mainframe data; Web; Social; Log files; Machine data; Semi-structured; Unstructured.
  • 13. AWS Glue: Data Catalog • Automatically discovers data and stores schema • Catalog makes data searchable and available for ETL • Catalog contains table and job definitions • Computes statistics to make queries efficient. Diagram labels: Glue Data Catalog; Discover data and extract schema; Compliance.
  • 14. AWS Glue: ETL Service • Automatically generates ETL code • Code is customizable with Python and Spark • Endpoints provided to edit, debug, and test code • Jobs are scheduled or event-based • Serverless
  • 15. Building a Data Lake on AWS: Analytics; Machine Learning; Real-time Data Movement; On-premises Data Movement; Data Lake on AWS (Storage | Archival Storage | Data Catalog)
  • 16. Amazon Redshift: Data Warehousing. Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost. Massively parallel; scales from gigabytes to petabytes. Fast at scale: columnar storage technology to improve I/O efficiency and scale query performance. Secure: audit everything; encrypt data end-to-end; extensive certification and compliance. Open file formats: analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3. Inexpensive: as low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour.
  • 17. Amazon Redshift Spectrum. Diagram labels: S3 data lake; Redshift data; Redshift Spectrum query engine. • Exabyte-scale Redshift SQL queries against S3 • Join data across Redshift and S3 • Scale compute and storage separately • Stable query performance and unlimited concurrency • CSV, ORC, Grok, Avro, and Parquet data formats • Pay only for the amount of data scanned
  • 18. Amazon EMR: Big Data Processing. Analytics and ML at scale. 19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more. Enterprise-grade security. Low cost: flexible per-second billing, EC2 Spot, reserved instances, and auto-scaling to reduce costs 50–80%. Easy: launch fully managed Hadoop and Spark in minutes; no cluster setup, node provisioning, or cluster tuning. Latest versions: updated with the latest open-source frameworks within 30 days of release. Use S3 storage: process data directly in the S3 data lake securely and with high performance using the EMRFS connector.
  • 19. Amazon Elasticsearch Service. Easy to deploy, secure, operate, and scale Elasticsearch. Customers use Elasticsearch for log analytics, full-text search, and application monitoring. Easy to Use: fully managed; deploy production-ready clusters in minutes. Secure: secure access with VPC to keep all traffic within the AWS network. Open: direct access to Elasticsearch open-source APIs; supports Logstash and Kibana. Available: zone awareness replicates data between two AZs; automatically monitors and replaces failed nodes.
  • 20. Amazon Kinesis Data Analytics
  • 21. Amazon Athena: Interactive Analysis. Interactive query service to analyze data in Amazon S3 using standard SQL. No infrastructure to set up or manage and no data to load. Query Instantly: zero setup cost; just point to S3 and start querying. Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types. Easy: serverless, with zero infrastructure and zero administration; integrated with QuickSight. Pay per query: pay only for queries run; save 30–90% on per-query costs through compression.
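The pay-per-query point on slide 21 comes down to simple arithmetic: the less data a query scans, the less it costs. The sketch below is illustrative only. It assumes a list price of $5 per terabyte scanned, which is not stated on the slide and should be checked against current pricing, and a hypothetical 4:1 compression ratio.

```python
# Assumed price per TB scanned; not from the slides, verify before relying on it.
PRICE_PER_TB_USD = 5.0
BYTES_PER_TB = 1024 ** 4


def query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Return the cost in USD of a query that scans the given number of bytes."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


# Hypothetical dataset: 1 TB as raw text, 0.25 TB after compression.
raw_cost = query_cost(1 * BYTES_PER_TB)
compressed_cost = query_cost(BYTES_PER_TB / 4)
savings = 1 - compressed_cost / raw_cost
print(f"raw=${raw_cost:.2f} compressed=${compressed_cost:.2f} savings={savings:.0%}")
```

With these assumed numbers the saving is 75%, which falls inside the 30–90% range quoted on the slide.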
  • 22. Amazon QuickSight. Fast, easy-to-use, serverless analytics at 1/10 the cost of traditional BI. Empower everyone; Seamless connectivity; Fast analysis; Serverless
  • 23. Building a Data Lake on AWS: Analytics; Machine Learning; Real-time Data Movement; On-premises Data Movement; Data Lake on AWS (Storage | Archival Storage | Data Catalog)
  • 24. Machine Learning on AWS. APPLICATION SERVICES: Rekognition, Transcribe, Translate, Polly, Comprehend, Lex. PLATFORM SERVICES: Amazon SageMaker, AWS DeepLens. FRAMEWORKS & INTERFACES: Caffe2, CNTK, Apache MXNet, PyTorch, TensorFlow, Torch, Keras, Gluon, AWS Deep Learning AMIs. INFRASTRUCTURE: CPU, GPU (P3), IoT & Edge, Mobile.
  • 25. Demo: Are we all ready to build a Data Lake?
  • 26. Demo: Let's Do That Right Here... Right Now!
  • 27. What are we building? Ingest: Kinesis Data Generator (Transactions) -> Kinesis Data Firehose Delivery Stream
  • 28. Kinesis Data Firehose: How it Works. Stages: Ingest, Transform, Deliver. Sources: AWS IoT, Amazon Kinesis Agent, Amazon Kinesis Streams, Amazon CloudWatch Logs, Amazon CloudWatch Events, Apache Kafka. Destinations: Amazon S3, Amazon Redshift, Amazon Elasticsearch Service.
  • 29. Key Features. Data durability: • Data backup to S3 upon delivery or transformation failure • 3X data replication in the delivery stream for high data durability • Up to 24 hours of data retention in the delivery stream to absorb backpressure from destinations
  • 30. Serverless Data Transformation: Kinesis Firehose with AWS Lambda. Pre-built data transformation blueprints: • General Processing • Apache Log to JSON • Apache Log to CSV • Syslog to JSON • Syslog to CSV
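A Firehose transformation function in the spirit of the "General Processing" blueprint on slide 30 could look roughly like the sketch below. It is not the blueprint's actual code. It relies on the record contract Firehose uses for Lambda transformations (base64-encoded `data`, a `recordId`, and a `result` of `Ok`, `Dropped`, or `ProcessingFailed`); the `source` field and its value are hypothetical enrichments added for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    """Decode each Firehose record, add one field, and re-encode it."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical enrichment: tag each transaction with its origin.
            payload["source"] = "kinesis-data-generator"
            line = json.dumps(payload) + "\n"  # newline-delimited JSON for S3
            data = base64.b64encode(line.encode("utf-8")).decode("utf-8")
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (ValueError, TypeError):
            # Records that are not JSON objects are handed back unchanged and
            # flagged as failed rather than silently dropped.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Returning every `recordId` exactly once matters: Firehose matches transformed records to the originals by that identifier.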
  • 31. What are we building? Ingest: Kinesis Data Generator (Transactions) -> Kinesis Data Firehose Delivery Stream. Store & Catalog (Data Lake): Amazon S3 (Raw) and AWS Glue (Data Catalog), holding • Transactions • Reference
  • 32. AWS Glue Data Catalog. Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single categorized list that is searchable • Unified metadata repository across data stores • Schema versioning • Shared across AWS Glue, Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR
  • 33. What are Crawlers? Crawlers automatically build your Data Catalog and keep it in sync. • Scan your data stored in various data stores, extract metadata and data statistics, and add table definitions to your Data Catalog • Classify data using built-in and custom classifiers (you can write your own using Grok expressions) • Discover new data and extract schema definitions • Detect schema changes and version tables • Detect Hive-style partitions on Amazon S3 • Run ad hoc or on a schedule; serverless, so you only pay when the crawler runs
  • 34. Custom Classifiers. You can write a custom classifier by providing a Grok pattern and a classification string for the matched schema. A Grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time. Example: %{TIMESTAMP_ISO8601:timestamp} [%{MESSAGEPREFIX:message_prefix}] %{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}
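To see what the Grok example on slide 34 does line by line, here is a rough equivalent written as a plain Python regular expression with named groups. This is only an approximation for illustration: Glue evaluates real Grok patterns, and the definitions assumed here for `TIMESTAMP_ISO8601`, `MESSAGEPREFIX`, and `CRAWLERLOGLEVEL` (including the list of log levels) are guesses, not taken from the slide.

```python
import re

# Each named group stands in for one %{PATTERN:field} element of the Grok example.
LINE_PATTERN = re.compile(
    # Assumed shape of an ISO 8601 timestamp.
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?(?:Z|[+-]\d{2}:?\d{2})?) "
    # Assumed: the message prefix is anything between square brackets.
    r"\[(?P<message_prefix>[^\]]*)\] "
    # Assumed set of log levels.
    r"(?P<loglevel>BENCHMARK|ERROR|WARN|INFO|TRACE) : "
    # GREEDYDATA: the rest of the line.
    r"(?P<message>.*)"
)


def parse_line(line):
    """Return the captured fields as a dict, or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None


sample = "2018-05-30T12:34:56Z [a-b-c-d-e] INFO : Crawler started"
print(parse_line(sample))
```

A classifier built this way yields one column per named field, which is the schema the crawler would record for matching files.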
  • 35. What are we building? Ingest: Kinesis Data Generator (Transactions) -> Kinesis Data Firehose Delivery Stream. Store & Catalog (Data Lake): Amazon S3 (Raw) and AWS Glue (Data Catalog), holding • Transactions • Reference • Enriched. Process: AWS Glue (Transform) enriches the data and writes to Amazon S3 (Processed).
  • 36. Job Authoring: Automatic Code Generation 1. Customize the mappings 2. Glue generates a transformation graph and Python or Scala code 3. Customize the code based on your requirements
  • 37. Job Authoring: Developer Endpoints • Environment to iteratively explore data with Apache Spark SQL • Develop and test ETL code • Connect your IDE or notebook (e.g. Zeppelin) to a Glue development endpoint • When you are satisfied with the results, you can create an ETL job that runs your code. Diagram labels: Glue's Apache Spark environment; Remote interpreter; Interpreter server.
  • 38. DynamicFrame Transforms. ResolveChoice(): project, cast, or separate into columns. ApplyMapping(). 15+ transforms out of the box.
  • 39. Relationalize() Transform. Diagram: a semi-structured schema (A, B, C {X, Y}, D [ ]) becomes a relational schema (columns A, B, C.X, C.Y, plus a separate table with PK, FK, Offset, and Value columns). • Transforms and adds new columns, types, and tables on-the-fly • Tracks keys and foreign keys across runs • SQL on the relational schema is orders of magnitude faster than JSON processing
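The idea behind slide 39 can be sketched in a few lines of plain Python: nested structs are flattened into dotted column names, and an array field is moved to a child table linked back by a foreign key and an offset. This is a simplified illustration, not Glue's `Relationalize` implementation; the function name, the `pk`/`fk`/`offset`/`value` column names, and the single-array limitation are all assumptions.

```python
def relationalize(records, array_field, root_name="root"):
    """Split records with one array field into a root table and a child table."""
    root_rows, child_rows = [], []
    for pk, record in enumerate(records, start=1):
        row = {"pk": pk}
        for key, value in record.items():
            if key == array_field:
                # Each array element becomes a child row pointing at its parent.
                for offset, item in enumerate(value):
                    child_rows.append({"fk": pk, "offset": offset, "value": item})
            elif isinstance(value, dict):
                # One level of struct nesting is flattened to "C.X"-style columns.
                for sub_key, sub_value in value.items():
                    row[f"{key}.{sub_key}"] = sub_value
            else:
                row[key] = value
        root_rows.append(row)
    return {root_name: root_rows, f"{root_name}_{array_field}": child_rows}


tables = relationalize([{"A": 1, "B": "x", "C": {"X": 2, "Y": 3}, "D": [10, 20]}], "D")
print(tables["root"])
print(tables["root_D"])
```

Once the data is in this shape, an ordinary SQL join on `pk = fk` recovers the array elements for each parent row.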
  • 40. Job Bookmarks. Bookmarks are per-job checkpoints that track persisted state from previous runs. They track the state of sources, transforms, and sinks across runs (run 1, run 2, run 3). Suppose you want to: periodically run a job; avoid reprocessing previous input; avoid generating duplicate output. Examples: process githubarchive files daily; process Firehose files hourly; track timestamps or primary keys in DBs; track generated foreign keys for normalization.
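The bookmark idea on slide 40 can be illustrated with a small checkpoint that remembers which inputs earlier runs already handled. This is a conceptual sketch, not how Glue stores bookmarks: the `bookmark` dict stands in for whatever persisted state the service keeps, and `run_job` and the file names are hypothetical.

```python
def run_job(input_files, bookmark, process):
    """Process only files not seen in earlier runs, then advance the bookmark."""
    seen = set(bookmark.get("processed", []))
    new_files = [name for name in input_files if name not in seen]
    for name in new_files:
        process(name)
    # The checkpoint is advanced only after every new file was processed, so a
    # failed run leaves the bookmark untouched and the files are retried.
    bookmark["processed"] = sorted(seen.union(new_files))
    return new_files


bookmark = {}
handled = []
run_job(["2018-05-29.json", "2018-05-30.json"], bookmark, handled.append)
run_job(["2018-05-29.json", "2018-05-30.json", "2018-05-31.json"], bookmark, handled.append)
print(handled)
```

The second run only touches the file that arrived since the first, which is how reprocessing and duplicate output are avoided.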
  • 41. Job Execution: Scheduling and Monitoring. Compose jobs globally with event-based dependencies • Easy to reuse and leverage work across organization boundaries. Multiple triggering mechanisms • Schedule-based: e.g., time of day • Event-based: e.g., job completion, job failure, job stopping events • On-demand: e.g., AWS Lambda • More coming soon! Logs and alerts are available in Amazon CloudWatch. Diagram labels: Marketing: Ad-spend by customer segment; Sales: Revenue by customer segment; Central: ROI by customer segment; Weekly sales; triggers: Event Based, Lambda Trigger, Schedule, Data based.
  • 42. Job Execution: Serverless • Auto-configure VPC and role-based access • Customers can specify the capacity that gets allocated to each job • You pay only for the resources you consume while consuming them. There is no need to provision, configure, or manage servers. Diagram labels: Customer VPC; Compute instances.
  • 43. What are we building? Ingest: Kinesis Data Generator (Transactions) -> Kinesis Data Firehose Delivery Stream. Store & Catalog (Data Lake): Amazon S3 (Raw) and AWS Glue (Data Catalog), holding • Transactions • Reference • Enriched. Process: AWS Glue (Transform) enriches the data and writes to Amazon S3 (Processed). Consume: explore with Amazon Athena.
  • 44. Amazon Athena: Data Directly from Amazon S3 • No loading of data • Query data in its raw format • Athena supports multiple data formats: text, CSV, TSV, JSON, weblogs, AWS service logs • Or convert to an optimized form like ORC or Parquet for the best performance and lowest cost • No ETL required • Stream data directly from Amazon S3 • Take advantage of Amazon S3 durability and availability
  • 45. Use ANSI SQL • Start writing ANSI SQL • Support for complex joins, nested queries, and window functions • Support for complex data types (arrays, structs) • Support for partitioning of data by any key (date, time, custom keys), e.g., Year, Month, Day, Hour or Customer Key, Date
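Partitioning by keys such as Year, Month, Day, and Hour is commonly expressed on S3 as Hive-style `key=value` path segments, the layout that slide 33 says crawlers can detect. The sketch below builds and parses such prefixes; the base path, file name, and helper names are hypothetical.

```python
from datetime import datetime


def partition_prefix(base, ts):
    """Build a Hive-style partition prefix for a timestamp."""
    return f"{base}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"


def parse_partitions(key):
    """Extract partition columns from the key=value segments of an object key."""
    partitions = {}
    for segment in key.split("/"):
        name, sep, value = segment.partition("=")
        if sep:
            partitions[name] = value
    return partitions


prefix = partition_prefix("processed/transactions", datetime(2018, 5, 30, 14))
print(prefix)
print(parse_partitions(prefix + "part-0000.parquet"))
```

A query that filters on these partition columns only needs to read objects under the matching prefixes, which is what keeps the scanned data, and therefore the cost, down.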
  • 46. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Familiar Technologies Under the Covers Used for SQL Queries In-memory distributed query engine ANSI-SQL compatible with extensions Used for DDL functionality Complex data types Multitude of formats Supports data partitioning
  • 47. What are we building? [Architecture diagram: Data Lake. Stages: Ingest, Store & Catalog, Process, Consume. Components: Kinesis Data Generator, Kinesis Data Firehose delivery stream, Amazon S3 (Raw), AWS Glue (Data Catalog), AWS Glue (Transform), Amazon S3 (Processed), Amazon Athena, Amazon QuickSight. Step labels: Transactions, Enrich, Explore. Data sets: Transactions, Reference, Enriched]
  • 48. QuickSight: Connect to data wherever it is. QuickSight is natively integrated with AWS data sources, as well as on-premises and hosted databases and third-party business applications. On-premises: securely connect to on-premises databases and flat files like Excel and CSV (Excel, CSV, Teradata, MySQL, SQL Server, PostgreSQL). In the cloud: connect to hosted databases, big data formats, and secure VPCs (Redshift, RDS, S3, Athena, Aurora, Teradata, MySQL, Presto, Spark, SQL Server, PostgreSQL, MariaDB, Snowflake). Applications: connect directly to third-party business applications (Salesforce, Square, Adobe Analytics, Jira, ServiceNow, Twitter, GitHub).
  • 49. SPICE. QuickSight is powered by SPICE, a super-fast calculation engine that delivers performance and scale, regardless of how many users are active. [Diagram labels: Your Data Source, SPICE]
  • 50. Data Governance. Create managed datasets that give power users and authors the flexibility to perform self-serve analytics on data that you control. Create datasets that: can be shared with any user; automatically refresh; have row-level security; users cannot modify; dynamically update with changes.
  • 51. User Management and AD Integration. QuickSight Enterprise Edition can integrate with your Active Directory to dynamically manage users and groups.
  • 52. Thank you