SlideShare ist ein Scribd-Unternehmen logo
1 von 38
Downloaden Sie, um offline zu lesen
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Javier Ramirez
@supercoco9
Technical Evangelist
Amazon Web Services
Big Data, Ingeniería de datos, y
Data Lakes en AWS
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Solution
My reports make
my database
server very slow
Before 2009
The DBA years
Overnight DB dump
Read-only replica
My data doesn’t fit in
one machine
And it’s not only
transactional
2009-2011
The Hadoop epiphany
Hadoop
Map/Reduce all the
things
My data is very
fast
Map/Reduce is
hard to use
2012-2014
The Message Broker
and NoSQL Age
Kafka/RabbitMQ
Cassandra/HBASE
/STORM
Basic ETL
Hive
Duplicating batch/stream is inefficient
I need to cleanse my source data
Hadoop ecosystem is hard to manage
My data scientists don’t like JAVA
I am not sure which data we are
already processing
2015-2017
The Spark kingdom and
the spreadsheet wars
Kafka/Spark
Complex ETL
Create new departments for data
governance
Spreadsheet all the things
Streaming is hard
My schemas have evolved
I cannot query old and new
data together
My cluster is running old
versions. Upgrading is hard
I want to use ML
2017-2018
The myth of DataOps
Kafka/Flink (JAVA or Scala
required)
Complex ETL with a pinch of
ML
Apache Atlas
Commercial distributions
Some problems during all periods
• My team spends more time maintaining the cluster than adding functionality
• Security and monitoring are hard
• Most of my time my cluster is sitting idle; Then it’s a bottleneck
• I don’t have the time to experiment
• Data preparation, cleansing, and basic transformations take a
disproportionally high amount of my time. And it’s so frustrating
Some simple things that scare me (and eat my productivity)
• Text encodings
• Empty strings. Literal ”NULL” strings
• Uppercase and Lowercase
• Date and time formats: which date would you say this is 1/4/19? And this? 1553589297
• CSV, especially if uploaded by end users
• JSON files with a single array and 200.000 records inside
• The same JSON file when row 176.543 has a column never seen before
• The same JSON file when all the numbers are strings
• XML
The downfall of the data engineer
Watching paint dry is exciting in comparison to writing and maintaining Extract
Transform and Load (ETL) logic. Most ETL jobs take a long time to execute and errors
or issues tend to happen at runtime or are post-runtime assertions. Since the
development time to execution time ratio is typically low, being productive means
juggling with multiple pipelines at once and inherently doing a lot of context
switching. By the time one of your 5 running “big data jobs” has finished, you have to
get back in the mind space you were in many hours ago and craft your next iteration.
Depending on how caffeinated you are, how long it’s been since the last iteration, and
how systematic you are, you may fail at restoring the full context in your short term
memory. This leads to systemic, stupid errors that waste hours.
“
”Maxime Beauchemin, Data engineer extraordinaire at Lyft, creator of Apache Airflow and Apache Superset.
Ex-Facebook, Ex-Yahoo!, Ex-Airbnb
https://medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b
A data lake is a centralized repository that allows
you to store all your structured and unstructured
data at any scale.
Modern data analytics 101
A good data lake allows self-service and can
easily plug-in new analytical engines.
A Possible Open Source solution
• Hadoop Cluster (static/multi tenant)
• Apache NiFi for ingestion workflows
• Sqoop to ingest data from RDBMS
• HDFS to store the data (tied to the Hadoop cluster)
• Hive/HCatalog for data Catalog
• Apache Atlas for a more human data catalog and governance
• Apache Spark for complex ETL –with Apache Livy for REST
• Hive for batch workloads with SQL
• Presto for interactive queries with SQL
• Kafka for streaming ingest
• Apache Spark/Apache Flink for streaming analytics
• Apache Hbase (or maybe Cassandra) to store streaming data
• Apache Phoenix to run SQL queries on top of Hbase
• Prometheus (or fluentd/collectd/ganglia/Nagios…) for logs and monitoring. Maybe with Elastic Search/Kibana
• Airflow/Oozie to schedule workflows
• Superset for business dashboards
• Jupyter/JupyterHub/Zeppelin for data science
• Security (Apache Sentry for Roles, Ranger for configuration, Knox as a firewall)
• YARN to coordinate resources
• Ambari for cluster administration
• Terraform/chef/puppet for provisioning
Or a cloud native Solution on AWS
Amazon
DynamoDB
Amazon Elasticsearch
Service
AWS
AppSync
Amazon
API Gateway
Amazon
Cognito
AWS
KMS
AWS
CloudTrail
AWS
IAM
Amazon
CloudWatch
AWS
Snowball
AWS Storage
Gateway
Amazon
Kinesis Data
Firehose
AWS Direct
Connect
AWS Database
Migration
Service
Amazon
Athena
Amazon
EMR
AWS
Glue
Amazon
Redshift
Amazon
DynamoDB
Amazon
QuickSight
Amazon
Kinesis
Amazon
Elasticsearch
Service
Amazon
Neptune
Amazon
RDS
AWS
Glue
More data lakes & analytics on AWS than anywhere else
Data Movement From On-premises Datacenters
AWS Snowball,
Snowball Edge and
Snowmobile
Petabyte and Exabyte-
scale data transport
solution that uses secure
appliances to transfer
large amounts of data
into and out of the AWS
cloud
AWS Direct Connect
Establish a dedicated
network connection from
your premises to AWS;
reduces your network
costs, increase bandwidth
throughput, and provide a
more consistent network
experience than Internet-
based connections
AWS Storage
Gateway
Lets your on-premises
applications to use AWS
for storage; includes a
highly-optimized data
transfer mechanism,
bandwidth management,
along with local cache
AWS Database
Migration Service
Migrate database from
the most widely-used
commercial and open-
source offerings to AWS
quickly and securely with
minimal downtime to
applications
Data Movement From Real-time Sources
Amazon Kinesis
Video Streams
Securely stream video
from connected devices
to AWS for analytics,
machine learning (ML),
and other processing
Amazon Kinesis Data
Firehose
Capture, transform, and
load data streams into
AWS data stores for near
real-time analytics with
existing business
intelligence tools.
Amazon Kinesis Data
Streams
Build custom, real-time
applications that process
data streams using
popular stream
processing frameworks
AWS IoT Core
Supports billions of
devices and trillions of
messages, and can
process and route those
messages to AWS
endpoints and to other
devices reliably and
securely
Managed Streaming
For Kafka
Fully managed open-
source platform for
building real-time
streaming data pipelines
and applications.
Amazon S3—Object Storage
Security and
Compliance
Three different forms of
encryption; encrypts data
in transit when
replicating across regions;
log and monitor with
CloudTrail, use ML to
discover and protect
sensitive data with Macie
Flexible Management
Classify, report, and
visualize data usage
trends; objects can be
tagged to see storage
consumption, cost, and
security; build lifecycle
policies to automate
tiering, and retention
Durability, Availability
& Scalability
Built for eleven nine’s of
durability; data
distributed across 3
physical facilities in an
AWS region;
automatically replicated
to any other AWS region
Query in Place
Run analytics & ML on
data lake without data
movement; S3 Select can
retrieve subset of data,
improving analytics
performance by 400%
Amazon Glacier—Backup and Archive
Durability, Availability
& Scalability
Built for eleven nine’s of
durability; data
distributed across 3
physical facilities in an
AWS region;
automatically replicated
to any other AWS region
Secure
Log and monitor with
CloudTrail, Vault Lock
enables WORM storage
capabilities, helping
satisfy compliance
requirements
Retrieves data in
minutes
Three retrieval options to
fit your use case;
expedited retrievals with
Glacier Select can return
data in minutes
Inexpensive
Lowest cost AWS object
storage class, allowing
you to archive large
amounts of data at a very
low cost
$
Data Preparation Accounts for ~80% of the Work
Building training sets
Cleaning and organizing data
Collecting data sets
Mining data for patterns
Refining algorithms
Other
Use AWS Glue to cleanse, prep, and catalog
AWS Glue Data Catalog - a single view
across your data lake
Automatically discovers data and stores schema
Makes data searchable, and available for ETL
Contains table definitions and custom metadata
Use AWS Glue ETL jobs to cleanse,
transform, and store processed data
Serverless Apache Spark environment
Use Glue ETL libraries or bring your own code
Write code in Python or Scala
Call any AWS API using the AWS boto3 SDK
Amazon S3
(Raw data)
Amazon S3
(Staging
data)
Amazon S3
(Processed data)
AWS Glue Data Catalog
Crawlers Crawlers Crawlers
Data Lakes, Analytics, and ML Portfolio from AWS
Broadest, deepest set of analytic services
Amazon SageMaker
AWS Deep Learning AMIs
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Elasticsearch service
Amazon Kinesis
Amazon QuickSight
Analytics
Machine Learning
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS Storage Gateway
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Real-time
Data Movement
On-premises
Data Movement
Data Lake on AWS
Storage | Archival Storage | Data Catalog
Amazon EMR—Big Data Processing
Low cost
Flexible billing with per-
second billing, EC2 spot,
reserved instances and
auto-scaling to reduce
costs 50–80%
$
Easy
Launch fully managed
Hadoop & Spark in
minutes; no cluster
setup, node provisioning,
cluster tuning
Latest versions
Updated with the latest
open source frameworks
within 30 days of release
Use S3 storage
Process data directly in
the S3 data lake securely
with high performance
using the EMRFS
connector
Data Lake
100110000100101011100
101010111001010100000
111100101100101010001
100001
Amazon EMR— More than just managed Hadoop
Amazon Redshift—Data Warehousing
Fast at scale
Columnar storage
technology to improve
I/O efficiency and scale
query performance
Secure
Audit everything; encrypt
data end-to-end;
extensive certification
and compliance
Open file formats
Analyze optimized data
formats on the latest
SSD, and all open data
formats in Amazon S3
Inexpensive
As low as $1,000 per
terabyte per year, 1/10th
the cost of traditional
data warehouse
solutions; start at $0.25
per hour
$
Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in S3 data lake
S3 data lakeRedshift data
Redshift Spectrum
query engine • Exabyte Redshift SQL queries against S3
• Join data across Redshift and S3
• Scale compute and storage separately
• Stable query performance and unlimited concurrency
• CSV, ORC, Avro, & Parquet data formats
• Pay only for the amount of data scanned
Let’s play a game
Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017
https://youtu.be/RpPf38L0HHU?t=3963
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Let’s play a game
Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017
https://youtu.be/RpPf38L0HHU?t=3963
Numbers are fun
Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017
https://youtu.be/RpPf38L0HHU?t=3963
Numbers are fun
Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017
https://youtu.be/RpPf38L0HHU?t=3963
Amazon Athena—Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
Query Instantly
Zero setup cost; just
point to S3 and
start querying
SQL
Open
ANSI SQL interface,
JDBC/ODBC drivers,
multiple formats,
compression types,
and complex joins and
data types
Easy
Serverless: zero
infrastructure, zero
administration
Integrated with
QuickSight
Pay per query
Pay only for queries
run; save 30–90% on
per-query costs
through compression
$
Amazon QuickSight
easy
Empower
everyone
Seamless
connectivity
Fast analysis Serverless
Now with ML superpowers!
AWS Provides Highest Levels of Security
Secure
Compliance
AWS Artifact
Amazon Inspector
Amazon Cloud HSM
Amazon Cognito
AWS CloudTrail
Security
Amazon GuardDuty
AWS Shield
AWS WAF
Amazon Macie
VPC
Encryption
AWS Certification Manager
AWS Key Management
Service
Encryption at rest
Encryption in transit
Bring your own keys, HSM
support
Identity
AWS IAM
AWS SSO
Amazon Cloud Directory
AWS Directory Service
AWS Organizations
Customer need to have multiple levels of security, identity and access management,
encryption, and compliance to secure their data lake
Compliance: Virtually Every Regulatory Agency
CSA
Cloud Security
Alliance Controls
ISO 9001
Global Quality
Standard
ISO 27001
Security Management
Controls
ISO 27017
Cloud Specific
Controls
ISO 27018
Personal Data
Protection
PCI DSS Level 1
Payment Card
Standards
SOC 1
Audit Controls
Report
SOC 2
Security, Availability, &
Confidentiality Report
SOC 3
General Controls
Report
Global United States
CJIS
Criminal Justice
Information Services
DoD SRG
DoD Data
Processing
FedRAMP
Government Data
Standards
FERPA
Educational
Privacy Act
FIPS
Government Security
Standards
FISMA
Federal Information
Security Management
GxP
Quality Guidelines
and Regulations
ISO FFIEC
Financial Institutions
Regulation
HIPPA
Protected Health
Information
ITAR
International Arms
Regulations
MPAA
Protected Media
Content
NIST
National Institute of
Standards and Technology
SEC Rule 17a-4(f)
Financial Data
Standards
VPAT/Section 508
Accountability
Standards
Asia Pacific
FISC [Japan]
Financial Industry
Information Systems
IRAP [Australia]
Australian Security
Standards
K-ISMS [Korea]
Korean Information
Security
MTCS Tier 3 [Singapore]
Multi-Tier Cloud
Security Standard
My Number Act [Japan]
Personal Information
Protection
Europe
C5 [Germany]
Operational Security
Attestation
Cyber Essentials
Plus [UK]
Cyber Threat
Protection
G-Cloud [UK]
UK Government
Standards
IT-Grundschutz
[Germany]
Baseline Protection
Methodology
X P
G
CHALLENGE
Need to create constant feedback loop
for designers
Gain up-to-the-minute understanding
of gamer satisfaction to guarantee
gamers are engaged, thus resulting in
the most popular game played in the
world
Fortnite | 125+ million players
Epic Games uses Data Lakes and analytics
Entire analytics platform running on AWS
S3 leveraged as a Data Lake
All telemetry data is collected with Kinesis
Real-time analytics done through Spark on EMR,
DynamoDB to create scoreboards and real-time queries
Use Amazon EMR for large batch data processing
Game designers use data to inform their decisions
Game
clients
Game
servers
Launcher
Game
services
N E A R R E A L T I M E P I P E L I N E
N E A R R E A L T I M E P I P E L I N E
Grafana
Scoreboards API
Limited Raw Data
(real time ad-hoc SQL)
User ETL
(metric definition)
Spark on EMR DynamoDB
NEAR REALTIME PIPELINES
BATCH PIPELINES
ETL using
EMR
Tableau/BI
Ad-hoc SQLS3
(Data Lake)
Kinesis
APIs
Databases
S3
Other
sources
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Extra ball: Lake Formation
Building data lakes can still take months
Typical steps of building a data lake
Setup Storage1
Move data2
Cleanse, prep, and
catalog data
3
Configure and enforce
security and compliance
policies
4
Make data available
for analytics
5
How it works: AWS Lake Formation
S3
IAM KMS
OLTP
ERP
CRM
LOB
Devices
Web
Sensors
Social Kinesis
Build Data Lakes quickly
• Identify, crawl, and catalog sources
• Ingest and clean data
• Transform into optimal formats
Simplify security management
• Enforce encryption
• Define access policies
• Implement audit login
Enable self-service and combined analytics
• Analysts discover all data available for analysis
from a single data catalog
• Use multiple analytics tools over the same data
Athena
Amazon
Redshift
AI Services
Amazon
EMR
Amazon
QuickSight
Data
Catalog
Customer interest in AWS Lake Formation
“We are very excited about the launch of AWS Lake
Formation, which provides a central point of control to
easily load, clean, secure, and catalog data from thousands of
clients to our AWS-based data lake, dramatically reducing
our operational load. … Additionally, AWS Lake Formation
will be HIPAA compliant from day one …”
- Aaron Symanski, CTO, Change Healthcare
“I can’t wait for my team to get our hands on AWS Lake
Formation. With an enterprise-ready option like Lake
Formation, we will be able to spend more time deriving
value from our data rather than doing the heavy lifting
involved in manually setting up and managing our data lake.”
- Joshua Couch, VP Engineering, Fender Digital
Javier Ramirez
@supercoco9
Gracias
Síguenos en twitter: https://twitter.com/awscloud_es
Webinars y eventos en: https://aws.amazon.com/es/about-aws/events/eventos-es/
Contacto: https://aws.amazon.com/es/contact-us/
Noticias y novedades: https://aws.amazon.com/es/new
No olvides rellenar la encuesta que te
enviaremos para ayudarnos a mejorar

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Data Lake na área da saúde- AWS
Data Lake na área da saúde- AWSData Lake na área da saúde- AWS
Data Lake na área da saúde- AWS
 
Building Data Lakes and Analytics on AWS. IPExpo Manchester.
Building Data Lakes and Analytics on AWS. IPExpo Manchester.Building Data Lakes and Analytics on AWS. IPExpo Manchester.
Building Data Lakes and Analytics on AWS. IPExpo Manchester.
 
Building-a-Modern-Data-Platform-in-the-Cloud.pdf
Building-a-Modern-Data-Platform-in-the-Cloud.pdfBuilding-a-Modern-Data-Platform-in-the-Cloud.pdf
Building-a-Modern-Data-Platform-in-the-Cloud.pdf
 
Migrating database to cloud
Migrating database to cloudMigrating database to cloud
Migrating database to cloud
 
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
 
Log Analytics with AWS
Log Analytics with AWSLog Analytics with AWS
Log Analytics with AWS
 
Amazon Elastic Container Service for Kubernetes (Amazon EKS)
Amazon Elastic Container Service for Kubernetes (Amazon EKS)Amazon Elastic Container Service for Kubernetes (Amazon EKS)
Amazon Elastic Container Service for Kubernetes (Amazon EKS)
 
Building low latency apps with a serverless architecture and in-memory data I...
Building low latency apps with a serverless architecture and in-memory data I...Building low latency apps with a serverless architecture and in-memory data I...
Building low latency apps with a serverless architecture and in-memory data I...
 
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWSBuilding-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
 
Easy Rider: How ML, Serverless, and IoT Drive Mobility as a Service (AMT302) ...
Easy Rider: How ML, Serverless, and IoT Drive Mobility as a Service (AMT302) ...Easy Rider: How ML, Serverless, and IoT Drive Mobility as a Service (AMT302) ...
Easy Rider: How ML, Serverless, and IoT Drive Mobility as a Service (AMT302) ...
 
No Hassle NoSQL - Amazon DynamoDB & Amazon DocumentDB | AWS Summit Tel Aviv ...
 No Hassle NoSQL - Amazon DynamoDB & Amazon DocumentDB | AWS Summit Tel Aviv ... No Hassle NoSQL - Amazon DynamoDB & Amazon DocumentDB | AWS Summit Tel Aviv ...
No Hassle NoSQL - Amazon DynamoDB & Amazon DocumentDB | AWS Summit Tel Aviv ...
 
Creare e gestire Data Lake e Data Warehouses
Creare e gestire Data Lake e Data WarehousesCreare e gestire Data Lake e Data Warehouses
Creare e gestire Data Lake e Data Warehouses
 
Construindo data lakes e analytics com AWS
Construindo data lakes e analytics com AWSConstruindo data lakes e analytics com AWS
Construindo data lakes e analytics com AWS
 
AWS Webinar Series | Using Virtual Reality to bring your ChatBot to life
AWS Webinar Series | Using Virtual Reality to bring your ChatBot to life AWS Webinar Series | Using Virtual Reality to bring your ChatBot to life
AWS Webinar Series | Using Virtual Reality to bring your ChatBot to life
 
Modern Data Platform on AWS
Modern Data Platform on AWSModern Data Platform on AWS
Modern Data Platform on AWS
 
How to go from zero to data lakes in days - ADB202 - New York AWS Summit
How to go from zero to data lakes in days - ADB202 - New York AWS SummitHow to go from zero to data lakes in days - ADB202 - New York AWS Summit
How to go from zero to data lakes in days - ADB202 - New York AWS Summit
 
Leveraging Data Analytics in the Cloud to Support Data-Driven Decisions
Leveraging Data Analytics in the Cloud to Support Data-Driven DecisionsLeveraging Data Analytics in the Cloud to Support Data-Driven Decisions
Leveraging Data Analytics in the Cloud to Support Data-Driven Decisions
 
Need for Speed – Intro To Real-Time Data Streaming Analytics on AWS | AWS Sum...
Need for Speed – Intro To Real-Time Data Streaming Analytics on AWS | AWS Sum...Need for Speed – Intro To Real-Time Data Streaming Analytics on AWS | AWS Sum...
Need for Speed – Intro To Real-Time Data Streaming Analytics on AWS | AWS Sum...
 
Operationalizing Microsoft Workloads (WIN320) - AWS re:Invent 2018
Operationalizing Microsoft Workloads (WIN320) - AWS re:Invent 2018Operationalizing Microsoft Workloads (WIN320) - AWS re:Invent 2018
Operationalizing Microsoft Workloads (WIN320) - AWS re:Invent 2018
 
Microservices on AWS: Architectural Patterns and Best Practices | AWS Summit ...
Microservices on AWS: Architectural Patterns and Best Practices | AWS Summit ...Microservices on AWS: Architectural Patterns and Best Practices | AWS Summit ...
Microservices on AWS: Architectural Patterns and Best Practices | AWS Summit ...
 

Ähnlich wie Big Data, Ingeniería de datos, y Data Lakes en AWS

Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Amazon Web Services
 

Ähnlich wie Big Data, Ingeniería de datos, y Data Lakes en AWS (20)

Building a Modern Data Platform in the Cloud. AWS Initiate Portugal
Building a Modern Data Platform in the Cloud. AWS Initiate PortugalBuilding a Modern Data Platform in the Cloud. AWS Initiate Portugal
Building a Modern Data Platform in the Cloud. AWS Initiate Portugal
 
AWS Big Data Landscape
AWS Big Data LandscapeAWS Big Data Landscape
AWS Big Data Landscape
 
From raw data to business insights. A modern data lake
From raw data to business insights. A modern data lakeFrom raw data to business insights. A modern data lake
From raw data to business insights. A modern data lake
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
 
Building a modern data platform on AWS. Utrecht AWS Dev Day
Building a modern data platform on AWS. Utrecht AWS Dev DayBuilding a modern data platform on AWS. Utrecht AWS Dev Day
Building a modern data platform on AWS. Utrecht AWS Dev Day
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Building_a_Modern_Data_Platform_in_the_Cloud.pdf
Building_a_Modern_Data_Platform_in_the_Cloud.pdfBuilding_a_Modern_Data_Platform_in_the_Cloud.pdf
Building_a_Modern_Data_Platform_in_the_Cloud.pdf
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
 
Building a modern data platform in the cloud. AWS DevDay Nordics
Building a modern data platform in the cloud. AWS DevDay NordicsBuilding a modern data platform in the cloud. AWS DevDay Nordics
Building a modern data platform in the cloud. AWS DevDay Nordics
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 

Mehr von javier ramirez

Mehr von javier ramirez (20)

¿Se puede vivir del open source? T3chfest
¿Se puede vivir del open source? T3chfest¿Se puede vivir del open source? T3chfest
¿Se puede vivir del open source? T3chfest
 
QuestDB: The building blocks of a fast open-source time-series database
QuestDB: The building blocks of a fast open-source time-series databaseQuestDB: The building blocks of a fast open-source time-series database
QuestDB: The building blocks of a fast open-source time-series database
 
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
Deduplicating and analysing time-series data with Apache Beam and QuestDB
Deduplicating and analysing time-series data with Apache Beam and QuestDBDeduplicating and analysing time-series data with Apache Beam and QuestDB
Deduplicating and analysing time-series data with Apache Beam and QuestDB
 
Your Database Cannot Do this (well)
Your Database Cannot Do this (well)Your Database Cannot Do this (well)
Your Database Cannot Do this (well)
 
Your Timestamps Deserve Better than a Generic Database
Your Timestamps Deserve Better than a Generic DatabaseYour Timestamps Deserve Better than a Generic Database
Your Timestamps Deserve Better than a Generic Database
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
QuestDB-Community-Call-20220728
QuestDB-Community-Call-20220728QuestDB-Community-Call-20220728
QuestDB-Community-Call-20220728
 
Processing and analysing streaming data with Python. Pycon Italy 2022
Processing and analysing streaming  data with Python. Pycon Italy 2022Processing and analysing streaming  data with Python. Pycon Italy 2022
Processing and analysing streaming data with Python. Pycon Italy 2022
 
QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...
 
Servicios e infraestructura de AWS y la próxima región en Aragón
Servicios e infraestructura de AWS y la próxima región en AragónServicios e infraestructura de AWS y la próxima región en Aragón
Servicios e infraestructura de AWS y la próxima región en Aragón
 
Primeros pasos en desarrollo serverless
Primeros pasos en desarrollo serverlessPrimeros pasos en desarrollo serverless
Primeros pasos en desarrollo serverless
 
How AWS is reinventing the cloud
How AWS is reinventing the cloudHow AWS is reinventing the cloud
How AWS is reinventing the cloud
 
Analitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAMAnalitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAM
 
Getting started with streaming analytics
Getting started with streaming analyticsGetting started with streaming analytics
Getting started with streaming analytics
 
Getting started with streaming analytics: Setting up a pipeline
Getting started with streaming analytics: Setting up a pipelineGetting started with streaming analytics: Setting up a pipeline
Getting started with streaming analytics: Setting up a pipeline
 
Monitorización de seguridad y detección de amenazas con AWS
Monitorización de seguridad y detección de amenazas con AWSMonitorización de seguridad y detección de amenazas con AWS
Monitorización de seguridad y detección de amenazas con AWS
 
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
 
Recomendaciones, predicciones y detección de fraude usando servicios de intel...
Recomendaciones, predicciones y detección de fraude usando servicios de intel...Recomendaciones, predicciones y detección de fraude usando servicios de intel...
Recomendaciones, predicciones y detección de fraude usando servicios de intel...
 

Kürzlich hochgeladen

Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Kürzlich hochgeladen (20)

Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 

Big Data, Ingeniería de datos, y Data Lakes en AWS

  • 1. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Javier Ramirez @supercoco9 Technical Evangelist Amazon Web Services Big Data, Ingeniería de datos, y Data Lakes en AWS
  • 2. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Solution My reports make my database server very slow Before 2009 The DBA years Overnight DB dump Read-only replica My data doesn’t fit in one machine And it’s not only transactional 2009-2011 The Hadoop epiphany Hadoop Map/Reduce all the things My data is very fast Map/Reduce is hard to use 2012-2014 The Message Broker and NoSQL Age Kafka/RabbitMQ Cassandra/HBASE /STORM Basic ETL Hive Duplicating batch/stream is inefficient I need to cleanse my source data Hadoop ecosystem is hard to manage My data scientists don’t like JAVA I am not sure which data we are already processing 2015-2017 The Spark kingdom and the spreadsheet wars Kafka/Spark Complex ETL Create new departments for data governance Spreadsheet all the things Streaming is hard My schemas have evolved I cannot query old and new data together My cluster is running old versions. Upgrading is hard I want to use ML 2017-2018 The myth of DataOps Kafka/Flink (JAVA or Scala required) Complex ETL with a pinch of ML Apache Atlas Commercial distributions
  • 3. Some problems during all periods • My team spends more time maintaining the cluster than adding functionality • Security and monitoring are hard • Most of my time my cluster is sitting idle; Then it’s a bottleneck • I don’t have the time to experiment • Data preparation, cleansing, and basic transformations take a disproportionally high amount of my time. And it’s so frustrating
  • 4. Some simple things that scare me (and eat my productivity) • Text encodings • Empty strings. Literal ”NULL” strings • Uppercase and Lowercase • Date and time formats: which date would you say this is 1/4/19? And this? 1553589297 • CSV, especially if uploaded by end users • JSON files with a single array and 200.000 records inside • The same JSON file when row 176.543 has a column never seen before • The same JSON file when all the numbers are strings • XML
  • 5. The downfall of the data engineer Watching paint dry is exciting in comparison to writing and maintaining Extract Transform and Load (ETL) logic. Most ETL jobs take a long time to execute and errors or issues tend to happen at runtime or are post-runtime assertions. Since the development time to execution time ratio is typically low, being productive means juggling with multiple pipelines at once and inherently doing a lot of context switching. By the time one of your 5 running “big data jobs” has finished, you have to get back in the mind space you were in many hours ago and craft your next iteration. Depending on how caffeinated you are, how long it’s been since the last iteration, and how systematic you are, you may fail at restoring the full context in your short term memory. This leads to systemic, stupid errors that waste hours. “ ”Maxime Beauchemin, Data engineer extraordinaire at Lyft, creator of Apache Airflow and Apache Superset. Ex-Facebook, Ex-Yahoo!, Ex-Airbnb https://medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b
  • 6. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. Modern data analytics 101
  • 7. A good data lake allows self-service and can easily plug-in new analytical engines.
  • 8. A Possible Open Source solution • Hadoop Cluster (static/multi tenant) • Apache NiFi for ingestion workflows • Sqoop to ingest data from RDBMS • HDFS to store the data (tied to the Hadoop cluster) • Hive/HCatalog for data Catalog • Apache Atlas for a more human data catalog and governance • Apache Spark for complex ETL –with Apache Livy for REST • Hive for batch workloads with SQL • Presto for interactive queries with SQL • Kafka for streaming ingest • Apache Spark/Apache Flink for streaming analytics • Apache Hbase (or maybe Cassandra) to store streaming data • Apache Phoenix to run SQL queries on top of Hbase • Prometheus (or fluentd/collectd/ganglia/Nagios…) for logs and monitoring. Maybe with Elastic Search/Kibana • Airflow/Oozie to schedule workflows • Superset for business dashboards • Jupyter/JupyterHub/Zeppelin for data science • Security (Apache Sentry for Roles, Ranger for configuration, Knox as a firewall) • YARN to coordinate resources • Ambari for cluster administration • Terraform/chef/puppet for provisioning
  • 9. Or a cloud native Solution on AWS Amazon DynamoDB Amazon Elasticsearch Service AWS AppSync Amazon API Gateway Amazon Cognito AWS KMS AWS CloudTrail AWS IAM Amazon CloudWatch AWS Snowball AWS Storage Gateway Amazon Kinesis Data Firehose AWS Direct Connect AWS Database Migration Service Amazon Athena Amazon EMR AWS Glue Amazon Redshift Amazon DynamoDB Amazon QuickSight Amazon Kinesis Amazon Elasticsearch Service Amazon Neptune Amazon RDS AWS Glue
  • 10. More data lakes & analytics on AWS than anywhere else
  • 11. Data Movement From On-premises Datacenters AWS Snowball, Snowball Edge and Snowmobile Petabyte and Exabyte- scale data transport solution that uses secure appliances to transfer large amounts of data into and out of the AWS cloud AWS Direct Connect Establish a dedicated network connection from your premises to AWS; reduces your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet- based connections AWS Storage Gateway Lets your on-premises applications to use AWS for storage; includes a highly-optimized data transfer mechanism, bandwidth management, along with local cache AWS Database Migration Service Migrate database from the most widely-used commercial and open- source offerings to AWS quickly and securely with minimal downtime to applications
  • 12.
  • 13. Data Movement From Real-time Sources Amazon Kinesis Video Streams Securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing Amazon Kinesis Data Firehose Capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools. Amazon Kinesis Data Streams Build custom, real-time applications that process data streams using popular stream processing frameworks AWS IoT Core Supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely Managed Streaming For Kafka Fully managed open- source platform for building real-time streaming data pipelines and applications.
  • 14. Amazon S3—Object Storage Security and Compliance Three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail, use ML to discover and protect sensitive data with Macie Flexible Management Classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering, and retention Durability, Availability & Scalability Built for eleven nine’s of durability; data distributed across 3 physical facilities in an AWS region; automatically replicated to any other AWS region Query in Place Run analytics & ML on data lake without data movement; S3 Select can retrieve subset of data, improving analytics performance by 400%
  • 15. Amazon Glacier—Backup and Archive Durability, Availability & Scalability Built for eleven nine’s of durability; data distributed across 3 physical facilities in an AWS region; automatically replicated to any other AWS region Secure Log and monitor with CloudTrail, Vault Lock enables WORM storage capabilities, helping satisfy compliance requirements Retrieves data in minutes Three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes Inexpensive Lowest cost AWS object storage class, allowing you to archive large amounts of data at a very low cost $
  • 16. Data Preparation Accounts for ~80% of the Work Building training sets Cleaning and organizing data Collecting data sets Mining data for patterns Refining algorithms Other
  • 17. Use AWS Glue to cleanse, prep, and catalog AWS Glue Data Catalog - a single view across your data lake Automatically discovers data and stores schema Makes data searchable, and available for ETL Contains table definitions and custom metadata Use AWS Glue ETL jobs to cleanse, transform, and store processed data Serverless Apache Spark environment Use Glue ETL libraries or bring your own code Write code in Python or Scala Call any AWS API using the AWS boto3 SDK Amazon S3 (Raw data) Amazon S3 (Staging data) Amazon S3 (Processed data) AWS Glue Data Catalog Crawlers Crawlers Crawlers
  • 18. Data Lakes, Analytics, and ML Portfolio from AWS Broadest, deepest set of analytic services Amazon SageMaker AWS Deep Learning AMIs Amazon Rekognition Amazon Lex AWS DeepLens Amazon Comprehend Amazon Translate Amazon Transcribe Amazon Polly Amazon Athena Amazon EMR Amazon Redshift Amazon Elasticsearch service Amazon Kinesis Amazon QuickSight Analytics Machine Learning AWS Direct Connect AWS Snowball AWS Snowmobile AWS Database Migration Service AWS Storage Gateway AWS IoT Core Amazon Kinesis Data Firehose Amazon Kinesis Data Streams Amazon Kinesis Video Streams Real-time Data Movement On-premises Data Movement Data Lake on AWS Storage | Archival Storage | Data Catalog
  • 19. Amazon EMR—Big Data Processing Low cost Flexible billing with per- second billing, EC2 spot, reserved instances and auto-scaling to reduce costs 50–80% $ Easy Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, cluster tuning Latest versions Updated with the latest open source frameworks within 30 days of release Use S3 storage Process data directly in the S3 data lake securely with high performance using the EMRFS connector Data Lake 100110000100101011100 101010111001010100000 111100101100101010001 100001
  • 20.
  • 21. Amazon EMR— More than just managed Hadoop
  • 22. Amazon Redshift—Data Warehousing Fast at scale Columnar storage technology to improve I/O efficiency and scale query performance Secure Audit everything; encrypt data end-to-end; extensive certification and compliance Open file formats Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3 Inexpensive As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour $
  • 23. Amazon Redshift Spectrum Extend the data warehouse to exabytes of data in S3 data lake S3 data lakeRedshift data Redshift Spectrum query engine • Exabyte Redshift SQL queries against S3 • Join data across Redshift and S3 • Scale compute and storage separately • Stable query performance and unlimited concurrency • CSV, ORC, Avro, & Parquet data formats • Pay only for the amount of data scanned
  • 24. Let’s play a game Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017 https://youtu.be/RpPf38L0HHU?t=3963 © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Let’s play a game Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017 https://youtu.be/RpPf38L0HHU?t=3963
  • 25. Numbers are fun Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017 https://youtu.be/RpPf38L0HHU?t=3963
  • 26. Numbers are fun Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017 https://youtu.be/RpPf38L0HHU?t=3963
  • 27. Amazon Athena—Interactive Analysis Interactive query service to analyze data in Amazon S3 using standard SQL No infrastructure to set up or manage and no data to load Ability to run SQL queries on data archived in Amazon Glacier (coming soon) Query Instantly Zero setup cost; just point to S3 and start querying SQL Open ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types Easy Serverless: zero infrastructure, zero administration Integrated with QuickSight Pay per query Pay only for queries run; save 30–90% on per-query costs through compression $
  • 29. AWS Provides Highest Levels of Security Secure Compliance AWS Artifact Amazon Inspector Amazon Cloud HSM Amazon Cognito AWS CloudTrail Security Amazon GuardDuty AWS Shield AWS WAF Amazon Macie VPC Encryption AWS Certification Manager AWS Key Management Service Encryption at rest Encryption in transit Bring your own keys, HSM support Identity AWS IAM AWS SSO Amazon Cloud Directory AWS Directory Service AWS Organizations Customer need to have multiple levels of security, identity and access management, encryption, and compliance to secure their data lake
  • 30. Compliance: Virtually Every Regulatory Agency CSA Cloud Security Alliance Controls ISO 9001 Global Quality Standard ISO 27001 Security Management Controls ISO 27017 Cloud Specific Controls ISO 27018 Personal Data Protection PCI DSS Level 1 Payment Card Standards SOC 1 Audit Controls Report SOC 2 Security, Availability, & Confidentiality Report SOC 3 General Controls Report Global United States CJIS Criminal Justice Information Services DoD SRG DoD Data Processing FedRAMP Government Data Standards FERPA Educational Privacy Act FIPS Government Security Standards FISMA Federal Information Security Management GxP Quality Guidelines and Regulations ISO FFIEC Financial Institutions Regulation HIPPA Protected Health Information ITAR International Arms Regulations MPAA Protected Media Content NIST National Institute of Standards and Technology SEC Rule 17a-4(f) Financial Data Standards VPAT/Section 508 Accountability Standards Asia Pacific FISC [Japan] Financial Industry Information Systems IRAP [Australia] Australian Security Standards K-ISMS [Korea] Korean Information Security MTCS Tier 3 [Singapore] Multi-Tier Cloud Security Standard My Number Act [Japan] Personal Information Protection Europe C5 [Germany] Operational Security Attestation Cyber Essentials Plus [UK] Cyber Threat Protection G-Cloud [UK] UK Government Standards IT-Grundschutz [Germany] Baseline Protection Methodology X P G
  • 31. CHALLENGE Need to create constant feedback loop for designers Gain up-to-the-minute understanding of gamer satisfaction to guarantee gamers are engaged, thus resulting in the most popular game played in the world Fortnite | 125+ million players
  • 32. Epic Games uses Data Lakes and analytics Entire analytics platform running on AWS S3 leveraged as a Data Lake All telemetry data is collected with Kinesis Real-time analytics done through Spark on EMR, DynamoDB to create scoreboards and real-time queries Use Amazon EMR for large batch data processing Game designers use data to inform their decisions Game clients Game servers Launcher Game services N E A R R E A L T I M E P I P E L I N E N E A R R E A L T I M E P I P E L I N E Grafana Scoreboards API Limited Raw Data (real time ad-hoc SQL) User ETL (metric definition) Spark on EMR DynamoDB NEAR REALTIME PIPELINES BATCH PIPELINES ETL using EMR Tableau/BI Ad-hoc SQLS3 (Data Lake) Kinesis APIs Databases S3 Other sources
  • 33. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Extra ball: Lake Formation
  • 34. Building data lakes can still take months
  • 35. Typical steps of building a data lake Setup Storage1 Move data2 Cleanse, prep, and catalog data 3 Configure and enforce security and compliance policies 4 Make data available for analytics 5
  • 36. How it works: AWS Lake Formation S3 IAM KMS OLTP ERP CRM LOB Devices Web Sensors Social Kinesis Build Data Lakes quickly • Identify, crawl, and catalog sources • Ingest and clean data • Transform into optimal formats Simplify security management • Enforce encryption • Define access policies • Implement audit login Enable self-service and combined analytics • Analysts discover all data available for analysis from a single data catalog • Use multiple analytics tools over the same data Athena Amazon Redshift AI Services Amazon EMR Amazon QuickSight Data Catalog
  • 37. Customer interest in AWS Lake Formation “We are very excited about the launch of AWS Lake Formation, which provides a central point of control to easily load, clean, secure, and catalog data from thousands of clients to our AWS-based data lake, dramatically reducing our operational load. … Additionally, AWS Lake Formation will be HIPAA compliant from day one …” - Aaron Symanski, CTO, Change Healthcare “I can’t wait for my team to get our hands on AWS Lake Formation. With an enterprise-ready option like Lake Formation, we will be able to spend more time deriving value from our data rather than doing the heavy lifting involved in manually setting up and managing our data lake.” - Joshua Couch, VP Engineering, Fender Digital
  • 38. Javier Ramirez @supercoco9 Gracias Síguenos en twitter: https://twitter.com/awscloud_es Webinars y eventos en: https://aws.amazon.com/es/about-aws/events/eventos-es/ Contacto: https://aws.amazon.com/es/contact-us/ Noticias y novedades: https://aws.amazon.com/es/new No olvides rellenar la encuesta que te enviaremos para ayudarnos a mejorar