Big Data, Data Engineering, and Data Lakes on AWS

© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Javier Ramirez
@supercoco9
Technical Evangelist
Amazon Web Services

Before 2009: The DBA years
Problem
• My reports make my database server very slow
Solution
• Overnight DB dump
• Read-only replica

2009-2011: The Hadoop epiphany
Problems
• My data doesn’t fit in one machine
• And it’s not only transactional
Solution
• Hadoop
• Map/Reduce all the things

2012-2014: The Message Broker and NoSQL Age
Problems
• My data is very fast
• Map/Reduce is hard to use
Solution
• Kafka/RabbitMQ
• Cassandra/HBase/Storm
• Basic ETL
• Hive

2015-2017: The Spark kingdom and the spreadsheet wars
Problems
• Duplicating batch/stream is inefficient
• I need to cleanse my source data
• Hadoop ecosystem is hard to manage
• My data scientists don’t like Java
• I am not sure which data we are already processing
Solution
• Kafka/Spark
• Complex ETL
• Create new departments for data governance
• Spreadsheet all the things

2017-2018: The myth of DataOps
Problems
• Streaming is hard
• My schemas have evolved
• I cannot query old and new data together
• My cluster is running old versions. Upgrading is hard
• I want to use ML
Solution
• Kafka/Flink (Java or Scala required)
• Complex ETL with a pinch of ML
• Apache Atlas
• Commercial distributions

Some problems during all periods
• My team spends more time maintaining the cluster than adding functionality
• Security and monitoring are hard
• Most of the time my cluster is sitting idle; then it’s a bottleneck
• I don’t have the time to experiment
• Data preparation, cleansing, and basic transformations take a disproportionately high amount of my time. And it’s so frustrating

Some simple things that scare me (and eat my productivity)
• Text encodings
• Empty strings. Literal “NULL” strings
• Uppercase and lowercase
• Date and time formats: which date would you say 1/4/19 is? And 1553589297?
• CSV, especially if uploaded by end users
• JSON files with a single array and 200,000 records inside
• The same JSON file when row 176,543 has a column never seen before
• The same JSON file when all the numbers are strings
• XML
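
A few of the items above can be made concrete with a small Python sketch. It is only an illustration, not a complete cleansing routine: the function names and the set of tokens treated as "missing" are assumptions made for this example, and real sources usually need their own rules.

```python
from datetime import datetime, timezone

# Assumption: these spellings mean "missing"; adjust per data source.
NULL_TOKENS = {"", "null", "none", "n/a"}


def clean_value(raw):
    """Normalize one raw field: trim it, map null-like tokens to None,
    and turn numeric strings into numbers."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text.lower() in NULL_TOKENS:
        return None
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def possible_dates(text):
    """Return every calendar date an ambiguous x/y/yy string could mean."""
    readings = {}
    for label, fmt in (("day_first", "%d/%m/%y"), ("month_first", "%m/%d/%y")):
        try:
            readings[label] = datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            pass  # this reading is not a valid calendar date
    return readings


def epoch_to_utc(seconds):
    """Interpret an integer as seconds since the Unix epoch, in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()


print(possible_dates("1/4/19"))   # two valid readings: 1 April or 4 January 2019
print(epoch_to_utc(1553589297))   # only meaningful if the unit really is seconds
```

Returning every valid reading instead of picking one keeps the ambiguity visible, so the decision (day first or month first) can be made per source rather than silently by the parser.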

The downfall of the data engineer
“Watching paint dry is exciting in comparison to writing and maintaining Extract
Transform and Load (ETL) logic. Most ETL jobs take a long time to execute and errors
or issues tend to happen at runtime or are post-runtime assertions. Since the
development time to execution time ratio is typically low, being productive means
juggling with multiple pipelines at once and inherently doing a lot of context
switching. By the time one of your 5 running “big data jobs” has finished, you have to
get back in the mind space you were in many hours ago and craft your next iteration.
Depending on how caffeinated you are, how long it’s been since the last iteration, and
how systematic you are, you may fail at restoring the full context in your short term
memory. This leads to systemic, stupid errors that waste hours.”
- Maxime Beauchemin, Data engineer extraordinaire at Lyft, creator of Apache Airflow and Apache Superset.
Ex-Facebook, Ex-Yahoo!, Ex-Airbnb
https://medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b

Modern data analytics 101
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.

A good data lake allows self-service and makes it easy to plug in new analytical engines.

A Possible Open Source solution
• Hadoop Cluster (static/multi tenant)
• Apache NiFi for ingestion workflows
• Sqoop to ingest data from RDBMS
• HDFS to store the data (tied to the Hadoop cluster)
• Hive/HCatalog for the data catalog
• Apache Atlas for a more human data catalog and governance
• Apache Spark for complex ETL, with Apache Livy for REST
• Hive for batch workloads with SQL
• Presto for interactive queries with SQL
• Kafka for streaming ingest
• Apache Spark/Apache Flink for streaming analytics
• Apache HBase (or maybe Cassandra) to store streaming data
• Apache Phoenix to run SQL queries on top of HBase
• Prometheus (or Fluentd/collectd/Ganglia/Nagios…) for logs and monitoring. Maybe with Elasticsearch/Kibana
• Airflow/Oozie to schedule workflows
• Superset for business dashboards
• Jupyter/JupyterHub/Zeppelin for data science
• Security (Apache Sentry for Roles, Ranger for configuration, Knox as a firewall)
• YARN to coordinate resources
• Ambari for cluster administration
• Terraform/Chef/Puppet for provisioning

Or a cloud native Solution on AWS
• Amazon DynamoDB
• Amazon Elasticsearch Service
• AWS AppSync
• Amazon API Gateway
• Amazon Cognito
• AWS KMS
• AWS CloudTrail
• AWS IAM
• Amazon CloudWatch
• AWS Snowball
• AWS Storage Gateway
• Amazon Kinesis Data Firehose
• AWS Direct Connect
• AWS Database Migration Service
• Amazon Athena
• Amazon EMR
• AWS Glue
• Amazon Redshift
• Amazon QuickSight
• Amazon Kinesis
• Amazon Neptune
• Amazon RDS

More data lakes & analytics on AWS than anywhere else

Data Movement From On-premises Datacenters
• AWS Snowball, Snowball Edge and Snowmobile: petabyte- and exabyte-scale data transport solution that uses secure appliances to transfer large amounts of data into and out of the AWS cloud
• AWS Direct Connect: establishes a dedicated network connection from your premises to AWS; reduces your network costs, increases bandwidth throughput, and provides a more consistent network experience than Internet-based connections
• AWS Storage Gateway: lets your on-premises applications use AWS for storage; includes a highly optimized data transfer mechanism and bandwidth management, along with a local cache
• AWS Database Migration Service: migrates databases from the most widely used commercial and open-source offerings to AWS quickly and securely with minimal downtime to applications

Data Movement From Real-time Sources
• Amazon Kinesis Video Streams: securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing
• Amazon Kinesis Data Firehose: capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools
• Amazon Kinesis Data Streams: build custom, real-time applications that process data streams using popular stream processing frameworks
• AWS IoT Core: supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely
• Amazon Managed Streaming for Apache Kafka (MSK): fully managed open-source platform for building real-time streaming data pipelines and applications
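
As an illustration of the Kinesis Data Firehose path, here is a minimal Python sketch of batching events into a delivery stream. It assumes the `firehose` argument behaves like `boto3.client("firehose")`; the helper names and the stream name in the usage note are hypothetical, and retry handling for rejected records is left out.

```python
import json


def to_firehose_records(events):
    """Serialize events as newline-delimited JSON, one Firehose record each."""
    return [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]


def send_events(firehose, stream_name, events, batch_size=500):
    """Send events in batches; returns how many records Firehose rejected.

    `firehose` is expected to behave like boto3.client("firehose").
    PutRecordBatch accepts at most 500 records per call.
    """
    failed = 0
    records = to_firehose_records(events)
    for start in range(0, len(records), batch_size):
        response = firehose.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=records[start:start + batch_size],
        )
        failed += response.get("FailedPutCount", 0)
    return failed
```

With real credentials, usage would look like `send_events(boto3.client("firehose"), "telemetry-stream", events)`, where `telemetry-stream` is a made-up delivery stream name. A production version would inspect the per-record responses and resend the rejected ones.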

Amazon S3—Object Storage
• Security and compliance: three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie
• Flexible management: classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention
• Durability, availability and scalability: built for eleven nines (99.999999999%) of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
• Query in place: run analytics and ML on the data lake without data movement; S3 Select can retrieve a subset of data, improving analytics performance by up to 400%
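
The lifecycle policies mentioned above can be sketched as follows. This is a hedged example, not a recommended policy: the helper names, the rule ID scheme, and the day thresholds are assumptions, and the `s3` argument is expected to behave like `boto3.client("s3")`.

```python
def lifecycle_rule(prefix, ia_after_days, glacier_after_days, expire_after_days):
    """Build one S3 lifecycle rule that tiers objects down, then deletes them."""
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("day thresholds must be strictly increasing")
    rule_name = prefix.strip("/").replace("/", "-") or "all"
    return {
        "ID": f"tier-and-expire-{rule_name}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            # Move to infrequent access first, then to archival storage.
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": expire_after_days},
    }


def apply_lifecycle(s3, bucket, rules):
    """Replace the bucket's lifecycle configuration with the given rules.

    `s3` is expected to behave like boto3.client("s3"). The call overwrites
    any lifecycle configuration already present on the bucket.
    """
    return s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules},
    )
```

For example, `lifecycle_rule("raw/", 30, 90, 365)` describes raw data that moves to infrequent access after a month, to Glacier after three months, and is deleted after a year.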

Amazon Glacier—Backup and Archive
• Durability, availability and scalability: built for eleven nines (99.999999999%) of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
• Secure: log and monitor with CloudTrail; Vault Lock enables WORM (write once, read many) storage capabilities, helping satisfy compliance requirements
• Retrieves data in minutes: three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes
• Inexpensive: lowest-cost AWS object storage class, allowing you to archive large amounts of data at a very low cost

Data Preparation Accounts for ~80% of the Work
Building training sets
Cleaning and organizing data
Collecting data sets
Mining data for patterns
Refining algorithms
Other

Use AWS Glue to cleanse, prep, and catalog
AWS Glue Data Catalog: a single view across your data lake
• Automatically discovers data and stores schema
• Makes data searchable and available for ETL
• Contains table definitions and custom metadata
Use AWS Glue ETL jobs to cleanse, transform, and store processed data
• Serverless Apache Spark environment
• Use Glue ETL libraries or bring your own code
• Write code in Python or Scala
• Call any AWS API using the AWS SDK for Python (boto3)
Diagram: Amazon S3 (raw data), Amazon S3 (staging data), and Amazon S3 (processed data), each with a crawler feeding the AWS Glue Data Catalog
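
The crawler-to-catalog flow can be sketched with boto3-style calls. This is a minimal illustration that assumes the `glue` argument behaves like `boto3.client("glue")`; the helper names are made up for this example, and waiting for the crawler to finish is omitted.

```python
def crawl_prefix(glue, name, role_arn, database, s3_path):
    """Create a crawler for one S3 prefix and start it.

    `glue` is expected to behave like boto3.client("glue").
    """
    glue.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue.start_crawler(Name=name)


def table_columns(glue, database, table):
    """Return (column name, type) pairs stored in the Data Catalog for a table."""
    response = glue.get_table(DatabaseName=database, Name=table)
    columns = response["Table"]["StorageDescriptor"]["Columns"]
    return [(c["Name"], c["Type"]) for c in columns]
```

Once a crawler run has completed, `table_columns` shows the schema it inferred, which is a quick way to spot surprises such as numbers that were stored as strings.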

Data Lakes, Analytics, and ML Portfolio from AWS
Broadest, deepest set of analytic services
• Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight
• Machine Learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly
• On-premises Data Movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service, AWS Storage Gateway
• Real-time Data Movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams
• Data Lake on AWS: Storage | Archival Storage | Data Catalog

Amazon EMR—Big Data Processing
• Low cost: flexible billing with per-second billing, EC2 Spot, Reserved Instances and auto-scaling to reduce costs by 50–80%
• Easy: launch fully managed Hadoop and Spark in minutes; no cluster setup, node provisioning, or cluster tuning
• Latest versions: updated with the latest open source frameworks within 30 days of release
• Use S3 storage: process data directly in the S3 data lake securely and with high performance using the EMRFS connector

Amazon EMR— More than just managed Hadoop

Amazon Redshift—Data Warehousing
• Fast at scale: columnar storage technology to improve I/O efficiency and scale query performance
• Secure: audit everything; encrypt data end-to-end; extensive certification and compliance
• Open file formats: analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3
• Inexpensive: as low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour

Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in the S3 data lake
Diagram: Redshift data and the S3 data lake, both reached through the Redshift Spectrum query engine
• Exabyte-scale Redshift SQL queries against S3
• Join data across Redshift and S3
• Scale compute and storage separately
• Stable query performance and unlimited concurrency
• CSV, ORC, Avro, & Parquet data formats
• Pay only for the amount of data scanned
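
Paying per amount of data scanned makes file format a cost decision. The following back-of-the-envelope Python sketch shows the arithmetic; the price per terabyte, the compression ratio, and the fraction of columns read are all hypothetical inputs chosen for illustration, not published figures.

```python
TB = 1024 ** 4  # assumption: one billed terabyte is 2**40 bytes


def scan_cost(bytes_scanned, price_per_tb):
    """Cost of one query when billing is proportional to bytes scanned."""
    return bytes_scanned / TB * price_per_tb


def columnar_scan_bytes(raw_bytes, compression_ratio, column_fraction):
    """Estimate bytes scanned after converting raw data to a compressed
    columnar format and reading only some of the columns."""
    return raw_bytes * compression_ratio * column_fraction


PRICE = 5.0  # hypothetical USD per TB scanned; check current pricing
raw = 1 * TB
before = scan_cost(raw, PRICE)
# Hypothetical: files shrink to 25% of raw size, query touches 20% of columns.
after = scan_cost(columnar_scan_bytes(raw, 0.25, 0.2), PRICE)
print(f"raw CSV: ${before:.2f}  compressed columnar: ${after:.2f}")
```

The same query gets cheaper in two independent ways: compression shrinks every byte read, and a columnar layout lets the engine skip columns the query never references.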

Let’s play a game
Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017
https://youtu.be/RpPf38L0HHU?t=3963

Numbers are fun

Amazon Athena—Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
• Query instantly: zero setup cost; just point to S3 and start querying
• Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
• Easy: serverless, with zero infrastructure and zero administration; integrated with QuickSight
• Pay per query: pay only for queries run; save 30–90% on per-query costs through compression
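
"Just point to S3 and start querying" can be sketched programmatically as well. This is a minimal example that assumes the `athena` argument behaves like `boto3.client("athena")`; the function name is made up, and the database, table, and output location used when calling it are up to the reader.

```python
import time


def run_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Submit a query and block until it finishes.

    Returns (query id, final state). `athena` is expected to behave like
    boto3.client("athena"); results are written under `output_location`.
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)  # query is still queued or running
```

There is no cluster to start or stop in this flow: the only inputs are the SQL text, the catalog database, and an S3 location for the results. A production version would add a timeout and read the failure reason when the state is not `SUCCEEDED`.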

Amazon QuickSight
• Easy
• Empower everyone
• Seamless connectivity
• Fast analysis
• Serverless
Now with ML superpowers!

AWS Provides Highest Levels of Security
• Compliance: AWS Artifact, Amazon Inspector, AWS CloudHSM, Amazon Cognito, AWS CloudTrail
• Security: Amazon GuardDuty, AWS Shield, AWS WAF, Amazon Macie, VPC
• Encryption: AWS Certificate Manager, AWS Key Management Service, encryption at rest, encryption in transit, bring your own keys, HSM support
• Identity: AWS IAM, AWS SSO, Amazon Cloud Directory, AWS Directory Service, AWS Organizations
Customers need multiple levels of security, identity and access management, encryption, and compliance to secure their data lake

Compliance: Virtually Every Regulatory Agency
Global
• CSA: Cloud Security Alliance Controls
• ISO 9001: Global Quality Standard
• ISO 27001: Security Management Controls
• ISO 27017: Cloud Specific Controls
• ISO 27018: Personal Data Protection
• PCI DSS Level 1: Payment Card Standards
• SOC 1: Audit Controls Report
• SOC 2: Security, Availability, & Confidentiality Report
• SOC 3: General Controls Report
United States
• CJIS: Criminal Justice Information Services
• DoD SRG: DoD Data Processing
• FedRAMP: Government Data Standards
• FERPA: Educational Privacy Act
• FIPS: Government Security Standards
• FISMA: Federal Information Security Management
• GxP: Quality Guidelines and Regulations
• FFIEC: Financial Institutions Regulation
• HIPAA: Protected Health Information
• ITAR: International Arms Regulations
• MPAA: Protected Media Content
• NIST: National Institute of Standards and Technology
• SEC Rule 17a-4(f): Financial Data Standards
• VPAT/Section 508: Accountability Standards
Asia Pacific
• FISC [Japan]: Financial Industry Information Systems
• IRAP [Australia]: Australian Security Standards
• K-ISMS [Korea]: Korean Information Security
• MTCS Tier 3 [Singapore]: Multi-Tier Cloud Security Standard
• My Number Act [Japan]: Personal Information Protection
Europe
• C5 [Germany]: Operational Security Attestation
• Cyber Essentials Plus [UK]: Cyber Threat Protection
• G-Cloud [UK]: UK Government Standards
• IT-Grundschutz [Germany]: Baseline Protection Methodology

Fortnite | 125+ million players
CHALLENGE
• Need to create a constant feedback loop for designers
• Gain an up-to-the-minute understanding of gamer satisfaction to guarantee gamers are engaged, thus resulting in the most popular game played in the world

Epic Games uses Data Lakes and analytics
• Entire analytics platform running on AWS
• S3 leveraged as a Data Lake
• All telemetry data is collected with Kinesis
• Real-time analytics done through Spark on EMR, with DynamoDB to create scoreboards and real-time queries
• Use Amazon EMR for large batch data processing
• Game designers use data to inform their decisions
Architecture diagram (labels only):
• Telemetry sources: game clients, game servers, launcher, game services
• Ingestion: Kinesis
• Near-real-time pipelines: Spark on EMR, DynamoDB, Grafana, Scoreboards API, limited raw data (real-time ad-hoc SQL), user ETL (metric definition)
• Batch pipelines: S3 (Data Lake), ETL using EMR, Tableau/BI, ad-hoc SQL
• Other inputs: APIs, databases, S3, other sources

Building data lakes can still take months

Typical steps of building a data lake
1. Set up storage
2. Move data
3. Cleanse, prep, and catalog data
4. Configure and enforce security and compliance policies
5. Make data available for analytics

How it works: AWS Lake Formation
Diagram: data sources (OLTP, ERP, CRM, LOB, devices, web, sensors, social, Kinesis); AWS Lake Formation on top of S3, IAM and KMS; a Data Catalog serving Athena, Amazon Redshift, AI Services, Amazon EMR and Amazon QuickSight
Build data lakes quickly
• Identify, crawl, and catalog sources
• Ingest and clean data
• Transform into optimal formats
Simplify security management
• Enforce encryption
• Define access policies
• Implement audit logging
Enable self-service and combined analytics
• Analysts discover all data available for analysis from a single data catalog
• Use multiple analytics tools over the same data

Customer interest in AWS Lake Formation
“We are very excited about the launch of AWS Lake
Formation, which provides a central point of control to
easily load, clean, secure, and catalog data from thousands of
clients to our AWS-based data lake, dramatically reducing
our operational load. … Additionally, AWS Lake Formation
will be HIPAA compliant from day one …”
- Aaron Symanski, CTO, Change Healthcare
“I can’t wait for my team to get our hands on AWS Lake
Formation. With an enterprise-ready option like Lake
Formation, we will be able to spend more time deriving
value from our data rather than doing the heavy lifting
involved in manually setting up and managing our data lake.”
- Joshua Couch, VP Engineering, Fender Digital

Javier Ramirez
@supercoco9
Thank you
Follow us on Twitter: https://twitter.com/awscloud_es
Webinars and events: https://aws.amazon.com/es/about-aws/events/eventos-es/
Contact: https://aws.amazon.com/es/contact-us/
News and updates: https://aws.amazon.com/es/new
Don’t forget to fill in the survey we will send you to help us improve
