Ingest, Transform & Visualize w Amazon Web Services

Cloudwick© 2015 Cloudwick. All rights reserved. Confidential and Proprietary.
Data Ingestion, Transformation, and Visualization
with Amazon Web Services
Sponsored by Amazon Web Services and Intel (Venue )
Presented by Arun Kumar Palathumpattu (Cloudwick)

Cloudwick © 2015 Cloudwick. All rights reserved. Confidential and Proprietary.
Cloudwick

Agenda Cloudwick
• Introduction
• Common Challenges in Data Analytics Environments
• Creating Data Lake – Single source of truth
• Build Effective Data work flow
• Demo
• Q&A

Top 5 Challenges in Building Data Analysis
Environments
Cloudwick

# 1 Market Landscape
Cloudwick

Cloudwick
Image : http://mwicorp.com/lake-okeechobee-water-transfer/
Structured
# 2 Issues of Storing All the Data

# 3 Time it Takes to Find the Useful Information
Cloudwick
Image : https://img.clipartfest.com/0ffb2c38607970437e20a6fdd2872eb9_4-benefits-from-taking-guitar-time-value-of-money-clipart_500-500.png

# 4 Security
Cloudwick
Image : https://insights.ubuntu.com/2017/03/20/three-flaws-at-the-heart-of-iot-security/

# 5 Skill Gap
Cloudwick
https://www.linkedin.com/pulse/wonder-three-years-analytics-big-data-skills-most-demand-hosseini

Evolution of Data Architectures
1985: Data Warehouse Appliances Benefits
• Consolidated multiple decision support
environments (i.e. databases) into a single
architecture
• Best performance available at time of conception,
hence the expensive licenses
• Worked well with structured, columnar data
• Could build customized data marts on top
Shared Storage Tier
(NAS Appliance)
Compute
Node
Compute
Node
Compute
Node
Compute
Node
• Proprietary software license paid per node per
year
• Gold-plated hardware available only from the
vendor with per node per year cost
Constraints
• Proprietary software license paid per node per year
• Gold-plated hardware available only from the
vendor with per node per year cost
• Could not handle unstructured data sets
• Heavy ETL & data cleansing

2006: Hadoop Clusters
CPU
Memory
HDFS Storage
Hadoop Master Node
CPU
Memory
HDFS Storage
CPU
Memory
HDFS Storage
Improvements
• Open source based software license!!!
• Commodity white box servers!!!!
• Could handle structured & unstructured data sets
• Many different applications within the framework
(MapReduce, Spark, Hive, Pig, HBase, Presto, etc.)
Constraints
• HDFS 3X replication to protect against node failure
gets expensive at scale
• 500 TB data set = 1.5 PB cluster
• Local storage means you must scale and pay for
CPU & memory resources when adding data
capacity
• General purpose, monolithic cluster with many
different apps on same hardware
• Still a data silo

Legacy Data Architectures Exist as Isolated Data Silos
Hadoop
Cluster
SQL
Database
Data
Warehouse
Appliance

2009: Decoupled EMR Architecture
CPU
Memory
Hadoop Master Node
CPU
Memory
CPU
Memory
Improvements
• Decoupled storage & compute
• Scale CPU and memory resources independently
and up & down
• Only pay for the 500 TB data set (not 3X)
• Multi-physical facility replication via S3
• Multiple clusters can run in parallel against shared
data in S3
• Each job gets its own optimized cluster. i.e. Spark
on memory intensive, Hive on CPU intensive, HBase
on I/O intensive, etc.
Constraints
• Still have a cluster to provision and manage
• Must expose EMR cluster to SQL users via Hive,
Presto, etc.
S3 as HDFS

2012: Amazon Redshift – Cloud DW
Improvements
Constraints
• Still have to load data into a schema
Leader node
Compute node
10 GigE
(HPC)
Ingestion
Backup
Restore
Customer VPC
Internal VPC
BI tools SQL clientsAnalytics tools
Compute node Compute node
JDBC/ODBC
• Automated installation, patching, backups
• No servers to manage and maintain
• MPP Columnar relational database
• $1,000 / TB / Year
• Accessible to any ODBC or JDBC BI Tool

Today: Clusterless Improvements
• No cluster/infrastructure to manage
• Business users and analysts can write SQL without
having to provision a cluster or touch infrastructure
• Pay by the query
• Zero Administration
• Process data where it lives
Constraints
• Limited to SQL, Hive and Spark jobs today. More
frameworks to come!
SQL Interface in web
browser
Athena for SQL
S3 Data Lake
Glue for ETL
S3 Data Lake
Spark & Hive Interface in
web browser

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building a Flexible Data Lake
Architecture on AWS

Enter Data Lake Architectures
Data Lake is a new and increasingly
popular architecture to store and analyze
massive volumes and heterogeneous
types of data.
Separating your storage and compute
allows you to scale each component as
required
“How can I scale up with the
volume of data being generated?”

Benefits of a Data Lake – All Data in One Place
Store and analyze all of your data,
from all of your sources, in one
centralized location.
“Why is the data distributed in
many locations? Where is the
single source of truth ?”
Quickly ingest data
without needing to force it into a
pre-defined schema.
“How can I collect data quickly
from various sources and store it
efficiently?”

Benefits of a Data Lake – Schema on Read
“Is there a way I can apply multiple
analytics and processing frameworks
to the same data?”
A Data Lake enables ad-hoc
analysis by applying schemas
on read, not write.

Benefits of an AWS S3 Data Lake
Fixed Cluster Data Lake AWS S3 Data Lake
• Limited to only the single tool contained on
the cluster (i.e. Hadoop or data warehouse
or Cassandra, etc.). Use cases & ecosystem
tools change rapidly
• Expensive to add nodes to add storage
capacity
• Expensive to replicate data against node
loss
• Complexity in scaling local storage capacity
• Long refresh cycles to add additional
storage equipment
• Decouple storage and compute by making
S3 object based storage, not a fixed tool
cluster the data lake
• Flexibility to use any and all tools in the
ecosystem. The right tool for the job
• Future proof your architecture. As new use
cases and new tools emerge you can plug
and play current best of breed.

Designed for 11 9s
of durability
Designed for
99.99% availability
Durable Available High performance
Multiple upload
Range GET
Store as much as you need
Scale storage and compute
independently
No minimum usage commitments
Scalable
Amazon EMR
Amazon Redshift
Amazon DynamoDB
Integrated
Simple REST API
AWS SDKs
Read-after-create consistency
Event notification
Lifecycle policies
Easy to use
S3 for data lake

Building a Data Lake on AWS
Kinesis Firehose
Athena
Query Service

Processing & Analytics
Real-time Batch
AI & Predictive
BI & Data Visualization
Transactional &
RDBMS
AWS Lambda
Apache Storm
on EMR
Apache Flink
on EMR
Spark Streaming
on EMR
Elasticsearch
Service
Kinesis Analytics,
Kinesis Streams
DynamoDB
NoSQL DB Relational Database
Aurora
EMR
Hadoop, Spark,
Presto
Redshift
Data Warehouse
Athena
Query Service
Amazon Lex
Speech
recognition
Amazon
Rekognition
Amazon Polly
Text to speech
Machine Learning
Predictive analytics
Kinesis Streams
& Firehose

Summary of AWS Analytics, Database & AI Tools
Amazon Redshift
Enterprise Data Warehouse
Amazon EMR
Hadoop/Spark
Amazon Athena
Clusterless SQL
Amazon Glue
Clusterless ETL
Amazon Aurora
Managed Relational Database
Amazon Machine Learning
Predictive Analytics
Amazon Quicksight
Business Intelligence/Visualization
Amazon ElasticSearch Service
ElasticSearch
Amazon ElastiCache
Redis In-memory Datastore
Amazon DynamoDB
Managed NoSQL Database
Amazon Rekognition
Deep Learning-based Image Recognition
Amazon Lex
Voice or Text Chatbots

Encryption ComplianceSecurity
Identity and Access
Management (IAM) policies
Bucket policies
Access Control Lists (ACLs)
Private VPC endpoints to
Amazon S3
SSL endpoints
Server Side Encryption
(SSE-S3)
S3 Server Side
Encryption with
provided keys (SSE-C,
SSE-KMS)
Client-side Encryption
Buckets access logs
Lifecycle Management
Policies
Access Control Lists
(ACLs)
Versioning & MFA
deletes
Certifications – HIPAA,
PCI, SOC 1/2/3 etc.
Implement the right cloud security controls

More Efficient Data Lake Architectures
Cloudwick

What is a Modern Enterprise Data
Warehouse?
A Modern Enterprise Data warehouse
(EDW), is designed to support rapid data
growth, quick analytics over relational,
non-relational as well as streaming data,
with an easy and single interface to
consume all these types of data.

Data
Consumption
Building a Modern EDW on AWS
Amazon
Kinesis
Firehose
Amazon S3
Amazon S3
Amazon
EMR
Amaz
on
Redsh
ift
Streamin
g
Un-
structured
Relational
Amaz
on
RDS
Amaz
on
Athen
a
Amazon
Quicksig
ht
Amazo
n
Machin
e
Learnin
g
Analyze
Predict
Data
Mart
EDW
Ad-hoc
query
Data
Ingestion

What is Real-time analytics platform?
Real-time Analytics platform provides an
ability to load and analyze streaming
data, and build custom streaming data
applications for specialized needs. This
platform can handle staggering amounts
of streaming data – sometimes TBs per
hour – that need to be collected, stored,
and processed continuously.

Building a Real-time analytics platform (Streaming
vs batching)
Stream Delivery
Stream Analytics
Stream processing
Event Driven
Kinesis
Streami
ng

Simplify Big Data Processing
ingest /
collect
store process /
analyze
consume /
visualize
data answers
Time to Answer (Latency)
Throughput
Cost

Amazon QuickSight is a Business Analytics Service that lets business users
quickly and easily visualize, explore, and share insights from their data.

Amazon S3 Data Lake
Visualizing the Data Lake
Amazon Athena

Cloudwick
DEMO

Cloudwick© 2015 Cloudwick. All rights reserved. Confidential and Proprietary.
Credits
Event Sponsored by Amazon Web Services
Venue by Intel
Presented by Arun Kumar Palathumpattu (Cloudwick)
Slides credit : AWS + Cloudwick

Ingest, Transform & Visualize w Amazon Web Services

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Ingest, Transform & Visualize w Amazon Web Services

Similar to Ingest, Transform & Visualize w Amazon Web Services (20)

More from BigDataCamp

More from BigDataCamp (11)

Recently uploaded

Recently uploaded (20)

Ingest, Transform & Visualize w Amazon Web Services