This document summarizes key aspects of full stack analytics on AWS, covering foundational services for storage, data ingestion, processing and analytics, machine learning, and security. It discusses AWS services such as S3, Athena, Glue, Kinesis, and Rekognition, and how they can be combined for cost-effective analytics from ingestion through machine learning to building smarter applications. Security is addressed at both the service and data levels using IAM, encryption, and third-party integrations.
2. Forces and Trends
Cost Optimization
Licenses
Hardware
Data center and operations
Dark Data
Prematurely discarding data
Agility
Experimentation (data & tools)
Democratised Access to Data
Time-to-first-results
Terminate failed experiments early
From BI to Data Science
In-house data science
From back office to product
8. S3 Data Lifecycle and Events
Standard (active data)
Standard - Infrequent Access (infrequently accessed data)
Amazon Glacier (archive data)
Object events: Create, Delete
9. Data Catalog
Scalable (secure, versioned, durable) storage +
Immutable data at every stage of its lifecycle +
Versioned schema and metadata
=
Data discovery, lineage and governance
10. AWS Glue: Components
Data Catalog
Crawl, store, and search metadata across different data stores
Populate a Hive metastore-compliant catalog
Job Execution
Fully managed orchestration & execution of ETL jobs
Serverless execution model – no need to pre-provision resources
Job Authoring
Author, edit, and share ETL jobs using your favorite tools
Store, share, and re-use ETL code/scripts with Git integration
11. Glue Data Catalog
Manage table metadata through a Hive metastore API or Hive SQL. Supported by tools such as Hive, Presto, Spark, etc.
We added a few extensions:
Search metadata for data discovery
Connection info – JDBC URLs, credentials
Classification for identifying and parsing files
Versioning of table metadata as schemas evolve and other metadata are updated
Populate using Hive DDL, bulk import, or automatically through crawlers.
12. Crawlers: Automatic Schema Inference
[Diagram: crawlers enumerate S3 objects (file 1 … file N), identify the file type and parse each file into a per-file schema for semi-structured data, then merge the results into a unified schema. System classifiers (JSON, CSV, Apache log parsers) and custom classifiers (e.g. app log or metrics parsers) drive the parsing.]
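As a rough illustration of how a crawler might be registered programmatically (not shown in the deck), the boto3 sketch below creates and runs a crawler over an S3 prefix; the database name, IAM role, S3 path, and schedule are hypothetical placeholders:

import boto3

glue = boto3.client("glue")

# Create a crawler that walks an S3 prefix, classifies the files it finds,
# and writes the inferred table schemas into the Glue Data Catalog.
glue.create_crawler(
    Name="raw-data-crawler",
    Role="GlueCrawlerRole",                      # placeholder IAM role
    DatabaseName="analytics-catalog-db",         # placeholder catalog database
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/"}]},
    Schedule="cron(0 2 * * ? *)",                # nightly, to pick up new partitions
)

# Kick off an on-demand run instead of waiting for the schedule.
glue.start_crawler(Name="raw-data-crawler")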
14. Data Access & Authorisation
Give your users easy and secure access
Storage & Catalog
Secure, cost-effective storage in Amazon S3. Robust metadata in the AWS Glue Data Catalog
Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified
15. AWS implements security at the data level, not tool-by-tool
[Diagram: IAM governs Service API Access across Amazon S3, Amazon ElastiCache, Amazon DynamoDB, Amazon EMR, Amazon Kinesis, and Amazon Athena.]
16. Third Party Ecosystem Security Tools
[Diagram: Amazon S3 access logging and AWS CloudTrail API logging feed access log analytics in Amazon Athena, alongside IAM and Amazon EMR integrations. See http://amzn.to/2tSimHj and http://amzn.to/2si6RqS]
+ storage-level support for access logging and audit
17. Additional S3 Security Practices
Use S3 bucket policies:
• Restrict access by IP address
• Restrict deletes
• Enforce encryption use
Restrict deletes to require MFA authentication
Use Versioning!!!
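A minimal sketch of the bucket-policy practices above, assuming a hypothetical bucket name and CIDR range; it denies requests from outside an allowed IP range and rejects uploads that do not request server-side encryption:

import json
import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake"  # placeholder bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Deny any access that does not originate from the allowed CIDR range
            "Sid": "RestrictByIp",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"NotIpAddress": {"aws:SourceIp": "203.0.113.0/24"}},
        },
        {   # Reject PutObject requests that do not specify server-side encryption
            "Sid": "EnforceEncryption",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        },
    ],
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))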
18. Encryption Options
AWS Server-Side Encryption
AWS managed key infrastructure
AWS Key Management Service
Automated key rotation & auditing
Integration with other AWS services
AWS CloudHSM
Dedicated Tenancy SafeNet Luna SA HSM Device
Common Criteria EAL4+, NIST FIPS 140-2
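For example, a single object can be written with SSE-KMS by asking S3 to encrypt it with a KMS key at upload time; the bucket, object key, and key alias below are placeholders, not values from the deck:

import boto3

s3 = boto3.client("s3")

# Upload an object and have S3 encrypt it at rest with a KMS-managed key.
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/events/2017-06-01.json",
    Body=b'{"event": "example"}',
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/data-lake-key",   # placeholder key alias
)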
19. Extensible and Hybrid Crypto Integration for AWS Services
class myCrypt implements EncryptionMaterialsProvider
[Diagram: Amazon Redshift using a custom EncryptionMaterialsProvider backed by an on-premises HSM.]
20. Kinesis Firehose
Data Access & Authorisation
Give your users easy and secure access
Data Ingestion
Get your data into S3 quickly and securely
Storage & Catalog
Secure, cost-effective storage in Amazon S3. Robust metadata in the AWS Glue Data Catalog
Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified
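To ground the data ingestion pillar, here is a hedged boto3 sketch that pushes one event into a Kinesis Firehose delivery stream configured to deliver to S3; the stream name and event payload are hypothetical:

import json
import boto3

firehose = boto3.client("firehose")

# Push a single event into a Firehose delivery stream that buffers records
# and writes batches into S3.
event = {"user": "alice", "action": "click", "ts": "2017-06-01T12:00:00Z"}
firehose.put_record(
    DeliveryStreamName="events-to-s3",                       # placeholder stream
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)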
22. S3 Transfer Acceleration
[Diagram: uploader sends data to a nearby AWS edge location, which relays it to the S3 bucket with optimized throughput.]
Typically 50%-400% faster
Change your endpoint, not your code
No firewall exceptions or client software required
59 global edge locations
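"Change your endpoint, not your code" can be approximated in boto3 by switching on the accelerate endpoint; this is only a sketch, the bucket name and file are placeholders, and the bucket must have Transfer Acceleration enabled first:

import boto3
from botocore.config import Config

# One-time: enable Transfer Acceleration on the bucket ("my-data-lake" is a placeholder).
boto3.client("s3").put_bucket_accelerate_configuration(
    Bucket="my-data-lake",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Point the SDK at the accelerate endpoint; uploads then enter the AWS network
# at the nearest edge location instead of travelling the public internet end to end.
s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3.upload_file("archive.tar", "my-data-lake", "ingest/archive.tar")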
23. How Fast is S3 Transfer Acceleration?
[Chart: time in hours to upload 500 GB to a bucket in Singapore from edge locations including Rio de Janeiro, Warsaw, New York, Atlanta, Madrid, Virginia, Melbourne, Paris, Los Angeles, Seattle, Tokyo, and Singapore, comparing the public internet with S3 Transfer Acceleration.]
25. Write Database Changes to S3 with DMS
Full Load: <schema_name>/<table_name>/LOAD001.csv, <schema_name>/<table_name>/LOAD002.csv
Change Data Capture: <schema_name>/<table_name>/<time-stamp>.csv
26. Kinesis Firehose
Athena
Query Service Glue
Data Access & Authorisation
Give your users easy and secure access
Data Ingestion
Get your data into S3 quickly and securely
Processing & Analytics
Use of predictive and prescriptive analytics to gain better understanding
Storage & Catalog
Secure, cost-effective storage in Amazon S3. Robust metadata in the AWS Glue Data Catalog
Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified
Machine Learning
Predictive analytics
Amazon AI
27. Glue: Managed ETL
• Serverless job execution
• PySpark transformations
• Monitoring, metrics and notifications
• Combine with AWS Lambda and AWS Step Functions for complex data orchestrations
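A minimal Glue PySpark job skeleton, assuming a catalog database and table created by a crawler plus a hypothetical output path; it reads a catalog table, renames a couple of columns, and writes Parquet back to S3 (this only runs inside the Glue job environment):

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a table the crawler registered in the Data Catalog (placeholder names).
events = glueContext.create_dynamic_frame.from_catalog(
    database="analytics-catalog-db", table_name="raw_events"
)

# Rename/cast a few columns, then write out as Parquet.
mapped = ApplyMapping.apply(
    frame=events,
    mappings=[("user", "string", "user_id", "string"),
              ("ts", "string", "event_time", "timestamp")],
)
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/curated/events/"},  # placeholder path
    format="parquet",
)
job.commit()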
29. Amazon Kinesis Analytics
• Interact with streaming data in real time using SQL
• Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms
30. SELECT STREAM author,
       count(author) OVER ONE_MINUTE
FROM Tweets
WHERE text LIKE '%#AWSSummit%'
WINDOW ONE_MINUTE AS
  (PARTITION BY author
   RANGE INTERVAL '1' MINUTE PRECEDING);
Amazon Kinesis Analytics – Simple SQL Interface
32. Amazon Athena
• No infrastructure or administration
• Zero spin-up time
• Transparent upgrades
• Query data in its raw format
• Avro, text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized form like ORC or Parquet for the best performance and lowest cost
• No loading of data, no ETL required
• Stream data directly from Amazon S3, taking advantage of Amazon S3 durability and availability
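Athena queries can also be driven programmatically; the sketch below submits a query with boto3 and polls for completion, with the database, table, and results location as placeholders:

import time
import boto3

athena = boto3.client("athena")

# Run an ad hoc SQL query directly over data catalogued from S3.
response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM weblogs GROUP BY status",
    QueryExecutionContext={"Database": "analytics-catalog-db"},          # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},   # placeholder
)

# Poll for completion, then fetch the first page of results.
query_id = response["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)
results = athena.get_query_results(QueryExecutionId=query_id)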
33. Simple query editor with syntax highlighting and autocomplete
Data Catalog
Query history, saved queries, and catalog management
34. QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources including Amazon Athena
Amazon RDS
Amazon S3
Amazon Redshift
Amazon Athena
Using Amazon Athena with Amazon QuickSight
37. Add Machine Learning Capabilities
Amazon Machine Learning Service
Batch and online predictions
Train using data in S3, RDS and Redshift
Amazon EMR
Comprehensive machine learning libraries (e.g. Spark MLlib, Anaconda)
Provision analytics clusters in minutes, autoscale with data volume or query demand
38. Amazon AI Services
Amazon Polly – Lifelike Text-to-Speech
47 voices, 24 languages
Low-latency, real time
Amazon Rekognition – Image Analysis
Object and scene detection
Facial analysis
Amazon Lex – Conversational Engine
Speech and text recognition
Enterprise connectors
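As an illustration of the image analysis capability, a hedged Rekognition call that labels objects and scenes in an image already stored in S3; the bucket, key, and thresholds are placeholders:

import boto3

rekognition = boto3.client("rekognition")

# Detect objects and scenes in an image stored in S3.
labels = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-data-lake", "Name": "images/street.jpg"}},
    MaxLabels=10,
    MinConfidence=80,
)
for label in labels["Labels"]:
    print(label["Name"], round(label["Confidence"], 1))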
40. Up to ~40k CUDA cores
Pre-configured CUDA drivers
Jupyter notebook with Python 2, Python 3, Anaconda
CloudFormation Template
AWS Marketplace – one-click deploy
AWS Deep Learning AMI
41. Scaling Distributed Experiments
• Inception v3 model
• Increasing machines from 1 to 47
• 2x faster than TensorFlow if using more than 10 machines
43. Kinesis Firehose
Athena
Query Service Glue
Machine Learning
Predictive analytics
Data Access & Authorisation
Give your users easy and secure access
Data Ingestion
Get your data into S3 quickly and securely
Processing & Analytics
Use of predictive and prescriptive analytics to gain better understanding
Storage & Catalog
Secure, cost-effective storage in Amazon S3. Robust metadata in the AWS Glue Data Catalog
Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified
Amazon AI
Any modern data and analytics architecture must address a number of forces and trends
Technologies come and go
Data has a geological lifespan
Store all your data, forever, at every stage of its lifecycle
Apply it using the appropriate technology
As we talk to customers on their cloud journey and walk through our cloud adoption framework, at each step of the process (Strategy -> Plan -> Build/Iterate -> Run), storage is a critical element to be considered. In fact, storage is central to virtually every workload running in AWS today.
So as we begin to think about either cloud native or cloud migration strategies, think about storage as a strategic, foundational element.
Once your data is stored in the cloud, the world of AWS service offerings opens up to you.
A good architecture reduces irreversibility, and allows you to defer decisions to a later point in time while locking in key parameters, such as cost.
That is, you can anticipate how much a deferred decision will cost when you have to make it.
With our foundational storage capability we don’t want to have to make upfront, irreversible decisions about capacity or data format.
What if I have to choose capacity up front, but then exceed it? I close down certain opportunities.
What if I never use my reserved capacity? I pay for wasted space.
If I have to choose a specific format, perhaps a proprietary format, up front, I potentially exclude certain opportunities in the future.
Related to this, we don’t want to choose a foundational storage strategy that results in a “pager architecture”
We don’t want to be in a position where, in order to guarantee durability and availability of our data, we always need someone on call to replace a failed node and restore 3x redundancy.
We want something that provides for near-infinite scalability, a range of data formats, and high durability and availability guarantees.
S3 allows you to keep your data in a secure, cheap, near-infinitely scalable environment, without having to make up-front decisions about capacity or data formats. To that extent, S3 is the most architecturally significant element in your data architecture.
Storage is more than just the protocol or interface. It’s the lifeblood of application design and renewed architectures. Our customers have taught us that they need two things: scale and trust. 1. Make sure I can grow. 2. Make sure I can access what I need when I need it, (and of course help me keep costs down).
The suite of transfer services that support customers in their migrations means more choice. Large batches, incremental changes, constant streams or seamless integration are all part of the storage offering. Today we’re going to talk about two of the newest ways to do cloud data migration, Snowball and S3 Transfer Acceleration.
By convention, S3 has been at the heart of our “data lake” architecture for many years, but more and more it is being integrated with our data and analytics services:
RDS, Redshift, etc. backup to S3
Athena can query against S3 using SQL
Redshift Spectrum can join data in S3 with data in Redshift
Kinesis Firehose ingests streaming data into S3
EMR can treat S3 as near-infinite capacity, highly durable HDFS
And so on…
S3 isn’t just dumb storage: it allows you to manage data lifecycle and act on data events
S3 Standard – general purpose storage class. High durability, availability, and performance. Use it if you don’t want to think about your data access patterns.
Glacier – archival storage with 3-5 hours of retrieval time. Low cost, with pricing starting at seven-tenths of a cent.
Many AWS customers store backups or log files that are almost never read, or whose access frequency drops as the data ages, but that still need immediate access when requested.
S3 Standard-IA is a newer Amazon S3 storage class designed for colder or less frequently accessed workloads. It offers the same high performance, high throughput, and low latency as S3 Standard.
Low cost, with storage starting at one and a quarter cents per GB.
If you think about the typical lifecycle of data, newly created active data is accessed very frequently.
In our example, take a new video clip you share with your friends and family. People will consume this new data actively; the video will be played back, shared, and commented on very frequently.
As the video gets older, fewer people will engage with it, and it will be LESS FREQUENTLY accessed.
If you don’t want to think about your data access patterns but just want high durability, availability and performance from Amazon S3, you can simply select S3 Standard.
For data that is less frequently accessed, you can leverage Amazon S3 Standard-IA to save on cost while still benefiting from the same great durability and performance as S3 Standard.
At some point your data will be ready to archive, because no one is actively interacting with it and you need to archive it away for record keeping, etc.
In addition to transitioning your data to Standard-IA as its characteristics change, you can also use Standard-IA for new data that fits the bill for infrequently accessed data. For example, you can use the Standard-IA storage class to store detailed application logs that you analyze infrequently and save on storage cost.
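The Standard -> Standard-IA -> Glacier progression described above can be automated with a lifecycle rule; the following boto3 sketch uses a hypothetical bucket, prefix, and transition ages:

import boto3

s3 = boto3.client("s3")

# Age data down the storage classes as it cools: Standard for the first 30 days,
# Standard-IA until one year, then Glacier.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",                 # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "age-out-logs",
            "Filter": {"Prefix": "logs/"},  # placeholder prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }]
    },
)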
Near infinite (secure, versioned, highly durable) storage + immutable data at every stage of its lifecycle + versioned metadata = data lineage and governance
Data Catalog: A metadata store that automatically organizes the metadata for all your data assets across your business.
You can organize and search your assets.
ETL system: An engine that automatically generates ETL scripts and allows you to orchestrate, monitor, refine and manage your jobs
Securely control access to all digital resources based on users, groups, and application roles
In S3 we can control access at the bucket and even at the object level: not only who can access an object, but what they can do with it
But you can layer on additional security
Apache Ranger or Knox: a pluggable security layer for Hive that allows AD-federated access to data
Comprehensive auditing of all data access API calls via CloudTrail, which you can then analyze with Athena
We can say here that these strategies can help give additional protection against ransomware attacks
For additional security, enable MFA (multi-factor authentication) delete, which requires additional authentication to:
Change the versioning state of your bucket
Permanently delete an object version
MFA delete requires both your security credentials and a code from an approved authentication device
Protection even if you give your account credentials to the wrong person or a malicious employee
Protects against, and lets you recover from, unintended user deletes or application logic failures,
with no performance penalty.
Keeps all versions: new uploads are stored separately; on delete, the latest version is retained and a delete marker is added.
You can retrieve deleted objects or roll back to previous versions.
Three states: ** default (unversioned) – no versions saved, deleted objects cannot be retrieved; ** versioning-enabled – as discussed, saves versions of overwritten or deleted objects; ** suspended – all saved versions are maintained, but new versions are not created.
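A hedged boto3 sketch of the versioning and MFA Delete settings discussed above; the bucket name and MFA device serial/code are placeholders, and MFA Delete can only be enabled with the root account’s credentials:

import boto3

s3 = boto3.client("s3")

# Turn on versioning so overwrites and deletes keep prior object versions.
s3.put_bucket_versioning(
    Bucket="my-data-lake",   # placeholder bucket name
    VersioningConfiguration={"Status": "Enabled"},
)

# MFA Delete must be supplied as "<device-serial> <code>"; shown with placeholder values.
s3.put_bucket_versioning(
    Bucket="my-data-lake",
    MFA="arn:aws:iam::123456789012:mfa/root-device 123456",
    VersioningConfiguration={"Status": "Enabled", "MFADelete": "Enabled"},
)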
AWS offers a number of encryption options that allow you to vary the security based on where the key is stored and who has access to it.
AWS SSE
At the simplest level you can take advantage of AWS SSE. Integrated with S3 and Redshift. Encrypts automatically and transparently, which makes it very easy to use.
AWS KMS
Create and control encryption keys, rotate them
Centralized and fully managed, so you focus on encryption needs, not infrastructure
CloudHSM
Dedicated tenancy hardware security modules.
Certified infrastructure where AWS has absolutely no access to your encryption information.
Designed to destroy the keys rather than allow you access into the system.
Clustered for HA so that your keys are secure and durable.
HSM = hardware security modules
Custom encryption materials provider – use any strategy for providing encryption materials, such as integrating with existing key management systems.
One-off migrations
Batch uploads
Streaming data – whether streaming events from IoT devices, logs, clickstream providers, or change data capture events from an on-premises relational database
Transferring large files over a long distance can be challenging, whether you are moving data across continents, moving a large number of objects, or serving customers a long way from AWS regions.
Accelerate transfer to S3 using the AWS edge network. It leverages POP locations to ensure your transfers travel a shorter distance on the public internet and then travel the remaining portion over an optimized route via the Amazon backbone.
Faster or free: there is no cost for using Transfer Acceleration if the upload is not faster. If performance is the same as a normal upload, you don’t pay for it.
S3-XA uses standard TCP and HTTP so it does not require any firewall exceptions or custom software installation.
With Transfer Acceleration, a long-distance upload spends less time on the variable public internet; the transfer travels over the optimized Amazon backbone, which has much more stable connectivity than the internet. Think of it as a freeway with a performance booster, because we know the road is open.
Although the primary use case is upload/ingestion to S3, we have seen customers use it for downloads, such as sharing video on an individual basis where very few people pull the file; downloading through Transfer Acceleration is a fast path.
Customers also use Transfer Acceleration to improve their upload availability over spotty internet connections; when uploading files across regions with poor connectivity, it increases their upload availability.
Users associate performance with availability: if an upload or download takes a long time, they assume it is not working properly, and may not be patient enough and cancel. Finishing faster helps.
Download files from S3 when the files are not pulled down frequently (a single user versus many). Customers saw a benefit in pulling files from S3 as quickly as possible by leveraging Transfer Acceleration; for files that are frequently accessed, CloudFront, which caches your data at the edge, is recommended.
Time it takes to upload a 500 GB object to Singapore from various locations.
Yellow bar vs. blue bar: the greater the difference between the two bars, the more the improvement.
1/ the farther your bucket, the more benefit from moving over the AWS network.
2/ the larger the file you upload, the more benefit you’ll see.
A lot less variability across all locations.
Transform data – ETL or computationally expensive operation on dataset that emits another dataset
Analytics over streaming data – understand something about the current state of the world: end-user or system behaviours
Interactive, ad hoc, exploratory analysis over our data
Build predictive analytics that help us forecast the future and pre-emptively act on insights we generate
You can then edit these transformations, if necessary, using the tools and technologies you already know, such as Python, Spark, Git and your favorite integrated development environment (IDE), and share them with other AWS Glue users.
Work with fast moving data
Accessible language: SQL.
Manipulate data on the stream; describe filters, projections, functions.
Emit another data source as a stream -> could push to Kibana, for example.
Sliding window
for each new record that appears on the stream, we emit an output by applying aggregates on tweets in the preceding 1-minute window.
Ad hoc, interactive, exploratory analyses over TB or PB data in S3
P2 instances – GPU accelerated instances
Up to 16 GPUs per instance, nearly 40,000 CUDA cores
CUDA is NVIDIA’s GPU-accelerated parallel computing programming model
Inception: architecture for training image recognition system
16 x p2.16xlarge – scale out beyond a single instance and get near-linear scalability: over half a million parallel processing cores.
US- and China-based technology company developing autonomous driving technology.
Mission: set a new standard on safety, reliability, and efficiency in the trucking industry.