This document summarizes key aspects of full stack analytics on AWS, covering foundational services for storage, data ingestion, processing and analytics, machine learning, and security. It discusses AWS services such as S3, Athena, Glue, Kinesis, and Rekognition, and how they can be combined for cost-effective analytics from ingestion through machine learning to building smarter applications. Security is addressed at both the service and data levels using IAM, encryption, and third-party integrations.
2. Forces and Trends
Cost Optimization
Licenses
Hardware
Data center and operations
Dark Data
Prematurely discarding data
Agility
Experimentation (data & tools)
Democratised Access to Data
Time-to-first-results
Terminate failed experiments early
From BI to Data Science
In-house data science
From back office to product
8. S3 Data Lifecycle and Events
Standard (active data)
Standard - Infrequent Access (infrequently accessed data)
Amazon Glacier (archive data)
Object events: Create, Delete
9. Data Catalog
Scalable (secure, versioned, durable) storage +
Immutable data at every stage of its lifecycle +
Versioned schema and metadata
=
Data discovery, lineage and governance
10. AWS Glue: Components
Data Catalog
Crawl, store, and search metadata across different data stores
Populate a Hive metastore-compliant catalog
Job Execution
Fully managed orchestration & execution of ETL jobs
Serverless execution model – no need to pre-provision resources
Job Authoring
Author, edit, and share ETL jobs using your favorite tools
Store, share, and re-use ETL code/scripts with Git integration
11. Glue Data Catalog
Manage table metadata through a Hive metastore API or Hive SQL. Supported by tools such as Hive, Presto, Spark, etc.
We added a few extensions:
Search metadata for data discovery
Connection info – JDBC URLs, credentials
Classification for identifying and parsing files
Versioning of table metadata as schemas evolve and other metadata are updated
Populate using Hive DDL, bulk import, or automatically through crawlers.
12. Crawlers: Automatic Schema Inference
[Diagram: crawlers enumerate S3 objects (file 1 … file N), identify the file type and parse each file into a per-file schema for semi-structured data, then merge the results into a unified schema. System classifiers (JSON, CSV, Apache log parsers) and custom classifiers (e.g. app log or metrics parsers) drive the parsing.]
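As a rough illustration of how a crawler might be registered programmatically (not shown in the deck), the boto3 sketch below creates and runs a crawler over an S3 prefix; the database name, IAM role, S3 path, and schedule are hypothetical placeholders:

import boto3

glue = boto3.client("glue")

# Create a crawler that walks an S3 prefix, classifies the files it finds,
# and writes the inferred table schemas into the Glue Data Catalog.
glue.create_crawler(
    Name="raw-data-crawler",
    Role="GlueCrawlerRole",                      # placeholder IAM role
    DatabaseName="analytics-catalog-db",         # placeholder catalog database
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/"}]},
    Schedule="cron(0 2 * * ? *)",                # nightly, to pick up new partitions
)

# Kick off an on-demand run instead of waiting for the schedule.
glue.start_crawler(Name="raw-data-crawler")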
14. Data Access & Authorisation
Give your users easy and secure access
Storage & Catalog
Secure, cost-effective storage in Amazon S3. Robust metadata in the AWS Glue Data Catalog
Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified
15. AWS implements security at the data level, not tool-by-tool
[Diagram: IAM governs Service API Access across Amazon S3, Amazon ElastiCache, Amazon DynamoDB, Amazon EMR, Amazon Kinesis, and Amazon Athena.]
16. Third Party Ecosystem Security Tools
[Diagram: Amazon S3 access logging and AWS CloudTrail API logging feed access log analytics in Amazon Athena, alongside IAM and Amazon EMR integrations. See http://amzn.to/2tSimHj and http://amzn.to/2si6RqS]
+ storage-level support for access logging and audit
17. Additional S3 Security Practices
Use S3 bucket policies:
• Restrict access by IP address
• Restrict deletes
• Enforce encryption use
Restrict deletes to require MFA authentication
Use Versioning!!!
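A minimal sketch of the bucket-policy practices above, assuming a hypothetical bucket name and CIDR range; it denies requests from outside an allowed IP range and rejects uploads that do not request server-side encryption:

import json
import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake"  # placeholder bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Deny any access that does not originate from the allowed CIDR range
            "Sid": "RestrictByIp",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"NotIpAddress": {"aws:SourceIp": "203.0.113.0/24"}},
        },
        {   # Reject PutObject requests that do not specify server-side encryption
            "Sid": "EnforceEncryption",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        },
    ],
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))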
18. Encryption Options
AWS Server-Side Encryption
AWS managed key infrastructure
AWS Key Management Service
Automated key rotation & auditing
Integration with other AWS services
AWS CloudHSM
Dedicated Tenancy SafeNet Luna SA HSM Device
Common Criteria EAL4+, NIST FIPS 140-2
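For example, a single object can be written with SSE-KMS by asking S3 to encrypt it with a KMS key at upload time; the bucket, object key, and key alias below are placeholders, not values from the deck:

import boto3

s3 = boto3.client("s3")

# Upload an object and have S3 encrypt it at rest with a KMS-managed key.
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/events/2017-06-01.json",
    Body=b'{"event": "example"}',
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/data-lake-key",   # placeholder key alias
)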
19. Extensible and Hybrid Crypto Integration for AWS Services
class myCrypt implements EncryptionMaterialsProvider
[Diagram: Amazon Redshift using a custom EncryptionMaterialsProvider backed by an on-premises HSM.]
20. Kinesis Firehose
Data Access & Authorisation
Give your users easy and secure access
Data Ingestion
Get your data into S3 quickly and securely
Storage & Catalog
Secure, cost-effective storage in Amazon S3. Robust metadata in the AWS Glue Data Catalog
Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified
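To ground the data ingestion pillar, here is a hedged boto3 sketch that pushes one event into a Kinesis Firehose delivery stream configured to deliver to S3; the stream name and event payload are hypothetical:

import json
import boto3

firehose = boto3.client("firehose")

# Push a single event into a Firehose delivery stream that buffers records
# and writes batches into S3.
event = {"user": "alice", "action": "click", "ts": "2017-06-01T12:00:00Z"}
firehose.put_record(
    DeliveryStreamName="events-to-s3",                       # placeholder stream
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)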
22. S3 Transfer Acceleration
[Diagram: uploader sends data to a nearby AWS edge location, which relays it to the S3 bucket with optimized throughput.]
Typically 50%-400% faster
Change your endpoint, not your code
No firewall exceptions or client software required
59 global edge locations
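"Change your endpoint, not your code" can be approximated in boto3 by switching on the accelerate endpoint; this is only a sketch, the bucket name and file are placeholders, and the bucket must have Transfer Acceleration enabled first:

import boto3
from botocore.config import Config

# One-time: enable Transfer Acceleration on the bucket ("my-data-lake" is a placeholder).
boto3.client("s3").put_bucket_accelerate_configuration(
    Bucket="my-data-lake",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Point the SDK at the accelerate endpoint; uploads then enter the AWS network
# at the nearest edge location instead of travelling the public internet end to end.
s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3.upload_file("archive.tar", "my-data-lake", "ingest/archive.tar")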
23. How Fast is S3 Transfer Acceleration?
[Chart: time in hours to upload 500 GB to a bucket in Singapore from edge locations including Rio de Janeiro, Warsaw, New York, Atlanta, Madrid, Virginia, Melbourne, Paris, Los Angeles, Seattle, Tokyo, and Singapore, comparing the public internet with S3 Transfer Acceleration.]
25. Write Database Changes to S3 with DMS
Full Load: <schema_name>/<table_name>/LOAD001.csv, <schema_name>/<table_name>/LOAD002.csv
Change Data Capture: <schema_name>/<table_name>/<time-stamp>.csv
26. Kinesis Firehose
Athena
Query Service Glue
Data Access & Authorisation
Give your users easy and secure access
Data Ingestion
Get your data into S3 quickly and securely
Processing & Analytics
Use of predictive and prescriptive analytics to gain better understanding
Storage & Catalog
Secure, cost-effective storage in Amazon S3. Robust metadata in the AWS Glue Data Catalog
Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified
Machine Learning
Predictive analytics
Amazon AI
27. Glue: Managed ETL
• Serverless job execution
• PySpark transformations
• Monitoring, metrics and notifications
• Combine with AWS Lambda and AWS Step Functions for complex data orchestrations
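A minimal Glue PySpark job skeleton, assuming a catalog database and table created by a crawler plus a hypothetical output path; it reads a catalog table, renames a couple of columns, and writes Parquet back to S3 (this only runs inside the Glue job environment):

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a table the crawler registered in the Data Catalog (placeholder names).
events = glueContext.create_dynamic_frame.from_catalog(
    database="analytics-catalog-db", table_name="raw_events"
)

# Rename/cast a few columns, then write out as Parquet.
mapped = ApplyMapping.apply(
    frame=events,
    mappings=[("user", "string", "user_id", "string"),
              ("ts", "string", "event_time", "timestamp")],
)
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/curated/events/"},  # placeholder path
    format="parquet",
)
job.commit()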
29. Amazon Kinesis Analytics
• Interact with streaming data in real time using SQL
• Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms
30. SELECT STREAM author,
       count(author) OVER ONE_MINUTE
FROM Tweets
WHERE text LIKE '%#AWSSummit%'
WINDOW ONE_MINUTE AS
  (PARTITION BY author
   RANGE INTERVAL '1' MINUTE PRECEDING);
Amazon Kinesis Analytics – Simple SQL Interface
32. Amazon Athena
• No infrastructure or administration
• Zero spin-up time
• Transparent upgrades
• Query data in its raw format
• Avro, text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized form like ORC or Parquet for the best performance and lowest cost
• No loading of data, no ETL required
• Stream data directly from Amazon S3, taking advantage of Amazon S3 durability and availability
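Athena queries can also be driven programmatically; the sketch below submits a query with boto3 and polls for completion, with the database, table, and results location as placeholders:

import time
import boto3

athena = boto3.client("athena")

# Run an ad hoc SQL query directly over data catalogued from S3.
response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM weblogs GROUP BY status",
    QueryExecutionContext={"Database": "analytics-catalog-db"},          # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},   # placeholder
)

# Poll for completion, then fetch the first page of results.
query_id = response["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)
results = athena.get_query_results(QueryExecutionId=query_id)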
33. Simple query editor with syntax highlighting and autocomplete
Data Catalog
Query history, saved queries, and catalog management
34. QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources including Amazon Athena
Amazon RDS
Amazon S3
Amazon Redshift
Amazon Athena
Using Amazon Athena with Amazon QuickSight
37. Add Machine Learning Capabilities
Amazon Machine Learning Service
Batch and online predictions
Train using data in S3, RDS and Redshift
Amazon EMR
Comprehensive machine learning libraries (e.g. Spark MLlib, Anaconda)
Provision analytics clusters in minutes, autoscale with data volume or query demand
38. Amazon AI Services
Amazon Polly – Lifelike Text-to-Speech
47 voices, 24 languages
Low-latency, real time
Amazon Rekognition – Image Analysis
Object and scene detection
Facial analysis
Amazon Lex – Conversational Engine
Speech and text recognition
Enterprise connectors
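As an illustration of the image analysis capability, a hedged Rekognition call that labels objects and scenes in an image already stored in S3; the bucket, key, and thresholds are placeholders:

import boto3

rekognition = boto3.client("rekognition")

# Detect objects and scenes in an image stored in S3.
labels = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-data-lake", "Name": "images/street.jpg"}},
    MaxLabels=10,
    MinConfidence=80,
)
for label in labels["Labels"]:
    print(label["Name"], round(label["Confidence"], 1))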
40. Up to ~40k CUDA cores
Pre-configured CUDA drivers
Jupyter notebook with Python 2, Python 3, Anaconda
CloudFormation Template
AWS Marketplace – one-click deploy
AWS Deep Learning AMI
41. Scaling Distributed Experiments
• Inception v3 model
• Increasing machines from 1 to 47
• 2x faster than TensorFlow if using more than 10 machines
43. Kinesis Firehose
Athena
Query Service Glue
Machine Learning
Predictive analytics
Data Access & Authorisation
Give your users easy and secure access
Data Ingestion
Get your data into S3 quickly and securely
Processing & Analytics
Use of predictive and prescriptive analytics to gain better understanding
Storage & Catalog
Secure, cost-effective storage in Amazon S3. Robust metadata in the AWS Glue Data Catalog
Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified
Amazon AI
Any modern data and analytics architecture must address a number of forces and trends
Technologies come and go
Data has a geological lifespan
Store all your data, forever, at every stage of its lifecycle
Apply it using the appropriate technology
As we talk to customers on their cloud journey and walk through our cloud adoption framework, at each step of the process (Strategy -> Plan -> Build/Iterate -> Run), storage is a critical element to be considered. In fact, storage is central to virtually every workload running in AWS today.
So as we begin to think about either cloud native or cloud migration strategies, think about storage as a strategic, foundational element.
Once your data is stored in the cloud, the world of AWS service offerings opens up to you.
A good architecture reduces irreversibility, and allows you to defer decisions to a later point in time while locking in key parameters, such as cost.
That is, you can anticipate how much a deferred decision will cost when you have to make it.
With our foundational storage capability we don’t want to have to make upfront, irreversible decisions about capacity or data format.
What if I have to choose capacity up front, but then exceed it? I close down certain opportunities.
What if I never use my reserved capacity? I pay for wasted space.
If I have to choose a specific format, perhaps a proprietary format, up front, I potentially exclude certain opportunities in the future.
Related to this, we don’t want to choose a foundational storage strategy that results in a “pager architecture”
We don’t want to be in a position where, in order to guarantee durability and availability of our data, we always need someone on call to replace a failed node and restore 3x redundancy.
We want something that provides for near-infinite scalability, a range of data formats, and high durability and availability guarantees.
S3 allows you to keep your data in a secure, cheap, near-infinitely scalable environment, without having to make up-front decisions about capacity or data formats. To that extent, S3 is the most architecturally significant element in your data architecture.
Storage is more than just the protocol or interface. It’s the lifeblood of application design and renewed architectures. Our customers have taught us that they need two things: scale and trust. 1. Make sure I can grow. 2. Make sure I can access what I need when I need it, (and of course help me keep costs down).
The suite of transfer services that support customers in their migrations means more choice. Large batches, incremental changes, constant streams or seamless integration are all part of the storage offering. Today we’re going to talk about two of the newest ways to do cloud data migration, Snowball and S3 Transfer Acceleration.
By convention, S3 has been at the heart of our “data lake” architecture for many years, but more and more it is being integrated with our data and analytics services:
RDS, Redshift, etc. backup to S3
Athena can query against S3 using SQL
Redshift Spectrum can join data in S3 with data in Redshift
Kinesis Firehose ingests streaming data into S3
EMR can treat S3 as near-infinite capacity, highly durable HDFS
And so on…
S3 isn’t just dumb storage: it allows you to manage data lifecycle and act on data events
S3 Standard – general purpose storage class. High durability, availability, and performance. Use it if you don’t want to think about your data access patterns.
Glacier – archival storage with 3-5 hours of retrieval time. Low cost, with pricing starting at seven-tenths of a cent.
Many AWS customers store backups or log files that are almost never read, or whose access frequency drops as the data ages, but that still need immediate access when requested.
S3 Standard-IA is a newer Amazon S3 storage class designed for colder or less frequently accessed workloads. It offers the same high performance, high throughput, and low latency as S3 Standard.
Low cost, with storage starting at one and a quarter cents per GB.
If you think about the typical lifecycle of data, newly created active data is accessed very frequently.
In our example, take a new video clip you share with your friends and family. People will consume this new data actively; the video will be played back, shared, and commented on very frequently.
As the video gets older, fewer people will engage with it, and it will be LESS FREQUENTLY accessed.
If you don’t want to think about your data access patterns but just want high durability, availability and performance from Amazon S3, you can simply select S3 Standard.
For data that is less frequently accessed, you can leverage Amazon S3 Standard-IA to save on cost while still benefiting from the same great durability and performance as S3 Standard.
At some point your data will be ready to archive, because no one is actively interacting with it and you need to archive it away for record keeping, etc.
In addition to transitioning your data to Standard-IA as its characteristics change, you can also use Standard-IA for new data that fits the bill for infrequently accessed data. For example, you can use the Standard-IA storage class to store detailed application logs that you analyze infrequently and save on storage cost.
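The Standard -> Standard-IA -> Glacier progression described above can be automated with a lifecycle rule; the following boto3 sketch uses a hypothetical bucket, prefix, and transition ages:

import boto3

s3 = boto3.client("s3")

# Age data down the storage classes as it cools: Standard for the first 30 days,
# Standard-IA until one year, then Glacier.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",                 # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "age-out-logs",
            "Filter": {"Prefix": "logs/"},  # placeholder prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }]
    },
)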
Near infinite (secure, versioned, highly durable) storage + immutable data at every stage of its lifecycle + versioned metadata = data lineage and governance
Data Catalog: A metadata store that automatically organizes the metadata for all your data assets across your business.
You can organize and search your assets.
ETL system: An engine that automatically generates ETL scripts and allows you to orchestrate, monitor, refine and manage your jobs
Securely control access to all digital resources based on users, groups, and application roles
In S3 we can control access at the bucket and even at the object level: not only who can access an object, but what they can do with it
But you can layer on additional security
Apache Ranger or Knox: a pluggable security layer for Hive that allows AD-federated access to data
Comprehensive auditing of all data access API calls via CloudTrail, which you can then analyze with Athena
We can say here that these strategies can help give additional protection against ransomware attacks
For additional security, enable MFA (multi-factor authentication) delete, which requires additional authentication to:
Change the versioning state of your bucket
Permanently delete an object version
MFA delete requires both your security credentials and a code from an approved authentication device
Protection even if you give your account credentials to the wrong person or a malicious employee
Protects against, and lets you recover from, unintended user deletes or application logic failures,
with no performance penalty.
Keeps all versions: new uploads are stored separately; on delete, the latest version is retained and a delete marker is added.
You can retrieve deleted objects or roll back to previous versions.
Three states: ** default (unversioned) – no versions saved, deleted objects cannot be retrieved; ** versioning-enabled – as discussed, saves versions of overwritten or deleted objects; ** suspended – all saved versions are maintained, but new versions are not created.
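A hedged boto3 sketch of the versioning and MFA Delete settings discussed above; the bucket name and MFA device serial/code are placeholders, and MFA Delete can only be enabled with the root account’s credentials:

import boto3

s3 = boto3.client("s3")

# Turn on versioning so overwrites and deletes keep prior object versions.
s3.put_bucket_versioning(
    Bucket="my-data-lake",   # placeholder bucket name
    VersioningConfiguration={"Status": "Enabled"},
)

# MFA Delete must be supplied as "<device-serial> <code>"; shown with placeholder values.
s3.put_bucket_versioning(
    Bucket="my-data-lake",
    MFA="arn:aws:iam::123456789012:mfa/root-device 123456",
    VersioningConfiguration={"Status": "Enabled", "MFADelete": "Enabled"},
)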
AWS offers a number of encryption options that allow you to vary the security based on where the key is stored and who has access to it.
AWS SSE
At the simplest level you can take advantage of AWS SSE. Integrated with S3 and Redshift. Encrypts automatically and transparently, which makes it very easy to use.
AWS KMS
Create and control encryption keys, rotate them
Centralized and fully managed, so you focus on encryption needs, not infrastructure
CloudHSM
Dedicated tenancy hardware security modules.
Certified infrastructure where AWS has absolutely no access to your encryption information.
Designed to destroy the keys rather than allow you access into the system.
Clustered for HA so that your keys are secure and durable.
HSM = hardware security modules
Custom encryption materials provider – use any strategy for providing encryption materials, such as integrating with existing key management systems.
One-off migrations
Batch uploads
Streaming data – whether streaming events from IoT devices, logs, clickstream providers, or change data capture events from an on-premises relational database
Transferring large files over a long distance can be challenging, whether you are moving data across continents, moving a large number of objects, or serving customers a long way from AWS regions.
Accelerate transfer to S3 using the AWS edge network. It leverages POP locations to ensure your transfers travel a shorter distance on the public internet and then travel the remaining portion over an optimized route via the Amazon backbone.
Faster or free: there is no cost for using Transfer Acceleration if the upload is not faster. If performance is the same as a normal upload, you don’t pay for it.
S3-XA uses standard TCP and HTTP so it does not require any firewall exceptions or custom software installation.
With Transfer Acceleration, a long-distance upload spends less time on the variable public internet; the transfer travels over the optimized Amazon backbone, which has much more stable connectivity than the internet. Think of it as a freeway with a performance booster, because we know the road is open.
Although the primary use case is upload/ingestion to S3, we have seen customers use it for downloads, such as sharing video on an individual basis where very few people pull the file; downloading through Transfer Acceleration is a fast path.
Customers also use Transfer Acceleration to improve their upload availability over spotty internet connections; when uploading files across regions with poor connectivity, it increases their upload availability.
Users associate performance with availability: if an upload or download takes a long time, they assume it is not working properly, and may not be patient enough and cancel. Finishing faster helps.
Download files from S3 when the files are not pulled down frequently (a single user versus many). Customers saw a benefit in pulling files from S3 as quickly as possible by leveraging Transfer Acceleration; for files that are frequently accessed, CloudFront, which caches your data at the edge, is recommended.
Time it takes to upload a 500 GB object to Singapore from various locations.
Yellow bar vs. blue bar: the greater the difference between the two bars, the more the improvement.
1/ the farther your bucket, the more benefit from moving over the AWS network.
2/ the larger the file you upload, the more benefit you’ll see.
A lot less variability across all locations.
Transform data – ETL or computationally expensive operation on dataset that emits another dataset
Analytics over streaming data – understand something about the current state of the world: end-user or system behaviours
Interactive, ad hoc, exploratory analysis over our data
Build predictive analytics that help us forecast the future and pre-emptively act on insights we generate
You can then edit these transformations, if necessary, using the tools and technologies you already know, such as Python, Spark, Git and your favorite integrated development environment (IDE), and share them with other AWS Glue users.
Work with fast moving data
Accessible language: SQL.
Manipulate data on the stream; describe filters, projections, functions.
Emit another data source as a stream -> could push to Kibana, for example.
Sliding window
for each new record that appears on the stream, we emit an output by applying aggregates on tweets in the preceding 1-minute window.
Ad hoc, interactive, exploratory analyses over TB or PB data in S3
P2 instances – GPU accelerated instances
Up to 16 GPUs per instance, nearly 40,000 CUDA cores
CUDA is NVIDIA’s GPU-accelerated parallel computing programming model
Inception: architecture for training image recognition system
16 x p2.16xlarge – scale out beyond a single instance and get near-linear scalability: over half a million parallel processing cores.
US- and China-based technology company developing autonomous driving technology.
Mission: set a new standard on safety, reliability, and efficiency in the trucking industry.