In this session, we’ll expand on the S3 re:Invent deep-dive session with a hands-on workshop on advanced S3 features and storage management capabilities. AWS S3 and Glacier experts will be on hand to dive deep into S3 architecture, performance and scalability optimization, how to analyze your content and leverage storage tiers (S3 Standard, S3 Standard - Infrequent Access, Glacier) to balance cost and SLAs, security considerations, replication with Cross-Region Replication (CRR), versioning for data protection, and more.
In the hands-on lab, we’ll walk through a customer scenario: architecting a high-performance infrastructure for consumer applications. In the scenario, we’ll use sample data sets on S3, analyze object retrieval patterns, and design a complete solution using many of the features S3 offers, including migrating objects to an appropriate tier.
Prerequisites:
- Participants should have an AWS account established and available for use during the workshop.
- Please bring your own laptop.
2. What to Expect from the Session
• How does a workshop differ from other sessions?
• S3 new features
• How we think about storage management for S3
• Storage Management Portfolio for S3
• Understand your data
• Discover your data
• Manage your data
• Pulling it all together
• Key naming schemes
• Group activity
3. How does a workshop differ from other sessions?
Learn from AWS
45 minutes of lecture
Learn from each other
Group learning activity
5. Amazon storage usage (2012–2014)
Trillions of objects
Millions of requests per second
6. Choice of storage classes on S3
• Standard: active data
• Standard - Infrequent Access: infrequently accessed data
• Amazon Glacier: archive data
7. Use cases for Standard - Infrequent Access
• File sync and share + consumer file storage
• Backup and archive + disaster recovery
• Long-retained data
8. Standard - Infrequent Access storage
Durable: designed for 11 9s of durability
Available: designed for 99.9% availability
High performance: same as Standard storage
Secure:
• Bucket policies
• AWS Identity and Access Management (IAM) policies
• Many encryption options
Integrated:
• Lifecycle management
• Versioning
• Event notifications
• Metrics
Easy to use:
• No impact on user experience
• Simple REST API
9. Standard - Infrequent Access storage
Integrated: lifecycle management
• Directly PUT to Standard - IA
• Transition Standard to Standard - IA
• Transition Standard - IA to Amazon Glacier storage
• Expiration lifecycle policy
• Versioning support
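Those transition and expiration rules can be written down as a single lifecycle configuration. A minimal sketch in Python, assuming a hypothetical bucket and rule ID; the boto3 call that applies it is shown commented out so the snippet stands alone:

```python
# A sketch of the lifecycle integration above as a boto3-style
# lifecycle configuration. Bucket name and rule ID are hypothetical.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-and-expire",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply to every object in the bucket
            "Transitions": [
                # Standard -> Standard - IA after 30 days
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                # Standard - IA -> Amazon Glacier after 90 days
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            # Expiration lifecycle policy: delete after one year
            "Expiration": {"Days": 365},
        }
    ]
}

# To apply it (requires boto3 and AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=lifecycle_config)
```

One rule can carry all three behaviors (two transitions plus expiration), which keeps the count of lifecycle rules down.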
13. S3 Inventory
Use case: trigger business workflows and applications such as secondary index garbage collection, data auditing, and offline analytics
• More information about your objects than the LIST API provides, such as replication status, multipart upload flag, and delete marker
• Save time: daily or weekly delivery of a CSV file output to an S3 bucket
14. S3 Inventory
Eventually consistent rolling snapshot
• New objects may not be listed
• Removed objects may still be included
Name | Value Type | Description
Bucket | String | Bucket name. UTF-8 encoded.
Key | String | Object key name. UTF-8 encoded.
Version Id | String | Version ID of the object
Is Latest | boolean | true if the object is the latest (current) version of a versioned object, otherwise false
Delete Marker | boolean | true if the object is a delete marker of a versioned object, otherwise false
Size | long | Object size in bytes
Last Modified | String | Last modified timestamp in ISO format: YYYY-MM-DDTHH:mm:ss.SSSZ
ETag | String | ETag in hex-encoded format
StorageClass | String | Valid values: STANDARD, REDUCED_REDUNDANCY, GLACIER, STANDARD_IA. UTF-8 encoded.
Multipart Uploaded | boolean | true if the object was uploaded using multipart upload, otherwise false
Replication Status | String | Valid values: REPLICA, COMPLETED, PENDING, FAILED. UTF-8 encoded.
Validate before you act!
• Use HEAD OBJECT
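A hedged sketch of consuming inventory rows with stdlib Python. The column order is an assumption matching the field table above (real inventory files declare their schema in a manifest), and the HEAD validation call is shown commented out:

```python
import csv
import io

# Column order is an assumption matching the field table above; real
# inventory files declare their exact schema in a manifest file.
FIELDS = ["Bucket", "Key", "VersionId", "IsLatest", "DeleteMarker",
          "Size", "LastModified", "ETag", "StorageClass",
          "MultipartUploaded", "ReplicationStatus"]

def parse_inventory(csv_text):
    """Parse inventory CSV rows into dicts keyed by field name."""
    return [dict(zip(FIELDS, row)) for row in csv.reader(io.StringIO(csv_text))]

# Hypothetical single-row sample in the assumed column order.
sample = ('"example-bucket","photos/cat.jpg","null","true","false","1024",'
          '"2016-11-30T02:01:01.000Z","d41d8cd98f00b204e9800998ecf8427e",'
          '"STANDARD_IA","false",""')
rows = parse_inventory(sample)

# The snapshot is eventually consistent, so validate before acting, e.g.:
# import boto3
# boto3.client("s3").head_object(Bucket=rows[0]["Bucket"], Key=rows[0]["Key"])
```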
15. S3 Analytics – Storage Class Analysis
Data-driven storage management for S3
• Analyze buckets, prefixes, or tags
• Storage Class Analysis & lifecycle recommendation
• Export analysis to your S3 bucket
• $0.10 per million objects analyzed
19. Monitor your storage
• Monitor and alert with Amazon CloudWatch
• Audit your storage with AWS CloudTrail
• Server access logs
20. CloudWatch metrics for S3
Operational & Performance monitoring
• Generate metrics for data of your choice
• Entire bucket, prefixes, and tags
• Up to 1,000 object groups
• 1-minute CloudWatch metrics
• Alert and alarm on metrics
• Pay for what you use
21. CloudWatch metrics for S3
Price: $0.30 per metric per month
Metric Name | Metric Value
AllRequests | Count
PutRequests | Count
GetRequests | Count
ListRequests | Count
DeleteRequests | Count
HeadRequests | Count
PostRequests | Count
BytesDownloaded | MB
BytesUploaded | MB
4xxErrors | Count
5xxErrors | Count
FirstByteLatency | ms
TotalRequestLatency | ms
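Pulling one of these metrics programmatically looks like the sketch below. The bucket name and "EntireBucket" filter ID are hypothetical; the filter ID is whatever you chose when enabling request metrics on the bucket:

```python
from datetime import datetime, timedelta

# Parameters for CloudWatch's get_metric_statistics call on an S3
# request metric. Bucket name and filter ID are hypothetical.
params = {
    "Namespace": "AWS/S3",
    "MetricName": "FirstByteLatency",
    "Dimensions": [
        {"Name": "BucketName", "Value": "example-bucket"},
        {"Name": "FilterId", "Value": "EntireBucket"},
    ],
    "StartTime": datetime.utcnow() - timedelta(hours=1),
    "EndTime": datetime.utcnow(),
    "Period": 60,  # 1-minute metrics, as noted above
    "Statistics": ["Average"],
}

# To fetch the data points (requires boto3 and credentials):
# import boto3
# boto3.client("cloudwatch").get_metric_statistics(**params)
```

From there you can wire the same metric into a CloudWatch alarm to alert on latency or error-count thresholds.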
22. CloudTrail data events for S3
Use case: perform security analysis, meet your IT auditing and compliance needs
API logs for bucket- and object-level requests:
• Creation/deletion of buckets
• Changes to bucket configuration (bucket policy, lifecycle policies, replication policies, etc.)
• SNS notification for log file delivery (optional)
24. Manage your data
S3 Object Tags: manage storage based on object tags
• Classify your data
• Tag your objects with key-value pairs
• Write policies once based on the type of data
Works with access control, lifecycle policies, and analysis
25. Deep dive on tags
• Tags are key-value pairs
• Maximum 10 tags per object
• Maximum key length: 127 Unicode characters
• Maximum value length: 255 Unicode characters
• Tag keys and values are case sensitive
Two ways to set tags via the API:
• PUT objects with the tagging parameter, or
• use the object-tagging API after the object is created
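The two tagging paths above can be sketched as follows; the tag keys, values, bucket, and object key are all illustrative:

```python
from urllib.parse import urlencode

# Tags to classify an object; keys and values are illustrative.
tags = {"HIPAA": "True", "department": "research"}

# Way 1: tag at write time. PutObject accepts the tag set as a
# URL-encoded query string in its Tagging parameter.
tagging_param = urlencode(tags)  # "HIPAA=True&department=research"

# Way 2: tag after the object is created. PutObjectTagging takes a
# structured TagSet instead.
tag_set = {"TagSet": [{"Key": k, "Value": v} for k, v in tags.items()]}

# With boto3 (requires credentials):
# import boto3
# s3 = boto3.client("s3")
# s3.put_object(Bucket="example-bucket", Key="report.csv",
#               Body=b"...", Tagging=tagging_param)
# s3.put_object_tagging(Bucket="example-bucket", Key="report.csv",
#                       Tagging=tag_set)
```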
26. What can I do with tags?
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": "arn:aws:s3:::EXAMPLE-BUCKET-NAME/*"
"Condition": {"StringEquals": {"S3:ResourceTag/HIPAA":"True"}}
}
]
}
Manage permissions with tags
31. Getting high-throughput performance with S3
• S3 can scale to many thousands of requests per second
• You need a good key naming scheme
• Only at scale do you need to consider your key naming scheme
• What are partitions? Why do they matter?
• Spread keys lexicographically
• The goal of partitioning is to spread the heat
• Prevent hot spots
33. Distributing key names
Add randomness to the beginning of the key name…
my-bucket/6213-2013_11_13.jpg
my-bucket/4653-2013_11_13.jpg
my-bucket/9873-2013_11_13.jpg
my-bucket/4657-2013_11_13.jpg
my-bucket/1256-2013_11_13.jpg
my-bucket/8345-2013_11_13.jpg
my-bucket/0321-2013_11_13.jpg
my-bucket/5654-2013_11_13.jpg
my-bucket/2345-2013_11_13.jpg
my-bucket/7567-2013_11_13.jpg
my-bucket/3455-2013_11_13.jpg
my-bucket/4313-2013_11_13.jpg
Partitions:
my-bucket/0
my-bucket/1
my-bucket/2
my-bucket/3
my-bucket/4
my-bucket/5
my-bucket/6
my-bucket/7
my-bucket/8
my-bucket/9
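A minimal sketch of generating such randomized keys, assuming a 4-digit decimal prefix as in the examples above:

```python
import random

def randomized_key(date_str, ext="jpg"):
    """Build a key like the examples above: a random 4-digit prefix
    spreads writes across partitions instead of piling onto one."""
    prefix = f"{random.randrange(10000):04d}"
    return f"my-bucket/{prefix}-{date_str}.{ext}"
```

The trade-off is that related objects no longer sort together, which complicates listing; slide 36 returns to that.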
34. Monotonically Increasing Customer ID
mycustdata/2134857/app_data_1/2016-11-30-02:01:01:24/log.txt
mycustdata/2134857/app_data_1/2016-11-30-02:01:01:32/wrk_user
mycustdata/2134858/app_data_1/2016-11-30-02:01:01:29/product_usage.csv
mycustdata/2134858/app_data_1/2016-11-30-02:01:01:24/log.txt
mycustdata/2134858/app_data_1/2016-11-30-02:01:01:14/wrk_user
mycustdata/2134859/app_data_1/2016-11-30-02:01:01:28/product_usage.csv
mycustdata/2134859/app_data_1/2016-11-30-02:01:01:45/log.txt
mycustdata/2134859/app_data_1/2016-11-30-02:01:01:34/wrk_user
mycustdata/7584312/app_data_1/2016-11-30-02:01:01:23/product_usage.csv
mycustdata/7584312/app_data_1/2016-11-30-02:01:01:24/log.txt
mycustdata/7584312/app_data_1/2016-11-30-02:01:01:32/wrk_user
mycustdata/8584312/app_data_1/2016-11-30-02:01:01:29/product_usage.csv
mycustdata/8584312/app_data_1/2016-11-30-02:01:01:24/log.txt
mycustdata/8584312/app_data_1/2016-11-30-02:01:01:14/wrk_user
mycustdata/9584312/app_data_1/2016-11-30-02:01:01:28/product_usage.csv
mycustdata/9584312/app_data_1/2016-11-30-02:01:01:45/log.txt
mycustdata/9584312/app_data_1/2016-11-30-02:01:01:34/wrk_user
Partition (monotonically increasing IDs concentrate on one partition):
mycustdata/213485
Partitions (with the reversed prefix):
mycustdata/7
mycustdata/8
mycustdata/9
Reversing the monotonically increasing prefix spreads the keys, but if a single customer can push a higher workload, they can still cause a hot spot.
35. Add a Hash to the Beginning of the Key – Best
mycustdata/2134857/app_data_1/2016-11-30-02:01:01:24/log.txt
mycustdata/2134857/app_data_1/2016-11-30-02:01:01:32/wrk_user
mycustdata/2134858/app_data_1/2016-11-30-02:01:01:29/product_usage.csv
mycustdata/2134858/app_data_1/2016-11-30-02:01:01:24/log.txt
mycustdata/2134858/app_data_1/2016-11-30-02:01:01:14/wrk_user
mycustdata/2134859/app_data_1/2016-11-30-02:01:01:28/product_usage.csv
mycustdata/2134859/app_data_1/2016-11-30-02:01:01:45/log.txt
mycustdata/2134859/app_data_1/2016-11-30-02:01:01:34/wrk_user
mycustdata/1a/2134857/app_data_1/2016-11-30-02:01:01:24/log.txt
mycustdata/34/2134857/app_data_1/2016-11-30-02:01:01:32/wrk_user
mycustdata/a7/2134858/app_data_1/2016-11-30-02:01:01:29/product_usage.csv
mycustdata/58/2134858/app_data_1/2016-11-30-02:01:01:24/log.txt
mycustdata/70/2134858/app_data_1/2016-11-30-02:01:01:14/wrk_user
mycustdata/02/2134859/app_data_1/2016-11-30-02:01:01:28/product_usage.csv
mycustdata/2b/2134859/app_data_1/2016-11-30-02:01:01:45/log.txt
mycustdata/63/2134859/app_data_1/2016-11-30-02:01:01:34/wrk_user
Partition (before hashing):
mycustdata/213485
Partitions (a hash prefix evenly distributes the keys for all requests):
mycustdata/0
mycustdata/1
mycustdata/2
mycustdata/3
mycustdata/4
mycustdata/5
mycustdata/6
mycustdata/7
mycustdata/8
mycustdata/9
mycustdata/a
mycustdata/b
mycustdata/c
mycustdata/d
mycustdata/e
mycustdata/f
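A sketch of the hashed-prefix scheme; MD5 and the two-hex-character prefix width are assumptions (any stable hash works, and two hex characters allow up to 256 partitions):

```python
import hashlib

def hashed_key(customer_id, rest):
    """Derive a short, stable hash prefix from the customer ID so keys
    spread evenly across partitions. Unlike a random prefix, the same
    customer always maps to the same prefix, so keys stay findable."""
    h = hashlib.md5(str(customer_id).encode()).hexdigest()[:2]
    return f"mycustdata/{h}/{customer_id}/{rest}"
```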
36. Challenges of using a hash to create entropy
• Listing challenges/opportunities:
• A secondary index can be used to avoid listing
• This can be accomplished with Event Notifications to AWS Lambda and Amazon DynamoDB
• Blog post: Building and Maintaining an Amazon S3 Metadata Index without Servers
• The hash can be used to split the work of LISTing objects
• Lifecycle constraints:
• Maximum number of lifecycle rules: 1,000
• Tagging can make this easier
37. Faster upload of large objects
Best practice: parallelize PUTs with multipart uploads
• Increase aggregate throughput by parallelizing PUTs on high-bandwidth networks
• Move the bottleneck to the network, where it belongs
• Increase resiliency to network errors; fewer large restarts on error-prone networks
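The slicing behind a multipart upload can be sketched as a pure function; in practice a managed transfer layer (for example, boto3's upload_file) does this for you, and the 8 MiB default part size here is an assumption:

```python
def part_ranges(size, part_size=8 * 1024 * 1024):
    """Split an object of `size` bytes into (start, end) byte ranges,
    one per upload part, so the parts can be PUT in parallel. S3 parts
    must be at least 5 MiB (except the last), with at most 10,000
    parts per upload."""
    ranges = []
    start = 0
    while start < size:
        end = min(start + part_size, size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges
```

Each range becomes one UploadPart request, and failed parts can be retried individually instead of restarting the whole object.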
38. Faster download
You can parallelize GETs too
For large objects, use range-based GETs
For content distribution, enable Amazon CloudFront
• Caches objects at the edge
• 59 global edge locations
GET /example-object HTTP/1.1
Host: example-bucket.s3.amazonaws.com
x-amz-date: Fri, 28 Jan 2011 21:32:02 GMT
Range: bytes=0-9
Authorization: AWS AKIAIOSFODNN7EXAMPLE:Yxg83MZaEgh3OZ3l0rLo5RTX11o=
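Parallel range-based GETs boil down to computing one Range header per worker, in the same bytes=start-end form shown in the request above. A minimal sketch:

```python
def range_headers(size, workers=4):
    """Split `size` bytes into one HTTP Range header value per worker
    so a large object can be fetched in parallel."""
    chunk = -(-size // workers)  # ceiling division
    headers = []
    start = 0
    while start < size:
        end = min(start + chunk, size) - 1
        headers.append(f"bytes={start}-{end}")
        start = end + 1
    return headers
```

Each worker issues its own GET with one of these Range values, then the pieces are reassembled in order.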
43. Ring Neighborhoods: Network Effects in Practice
Wilshire Park study with the LAPD: Ring installed on 10% of homes; burglaries down 55% for the entire community in 6 months
• Burglars want an easy hit, and go elsewhere if you’re home
• Alarms are reactive, not proactive
• Traditional systems don’t link up, so protection ends at your door
44. Devices installed in nearly every country on Earth
• Millions of connected apps and devices
• Over 1 billion videos, and rapidly increasing
• High growth brings challenges, even month to month
[Chart: Ring Urban Activity Index, 2016-10-20, USA only, rural areas low-cut]
47. Ring Requirements
• Live video is ingested from devices and apps via our application servers
• Videos are uploaded to our S3 buckets
• The videos are transcoded and made available for customers to stream
• Customers need low latency when streaming video around the world
• Customers get a 30-day free trial of video backups
• If they decide to continue storing videos, they can keep them for up to 6 months after the activity
• When users share videos, we expect them to be watched a lot, and sometimes they go viral
48. Present Your Design
• How did you address the use case?
• What was your key naming scheme?
• How did you address scale?
• How did you manage object metadata?
• Did you minimize cost?
• How do you monitor your requests?
• How did you address security considerations?
49. Ring Video Pipeline
[Architecture diagram. Components: Ring device(s) and app(s), live video, application servers, Raw buckets, event triggers, AWS Lambda, Amazon SQS, GPU farm, Final (Standard) and Final (IA) buckets connected by lifecycle transitions, Amazon CloudFront, S3 logs, and viewers (owners and a visitor).]
50. Extreme Performance is Easy
• S3 will automatically partition if you use good keys - or just add more buckets
• CloudFront as a CDN for GET-heavy loads and faster downloads
• Faster uploads with Transfer Acceleration
• TCP window scaling - without it, a 64 KB window kneecaps long fat networks
• TCP SACK is good for fast but lossy connections, like mobile connections
examplebucket/2134857/data/start.png
examplebucket/2134857/data/resource.rsrc
examplebucket/2134857/data/results.txt
examplebucket/2134858/data/start.png
examplebucket/2134858/data/resource.rsrc
examplebucket/2134858/data/results.txt
examplebucket/2134859/data/start.png
examplebucket/2134859/data/resource.rsrc
examplebucket/2134859/data/results.txt
examplebucket/7584312/data/start.png
examplebucket/7584312/data/resource.rsrc
examplebucket/7584312/data/results.txt
examplebucket/8584312/data/start.png
examplebucket/8584312/data/resource.rsrc
examplebucket/8584312/data/results.txt
examplebucket/9584312/data/start.png
examplebucket/9584312/data/resource.rsrc
examplebucket/9584312/data/results.txt