In this webinar, AWS Technical Evangelist Ian Massingham discusses the role that AWS services can play in helping you derive value from your data: stream processing with Amazon Kinesis, techniques for managing the ingest of large data sets, processing data with Amazon Elastic MapReduce (EMR) and its ecosystem of tools, and running large-scale data warehouses on AWS with Amazon Redshift.
View the recording: http://youtu.be/7bkqopn19WY
Journey Through the AWS Cloud - Big Data Analysis
1. Journey through the Cloud:
Big Data
Ian Massingham – Technical Evangelist
@IanMmmm
2. Journey through the cloud
Common use cases & stepping stones into the AWS cloud
Learning from customer journeys
Best practices to bootstrap your projects
3. Big Data on AWS
Collect and store Big Data in the AWS Cloud
Meet the challenge of the increasing volume, variety, and velocity of data
Reduce costs, scale to meet demand, and increase the speed of innovation
Make use of solutions for every stage of the big data lifecycle
4. Agenda
Collecting Big Data in the AWS Cloud
Real-time Streaming and Analysis
Big Data Cloud Storage Solutions
Analytics with Hadoop with Amazon EMR
Case Studies & Useful Resources
21. AWS Import / Export
AWS Direct Connect
AWS Storage Gateway
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE
22. Getting data into the AWS Cloud
Direct Connect, Import/Export and Storage Gateway
AWS Direct Connect
Dedicated bandwidth between your site and AWS
AWS Storage Gateway
Gateway-Stored Volumes
Shrink-wrapped gateway for volume synchronization
AWS Import/Export
Physical transfer of media into and out of AWS
23. Inbound data transfer is free
Multipart upload to S3
Physical media via AWS Import/Export
AWS Direct Connect
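Multipart upload splits a large object into parts that are uploaded independently and then assembled by S3. The sketch below only plans the byte ranges and makes no network calls. The 5 MiB minimum part size and the 10,000-part maximum are S3's documented limits; `plan_parts` and the 64 MiB default part size are illustrative assumptions, not part of any AWS SDK.

```python
import math

MIN_PART_SIZE = 5 * 1024 * 1024   # S3 minimum for every part except the last
MAX_PARTS = 10_000                # S3 maximum number of parts per upload


def plan_parts(object_size, part_size=64 * 1024 * 1024):
    """Return (part_number, start_byte, end_byte_inclusive) tuples.

    Hypothetical helper: it raises part_size when needed so that the
    object fits within MAX_PARTS.
    """
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    part_size = max(part_size, MIN_PART_SIZE)
    # Grow the part size if the object would otherwise need too many parts.
    part_size = max(part_size, math.ceil(object_size / MAX_PARTS))

    parts = []
    start = 0
    part_number = 1  # S3 part numbers start at 1
    while start < object_size:
        end = min(start + part_size, object_size) - 1
        parts.append((part_number, start, end))
        start = end + 1
        part_number += 1
    return parts


if __name__ == "__main__":
    for number, start, end in plan_parts(200 * 1024 * 1024):
        print(f"part {number}: bytes {start}-{end}")
```

Each planned range could then be sent as one part, in parallel or retried on its own, which is what makes multipart upload useful for large inbound transfers.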
28. Hourly server logs: how your systems went wrong an hour ago
Weekly / Monthly Bill: what you spent this past billing cycle
Daily customer report from your website: tells you what deal or ad to try next time
Daily fraud reports: tells you if there was fraud yesterday
Daily business reports: tells me how customers used my services yesterday
Real-time metrics: what just went wrong now
Real-time spending alerts/caps: guaranteeing you can’t overspend
Real-time analysis: what to offer the current customer now
Real-time detection: blocks fraudulent use now
Fast ETL into Amazon Redshift: how are customers using my services now
30. Amazon Kinesis Architecture
Millions of sources producing 100s of TB per hour
Front End: Authentication, Authorization
AZ, AZ, AZ
Durable, highly consistent storage replicates data across three centers (availability zones)
Amazon Web Services
Inexpensive: $0.028 per million puts
Aggregate and archive to S3
Real-time dashboards and alarms
Machine learning algorithms
Aggregate analysis in Hadoop or a data warehouse
Ordered stream of events supporting multiple readers
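To illustrate what "an ordered stream of events supporting multiple readers" means, here is a minimal in-memory sketch. It is not the Kinesis API: `ToyStream`, `put`, `read` and `put_cost_usd` are hypothetical names, and the only figure taken from the slide is the quoted price of $0.028 per million puts.

```python
from fractions import Fraction

# Price quoted on the slide; other charges are deliberately ignored here.
PRICE_PER_MILLION_PUTS = Fraction(28, 1000)


class ToyStream:
    """In-memory stand-in for one ordered stream (not the Kinesis API)."""

    def __init__(self):
        self._records = []     # append-only, so put order is preserved
        self._positions = {}   # reader name -> index of next record to read

    def put(self, data):
        """Append a record and return its sequence number."""
        self._records.append(data)
        return len(self._records) - 1

    def read(self, reader, limit=10):
        """Return up to `limit` unread records for `reader`, in put order."""
        start = self._positions.get(reader, 0)
        batch = self._records[start:start + limit]
        self._positions[reader] = start + len(batch)
        return batch


def put_cost_usd(number_of_puts):
    """Cost of the PUT operations alone at the slide's quoted price."""
    return float(PRICE_PER_MILLION_PUTS * number_of_puts / 1_000_000)


if __name__ == "__main__":
    stream = ToyStream()
    for event in ["click", "view", "purchase"]:
        stream.put(event)
    print(stream.read("dashboard", limit=2))  # dashboard reads at its own pace
    print(stream.read("archiver"))            # archiver sees the same events
    print(put_cost_usd(1_000_000_000))        # one billion puts
```

Because each reader keeps its own position, a dashboard, an archiver and a machine learning job can all consume the same events independently and in the same order.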
32. Fundamental Storage Options
Elastic Block Store, S3 and Glacier
Simple Storage Service
Highly scalable object storage
1 byte to 5TB in size
99.999999999% durability
Elastic Block Store
High performance block storage device
1GB to 1TB in size
Mount as drives to instances with snapshot/cloning functionalities
Glacier
Long term object archive
Extremely low cost per gigabyte
99.999999999% durability
33. Fundamental Storage Options
Elastic Block Store, S3 and Glacier
Elastic Block Store: very fast ‘instance’ disks
Simple Storage Service: fast web object storage
Glacier: slow, rare access
34. Fundamental Storage Options
Elastic Block Store, S3 and Glacier
Amazon S3: Simple Storage Service
Highly scalable data storage in the cloud
Programmatic access via web services API
Is a web store, not a file system
Fast, highly available, durable, economical
Paradigm: Object store
Performance: Very fast
Redundancy: Across data centers
Security: Public Key / Private Key
Pricing: $0.03/GB/month stored (eu-west-1)
Access from the Net?: Yes
Typical use case: Write once, read many
Object size: 1 byte to 5TB
Durability: 99.999999999%
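The durability and pricing figures on these slides can be turned into quick estimates. The sketch below assumes the 99.999999999% figure is an annual, per-object design target and uses the $0.03/GB/month price quoted for eu-west-1; the function names are illustrative, and request and transfer charges are ignored.

```python
from fractions import Fraction

# Figures taken from the slides; exact fractions avoid float rounding noise.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)  # 99.999999999%
PRICE_PER_GB_MONTH = Fraction(3, 100)                   # $0.03 (eu-west-1)


def expected_objects_lost_per_year(object_count):
    """Expected annual losses, assuming an annual per-object durability."""
    return float(object_count * (1 - DURABILITY))


def monthly_storage_cost_usd(gigabytes):
    """Storage charge only, at the slide's quoted price."""
    return float(gigabytes * PRICE_PER_GB_MONTH)


if __name__ == "__main__":
    print(expected_objects_lost_per_year(10_000_000))
    print(monthly_storage_cost_usd(1000))
```

Under these assumptions, storing ten million objects gives an expected loss of 0.0001 objects per year, or about one object every 10,000 years, and 1,000 GB costs $30 per month to store.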
40. Lots of actions by John Smith
Very large click log (e.g. TBs)
41. Split the log into many small pieces
42. Process in an EMR cluster
43. Aggregate the results from all the nodes
44. What John Smith did
45. Insight in a fraction of the time
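The split, process and aggregate steps can be sketched on a single machine as follows. This is only an illustration of the pattern, not EMR code: the chunks are processed one after another rather than on separate nodes, and the tab-separated `user<TAB>action` log format and all function names are assumptions.

```python
from collections import Counter


def split_log(lines, chunk_size):
    """Split the log into many small pieces."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def count_actions(chunk):
    """Process one piece: count actions per user (assumed 'user<TAB>action')."""
    counts = Counter()
    for line in chunk:
        user, _action = line.split("\t", 1)
        counts[user] += 1
    return counts


def aggregate(partial_counts):
    """Aggregate the partial results from all the pieces."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


if __name__ == "__main__":
    # Invented sample log, for illustration only.
    log = [
        "john.smith\tclick",
        "jane.doe\tview",
        "john.smith\tpurchase",
        "john.smith\tclick",
        "jane.doe\tclick",
    ]
    partials = [count_actions(chunk) for chunk in split_log(log, chunk_size=2)]
    print(aggregate(partials)["john.smith"])
```

In a cluster, each call to `count_actions` would run on a different node, which is where the time saving comes from.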
46. Analytics Data management languages/engines
HDFS
Pig
Amazon Redshift
AWS Data Pipeline
Amazon Kinesis
Amazon S3
Amazon DynamoDB
Amazon RDS
Amazon EMR
Data Sources
47. ODBC and JDBC drivers now for Amazon EMR
Amazon EMR now ships with ODBC and JDBC drivers for Hive, Impala & HBase
Now easier to use popular BI tools like:
Microsoft Excel, Tableau, MicroStrategy, and QlikView
52. Foursquare…
33 million users
1.3 million businesses
…generates a lot of Data
3.5 billion check-ins
15M+ venues
Terabytes of log data
53. Uses EMR for
Evaluation of new features
Machine learning
Exploratory analysis
Daily customer usage reporting
Long-term trend analysis
54. Benefits of Amazon EMR
Ease-of-Use
“We have decreased the processing time for urgent data-analysis”
Flexibility
To deal with changing requirements & dynamically expand reporting clusters
Costs
“We have reduced our analytics costs by over 50%”
55. When do people go to a place?
[Figure: visits by day, Thursday to Sunday, for Gorilla Coffee, Gray's Papaya and Amorino]
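A question like "when do people go to a place?" comes down to grouping check-ins by venue and day of the week. The sketch below shows that aggregation in plain Python; the `(venue, timestamp)` record shape, the function name and the sample timestamps are all invented for illustration and are not Foursquare data.

```python
from collections import defaultdict
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def visits_by_weekday(checkins):
    """Count check-ins per venue and weekday.

    `checkins` is an iterable of (venue, ISO-8601 timestamp) pairs,
    an assumed record shape for this sketch.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for venue, timestamp in checkins:
        day = DAYS[datetime.fromisoformat(timestamp).weekday()]
        counts[venue][day] += 1
    return {venue: dict(days) for venue, days in counts.items()}


if __name__ == "__main__":
    # Made-up sample records.
    sample = [
        ("Gorilla Coffee", "2014-09-05T08:30:00"),
        ("Gorilla Coffee", "2014-09-05T09:10:00"),
        ("Amorino", "2014-09-06T21:00:00"),
    ]
    print(visits_by_weekday(sample))
```

At terabyte scale the same grouping would typically be expressed as a Hive or Pig job on EMR rather than an in-memory loop.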
68. AWS Big Data Training
Fundamentals Course
http://aws.amazon.com/training/course-descriptions/bigdata-fundamentals/
This is a free, online training course intended for individuals who are new to big data concepts, including solutions architects, data scientists, and data analysts.
Big Data on AWS
http://aws.amazon.com/training/course-descriptions/bigdata/
The instructor-led Big Data on AWS course introduces you to cloud-based big data solutions and Amazon Elastic MapReduce (EMR), the AWS big data platform. In this course, we will show you how to use Amazon EMR to process data using the broad ecosystem of Hadoop tools like Pig and Hive.
69. Webinar Resources
Amazon Elastic MapReduce Masterclass
https://www.brighttalk.com/webcast/9019/103985?channel=CP
Amazon Redshift Masterclass
Coming up on September 23
Register here: http://aws.amazon.com/webinars/emea-masterclass/
70. AWS Training & Certification
Certification
Demonstrate your skills, knowledge, and expertise with the AWS platform
aws.amazon.com/certification
Self-Paced Labs
Try products, gain new skills, and get hands-on practice working with AWS technologies
aws.amazon.com/training/self-paced-labs
Training
Skill up and gain confidence to design, develop, deploy and manage your applications on AWS
aws.amazon.com/training