6. Big Data: Unconstrained Growth
[Chart: data volume growth from gigabytes through terabytes, petabytes, and exabytes to zettabytes]
• Unstructured data growth is explosive
• 95% of the 1.2 zettabytes of data in the digital universe is unstructured
• Logs, machine data, and IoT will only steepen the curve
• 70% of this data is user-generated content
• Video resolution is always increasing: 1080p, 4K, 8K
Source: IDC, The Internet of Things: Getting Ready to Embrace Its Impact on the Digital Economy, March 2016.
7. Key Insight: Most Data Falls on the Floor
[Chart: generated data vs. data available for analysis, 1990–2020]
• 90% of the data in a company is never analyzed
• High costs and complexity of traditional DW systems make it hard to justify the capital expense
Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares.
8. Data is a strategic asset for every organization
"The world's most valuable resource is no longer oil, but data."*
*Copyright: The Economist, 2017, David Parkins
13. Example Data Center: Where Do We Put All of This on AWS?
[Diagram: web servers, app servers, DB (master) and DB (slave), backups on tapes, SAN, NAS file server, file system disks, LDAP server]
14. Example Data Center: Where Do We Put All of This on AWS?
[Diagram: Elastic Load Balancing, web servers, app servers, Amazon Elastic File System, Amazon Elastic Block Store, Amazon RDS (Master) and Amazon RDS (Standby), backups to Amazon S3 or Glacier, AWS Directory Service]
62. Choice of storage classes on Amazon S3
• S3 Standard: active data, millisecond access, $0.021/GB/mo
• S3 Standard – Infrequent Access: infrequently accessed data, millisecond access, $0.0125/GB/mo
• Amazon Glacier: archive data, minutes-to-hours access, $0.004/GB/mo
64. AWS offers the most ways to move data to the cloud
• AWS Direct Connect – A private connection between your data center, office, or colocation environment and AWS
• AWS Snow family (Snowball, Snowball Edge, Snowmobile) – Secure, physical transport appliances that move up to exabytes of data into and out of AWS
• AWS Storage Gateways – Hybrid storage that seamlessly connects on-premises applications to AWS storage; ideal for backup, DR, bursting, tiering, or migration
• Amazon Kinesis Firehose – Capture, transform, and load streaming data into S3 for use with Amazon business intelligence and analytics tools
• Amazon EFS File Sync – Up to 5x faster file transfers than open source tools; ideal for migrating data into EFS or moving between cloud file systems
• Amazon S3 Transfer Acceleration – Up to 300% faster transfers into and out of S3; ideal when working across long geographic distances
• APN competency partners – Integrations between third-party vendors and AWS services; ideal for leveraging existing software licenses and skills
(Categories: networks, shipping, hybrid)
65. Storage Gateway: Enterprise Backup
[Diagram: two backup paths over VPN/Internet to Amazon S3, Amazon S3-IA, and Amazon Glacier. Path 1: application servers back up through a media server and a Storage Gateway with local disk. Path 2: application servers back up through a media server with a cloud connector/native S3 integration and local disk.]
66. Which On-Premises Backup Software? All of them!
(via the AWS Storage Gateway VTL or native S3 integration)
67. Enterprise Backup: Direct Connect
[Diagram: the same two backup paths as slide 65 (Storage Gateway, and media server with cloud connector/native integration), but connected to Amazon S3, Amazon S3-IA, and Amazon Glacier over AWS Direct Connect with VPN, using a 1 Gbps or 10 Gbps dedicated link]
68. Amazon S3 Transfer Acceleration
[Chart: time in hours for a 500 GB upload to a bucket in Singapore from clients in Rio de Janeiro, Warsaw, New York, Atlanta, Madrid, Virginia, Melbourne, Paris, Los Angeles, Seattle, Tokyo, and Singapore, comparing the public Internet with accelerated transfer]
Up to 300% faster; 171% faster on average
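To make this concrete, here is a minimal boto3 sketch of enabling Transfer Acceleration on a bucket and uploading through the accelerated endpoint; the bucket name and file path are hypothetical placeholders, not values from the deck.

```python
# Minimal sketch: enable S3 Transfer Acceleration and upload via the
# accelerated (edge) endpoint. Bucket name and file path are placeholders.
import boto3
from botocore.config import Config

s3 = boto3.client("s3")

# One-time: turn on Transfer Acceleration for the bucket.
s3.put_bucket_accelerate_configuration(
    Bucket="my-singapore-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Create a client that routes requests through the accelerated endpoint.
s3_accel = boto3.client(
    "s3", config=Config(s3={"use_accelerate_endpoint": True})
)

# Uploads now travel over the AWS edge network rather than the public Internet.
s3_accel.upload_file("large-dataset.tar", "my-singapore-bucket",
                     "uploads/large-dataset.tar")
```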
70. What is Snowball? Petabyte-scale data transport
• 50 TB or 80 TB capacity
• 10G network interface
• All data encrypted end-to-end
• Ruggedized case ("8.5G impact"), rain and dust resistant
• Tamper-resistant case and electronics
• E-ink shipping label
71. How fast is Snowball?
• Less than 1 day to transfer 50 TB via a 10G connection with Snowball; less than 1 week including shipping
• Number of days to transfer 50 TB via the Internet at typical utilizations:

              Internet Connection Speed
Utilization   1 Gbps   500 Mbps   300 Mbps   150 Mbps
25%           19       38         63         126
50%           9        19         32         63
75%           6        13         21         42
72. How fast is Snowball?
• Less than 1 day to transfer 250 TB via 5x 10G connections with 5 Snowballs; less than 1 week including shipping
• Number of days to transfer 250 TB via the Internet at typical utilizations:

              Internet Connection Speed
Utilization   1 Gbps   500 Mbps   300 Mbps   150 Mbps
25%           95       190        316        632
50%           47       95         158        316
75%           32       63         105        211
73. AWS Snow* Family
• Snowball – Petabyte-scale data migration
• Snowball Edge – Snowball with Lambda inside
• Snowmobile – Exabyte-scale data migration
80. S3 for Big Data
• Scalability & elasticity
  • Resize a running cluster based on how much work needs to be done
• Durability and availability
  • Fault tolerant for slave node failures (HDFS)
  • Back up to S3 for resilience against master node failures
• Standard interfaces
  • Hive, Pig, Spark, HBase, Impala, Hunk, Presto, and other popular tools
[Diagram: multiple Amazon EMR clusters reading from and writing to Amazon S3]
81. Big Data is about large numbers of files
[Diagram: stored logs structure in Amazon S3 and a raw log data sample with columns Order_ID, Customer_ID, Order_date, Total]
82. AWS EMR Environment: Hadoop, Spark, et al.
[Diagram: an EMR cluster with a master instance group, a core instance group (HDFS), and a task instance group, backed by Amazon S3]
• Core instances manage data (HDFS) and tasks; they can be added and removed
• Task instances (optional) are added or subtracted in response to work
• Amazon S3 as primary storage for terabytes of files
84. Fraud Detection
FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion
trading events per day and securely store over 5 petabytes of data,
attaining savings of $10-20mm per year.
85. NASDAQ
• Lists 3,600 global companies worth $9.6 trillion in market cap, representing diverse industries and many of the world's most well-known and innovative brands
• More than U.S. $1 trillion in notional value is tied to our library of more than 41,000 global indexes
• NASDAQ technology is used to power more than 100 marketplaces in 50 countries
• Our global platform can handle more than 1 million messages/second at sub-40 microsecond average speeds
• We own and operate 26 markets, including 1 clearinghouse and 5 central securities depositories, across asset classes and geographies
91. Fill out the satisfaction survey and receive US$30.00 of credit in our console
https://amazonmr.au1.qualtrics.com/jfe/form/SV_40Ex9lGFKy2BifP
Editor's notes
Here’s what we do know about all Big Data.
Due to the convergence of many technologies (cloud, mobile, social) and advancements in many fields such as genomics, life sciences, and space, the size of the digital universe is growing at an ever-increasing rate.
Customers have also found tremendous value in being able to mine this data to develop better medicines, tailor purchasing recommendations, detect fraudulent financial transactions in real time, provide on-demand digital content such as movies and songs, and predict the weather; the list goes on and on.
For on-premises storage solutions, a lot of information is on tape and not easily available at scale.
Timing: 10 seconds
So, seven years later, the world agrees that data matters. In fact, it's the most important asset for a company.
This thought has gone mainstream with The Economist saying it too.
Most organizations tell us they’ve concluded that it costs more to delete things than to store them.
- (We are always accumulating things)
Too hard to separate the good from the bad
Might lose something important
No tools to do this easily (but there are lots and lots of tools)
They also tell us that cloud storage is intriguing because it offers ways to make their stored data more useful.
Easy to scale, usually simpler than building it on your own
Easy to apply as the foundation of new development
Sometimes tricky to apply
Cloud storage is a solution: unlimited storage in a very cost effective way.
To meet the requirements of this wide variety of use cases and others, AWS offers a storage platform with different types of storage suited for different needs. These include…
However, whether you are building a new application in the cloud or moving an existing workload, how do you get that data into AWS? Today we will discuss eight options for data migration to the cloud, ranging from network-based services like the Internet/VPN, S3 Transfer Acceleration, Amazon CloudFront, and AWS Direct Connect. We will then review additional data migration options, including Amazon Kinesis Firehose, Storage Gateway, AWS Snowball, and solutions provided by AWS technology partners.
A traditional on-premises or data center–based infrastructure might include a setup like this. Here we'll walk you through just one example of how an arrangement like this could be set up and run on AWS instead.
What happens when you turn this data center infrastructure into an AWS infrastructure?
Servers, such as these web servers and app servers, are replaced with Amazon EC2 instances that run all of the same software. Because Amazon EC2 instances can run a variety of Windows Server, Red Hat, SUSE, Ubuntu, or our own Amazon Linux operating systems, virtually all server applications can be run on Amazon EC2 instances.
The LDAP server is replaced with AWS Directory Service, which supports LDAP authentication and allows you to easily set up and run Microsoft Active Directory in the cloud or connect your AWS resources with existing on-premises Microsoft Active Directory.
Software-based load balancers are replaced with Elastic Load Balancing load balancers. Elastic Load Balancing is a fully managed load balancing solution that scales automatically as needed and can perform health checks on attached resources, thus redistributing load away from unhealthy resources as necessary.
SAN solutions can be replaced with Amazon Elastic Block Store (EBS) volumes. These volumes can be attached to the application servers to store data long-term and share the data between instances.
Amazon Elastic File System (EFS), currently available via preview, could be used to replace your NAS file server. Amazon EFS is a file storage service for Amazon EC2 instances with a simple interface that allows you to create and configure file systems. It also grows and shrinks your storage automatically as you add and remove files, so you are always using exactly the amount of storage you need. Another solution could be to run an NAS solution on an Amazon EC2 instance. Many NAS solutions are available via the AWS Marketplace at https://aws.amazon.com/marketplace/.
Databases can be replaced with Amazon Relational Database Service (RDS), which lets you run Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle, and Microsoft SQL Server on a managed AWS-based platform. Amazon RDS offers master, read replica, and standby instances.
Finally, Amazon RDS instances can be automatically backed up to Amazon S3, thus replacing the need for on-premises database backup hardware.
Each storage option has a unique combination of performance, durability, cost, and interface
What is EBS? Create a volume, attach it to an EC2 instance. That's it.
EBS volumes are bound to an AZ. Restriction number 1.
If I lose a server, the EBS volume can remain. There is a property of each volume where you decide if it must remain or be deleted when the EC2 instance goes away.
To access that EBS data again, you can create another EC2 instance and attach that EBS volume.
One EC2 instance can have many EBS volumes. Max size of an EBS volume? 16 TB.
Need a 40 TB volume? Just combine as many EBS volumes as needed.
EBS volumes are not shareable between different EC2 instances.
Reliable!
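As a small illustration of the create-and-attach flow described above, here is a minimal boto3 sketch; the Availability Zone, size, and instance ID are hypothetical placeholders.

```python
# Minimal sketch: create an EBS volume and attach it to an EC2 instance.
# The AZ, size, and instance ID below are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# EBS volumes are bound to one Availability Zone (restriction number 1),
# so the volume must be created in the same AZ as the target instance.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=100,              # GiB; a single volume can go up to 16 TB
    VolumeType="gp2",
)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# Attach the volume; it appears as a block device inside the instance.
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",
    Device="/dev/sdf",
)
```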
EBS snapshots are copies of the blocks of the EBS volume, stored in S3. Snapshots are smart enough to copy only the blocks that were modified since the last snapshot, so successive snapshots are faster, depending on the amount of modifications.
Being stored in S3, the cost of the snapshots is the S3 cost of that stored information.
Being stored in S3, snapshots can be restored in a different AZ, or even a different region.
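A minimal boto3 sketch of that snapshot-and-copy flow follows; the volume ID and regions are hypothetical placeholders.

```python
# Minimal sketch: snapshot an EBS volume, then copy the snapshot to another
# region. Volume ID and regions are hypothetical placeholders.
import boto3

ec2_src = boto3.client("ec2", region_name="us-east-1")

# Snapshots are incremental: only blocks changed since the last snapshot
# are copied to S3, so successive snapshots complete faster.
snap = ec2_src.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description="nightly backup",
)
ec2_src.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# Because snapshots live in S3, they can be restored in another AZ, or even
# copied to another region for DR.
ec2_dst = boto3.client("ec2", region_name="eu-west-1")
ec2_dst.copy_snapshot(
    SourceRegion="us-east-1",
    SourceSnapshotId=snap["SnapshotId"],
    Description="DR copy of nightly backup",
)
```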
EBS limitations...... :(
When thinking about storage in on-premises environments, it is taken for granted that I can use NFS to share volumes between different servers.
A do-it-yourself NFS architecture in the cloud can be elaborate, with very heavy daily administration.
So... a do-it-yourself NFS server solution in the cloud is hard.
What if there were something better?
EFS!
EFS is a shared volume service, based on the NFS protocol.
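Since EFS is exposed over standard NFS, a minimal boto3 sketch of creating a file system and a mount target might look like the following; the subnet, security group, and mount path are assumptions.

```python
# Minimal sketch: create an EFS file system and a mount target, then mount it
# over NFS from an EC2 instance. Subnet/security group IDs are placeholders.
import boto3

efs = boto3.client("efs", region_name="us-east-1")

fs = efs.create_file_system(
    CreationToken="shared-app-data",       # idempotency token
    PerformanceMode="generalPurpose",
)

# One mount target per AZ lets instances in that AZ reach the file system.
efs.create_mount_target(
    FileSystemId=fs["FileSystemId"],
    SubnetId="subnet-0123456789abcdef0",
    SecurityGroups=["sg-0123456789abcdef0"],
)

# On each EC2 instance the share is mounted with a standard NFSv4 client, e.g.:
#   sudo mount -t nfs4 -o nfsvers=4.1 \
#       <file-system-id>.efs.us-east-1.amazonaws.com:/ /mnt/efs
```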
Traditional on-premises storage solutions are expensive, and there is the administration overhead of keeping space available. Storage solutions fill up from time to time.
S3 is great, but how much does it cost? Let's first discuss what we're going to charge you for. With traditional storage you pay for raw capacity, but after accounting for protection schemes such as RAID, file system overhead, and the need to keep a free storage reserve, you're left with much less capacity actually usable for data. With S3 you only pay for used capacity, when you use it. So in this example, for 400 TB you're really paying for just 400 TB, and this is not accounting for DR copies. This drastic difference affects both CAPEX and OPEX costs.
S3 is great. Unlimited storage, with a very low cost. But it is mainly accessible via HTTP/HTTPS.
Because the combination of a bucket, key, and version ID uniquely identifies each object, Amazon S3 can be thought of as a basic data map between "bucket + key + version" and the object itself. Every object in Amazon S3 can be uniquely addressed through the combination of the web service endpoint, bucket name, key, and optionally, a version.
For example, in the URL http://doc.s3.amazonaws.com/2006-03-01/AmazonS3.html, "doc" is the name of the bucket and "2006-03-01/AmazonS3.html" is the key.
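To make the bucket/key/version addressing model concrete, here is a minimal boto3 sketch; the bucket, key, and version ID echo the documentation example and are placeholders.

```python
# Minimal sketch of S3's "bucket + key (+ version)" addressing model.
# Bucket, key, and version ID are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

# The bucket name plus the object key uniquely identify the object...
obj = s3.get_object(Bucket="doc", Key="2006-03-01/AmazonS3.html")
body = obj["Body"].read()

# ...and on a versioned bucket, adding a version ID pins an exact revision.
old = s3.get_object(
    Bucket="doc",
    Key="2006-03-01/AmazonS3.html",
    VersionId="EXAMPLEVERSIONIDPLACEHOLDER",
)
```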
Across the board, S3, S3-IA, and Glacier all offer the same 11 9's durability: AWS stores data redundantly across multiple facilities and storage devices, and the services automatically perform data integrity checks in the background to guard against potential data corruption. I work with many customers who archive data by storing two copies on tape, either in the same building or one copy on-site and one remote. When we discuss durability, which is a big deal for many archive customers, many are accustomed to thinking in number of "copies" and found the 11 9's a bit non-intuitive. To bridge that, we did a thought experiment with a large studio where, at a high level, we walked them through how we derived the 11 9's using a Markov chain model in which we modeled failures of storage devices, servers, the network, availability zones, etc. We asked them to estimate their two-copy tape durability using a similar concept, and they estimated ~4 9's for two copies in a single building or ~5 9's for two copies in separate locations. This helped them realize that Glacier's 11 9's durability can be thought of as 6 to 7 orders of magnitude more durable than two copies of tape, and it helped us bridge the conversation.
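A back-of-the-envelope way to see the "orders of magnitude" comparison from that thought experiment; the tape durability figures are the customer's rough estimates, not measured values, and only the 11 9's figure comes from AWS.

```python
# Rough arithmetic behind the durability comparison above. The tape figures
# are the studio's own estimates; only the 11 9's figure comes from AWS.
s3_glacier_durability = 0.99999999999   # "11 9's"
tape_two_copies_one_site = 0.9999       # ~4 9's (customer estimate)
tape_two_copies_two_sites = 0.99999     # ~5 9's (customer estimate)

def annual_loss_probability(durability):
    return 1.0 - durability

ratio = (annual_loss_probability(tape_two_copies_one_site)
         / annual_loss_probability(s3_glacier_durability))
print(f"{ratio:.0e}")   # ~1e+07, i.e. roughly 7 orders of magnitude
```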
When you view our object storage as a portfolio of storage classes, we provide three storage options with different performance characteristics and price points.
S3 Standard is our high-performance object storage for very active, hot workloads: data is available in milliseconds, and pricing starts at 2.1 cents/GB/month depending on the region.
S3 Standard – Infrequent Access shares the same millisecond access times as S3 Standard, but it is designed for data you plan to access maybe a few times a year, or what we think of as "active archive". S3-IA costs $0.0125/GB/month, plus a nominal fee per request.
Glacier is the cold archival tier: access latency ranges from minutes to hours, depending on the retrieval option you choose, and storage costs $0.004/GB/month.
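Using the per-GB prices quoted above (which vary by region and over time, and exclude request and retrieval fees), a quick sketch of the monthly storage bill for data spread across the three classes; the data split is a made-up example.

```python
# Quick cost sketch using the per-GB/month prices quoted above; actual prices
# vary by region and do not include request or retrieval fees.
PRICE_PER_GB_MONTH = {
    "STANDARD":    0.021,
    "STANDARD_IA": 0.0125,
    "GLACIER":     0.004,
}

# Hypothetical split: 50 TB hot, 150 TB infrequently accessed, 200 TB archive.
data_tb = {"STANDARD": 50, "STANDARD_IA": 150, "GLACIER": 200}

monthly = sum(PRICE_PER_GB_MONTH[c] * tb * 1024 for c, tb in data_tb.items())
print(f"~${monthly:,.0f}/month")   # ~$1,075 + $1,920 + $819, about $3,814
```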
The first challenge for many organizations is the physics of moving data. Customers have asked us for help moving their data, for things like:
datacenter shutdowns
remote sites
migrating existing enterprise applications
building hybrid workflows that can still accommodate on-premises data
S3 is an industry standard for backup solutions.
But even legacy backup systems can use S3 via a deployed AWS Storage Gateway.
S3 is an industry standard for backup solutions. Yes, it is.
For more bandwidth and throughput, Direct Connect is the way to go.
What is AWS Import/Export Snowball?
Snowball is a new AWS Import/Export offering that provides a petabyte-scale data transfer service that uses Amazon-provided storage devices for transport. Previously, customers purchased their own portable storage devices and used these devices to ship their data. With the launch of Snowball, customers are now able to use highly secure, rugged, Amazon-owned Network Attached Storage (NAS) devices, called Snowballs, to ship their data. Once the device is received and set up, customers can copy up to 50 TB of data from their on-premises file system to the Snowball using the Snowball client software over a 10 Gbps network interface. Prior to transfer to the Snowball, all data is encrypted by the client with 256-bit encryption. When customers finish transferring data to the device, they simply ship it back to an AWS facility, where the data is ingested at high speed into Amazon S3.
Compare and contrast Internet vs 1x Snowball.
Compare and contrast Internet vs 5x Snowball.
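The arithmetic behind the "days over the Internet" tables on slides 71 and 72 is straightforward; here is a small sketch that approximately reproduces the 50 TB numbers (differences of a day or two come from rounding and overhead assumptions).

```python
# Approximately reproduce the "days to transfer 50 TB over the Internet" table.
def transfer_days(data_tb, link_gbps, utilization):
    bits = data_tb * 1e12 * 8                      # decimal TB -> bits
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86400

for util in (0.25, 0.50, 0.75):
    row = [round(transfer_days(50, gbps, util))
           for gbps in (1.0, 0.5, 0.3, 0.15)]
    print(f"{int(util * 100)}%: {row}")
# 25% prints [19, 37, 62, 123], close to the slide's 19 / 38 / 63 / 126
```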
In the fullness of time we see hybrid cloud storage addressing needs at the edge of your networks. Customers asked for a way to incorporate a simple, detached cloud storage platform with some computing capability at the edge of their networks, for applications like wind farms, medical devices, shipboard scientific computing, and manufacturing shop floors.
AWS Snowball Edge is a petabyte-scale data transfer solution with temporary on-premises storage and compute capabilities. It transports up to 100TB of data with the same embedded cryptography and security as the original Snowball, and may also integrate smoothly with existing workflows, scale local capacity, and process stored data. Snowball Edge hosts a file server and an S3-compatible endpoint that allow you to use the NFS protocol, S3 SDK or S3 CLI to transfer data directly to the device without specialized client software. Multiple units may be clustered together, forming a temporary data collection storage tier in your datacenter so you can work as data is generated without managing copies. As storage needs scale up and down, devices can be easily added or removed from the local cluster and returned to AWS.
Snowball Edge also comes with embedded computing power (equivalent to an EC2 m4.4xlarge instance) that hosts a platform for general compute tasks. AWS Lambda functions can run on the device to do things like examine a data stream collected from an IoT sensor, search for anomalies, create aggregated metrics, or send alarms or control signals. Environments with unstable connectivity but high operational demands can run data processes redundantly on Snowball Edge devices, protecting against connectivity issues and eventually returning the captured and processed results to AWS.
Snowball Edge is designed to keep data and applications secure while on site or in transit to AWS, making it appropriate for even the most sensitive customer data. The hardware and software are cryptographically signed, and all stored data is automatically encrypted using 256-bit encryption keys, owned by the customer and managed by AWS Key Management Service (KMS). Customer data stays encrypted in the appliance and is decrypted only when it is copied from the appliance to AWS. Encryption is now performed on the device, instead of on the client, producing higher data throughput rates and reducing overall processing time.
Snowball Edge devices are Amazon-owned and eliminate the need for customers to invest in new hardware. Customers pay $300 plus shipping per device and a $30 per day usage fee, applied after the initial 10 days on site. If more capacity is needed at the edge, multiple devices can be requested and used together in a cluster. Amazon monitors the health and utilization of Snowballs and provides replacement devices when needed. Current Snowball data transport appliances in 50TB and 80TB volumes will continue to be available in addition to the new Snowball Edge. Availability in regions will vary, please check the Snowball product page for additional information.
Philips Healthcare develops technology solutions for consumers, patients, providers and caregivers across the health continuum, from supporting healthy living and prevention to diagnosis, treatment and home care. They embedded Snowball v2 devices in their hospital networks to collect data and initiate real-time analytics. Now the hospital staff no longer waits for answers and they have a local dataset to run on in case of any connectivity issues.
Databases – any type of them – want big and fast disks. EBS fits the case.
Observation: EFS is not a database-friendly solution.
EBS disks can be attached to RDS DB servers or EC2 servers. Keep in mind that EC2 Auto Scaling machines will die and their disks will be lost. Any information that must be saved is better placed in a database table or an S3 bucket.
EBS disks attached to a database server are permanent. But even so, a database backup is always needed for production systems.
For those legacy systems that are not auto-scaling friendly and must write information to local disks without losing it, EFS is the solution. Any generated information can be saved to a database table, an S3 bucket, or an EFS share.
Because of the unlimited storage space available on S3, S3 is a natural component of Big Data solutions on AWS. Information stored in S3 can be accessed by any Big Data solution simultaneously.
Big Data is about a variety of information, stored in a huge number of files that need to be processed.
And EMR clusters, running Hadoop, Spark, or others, can process all that information stored in an S3-based data lake.
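To make the "EMR processing an S3-based data lake" point concrete, here is a minimal PySpark sketch that could run on such an EMR cluster; the bucket path and column names are hypothetical, echoing the order-log example from the earlier slide.

```python
# Minimal PySpark sketch for an EMR cluster reading raw logs directly from an
# S3-based data lake. The bucket path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-from-s3").getOrCreate()

# EMR reads S3 directly (via EMRFS), so no copy into HDFS is needed first.
orders = spark.read.csv(
    "s3://my-datalake-bucket/raw-logs/orders/",
    header=True,
    inferSchema=True,   # expects columns like Order_ID, Customer_ID, Order_date, Total
)

daily_totals = (
    orders.groupBy("Order_date")
          .agg(F.sum("Total").alias("daily_total"))
)

# Write the aggregate back to S3 as Parquet for cheaper, faster querying.
daily_totals.write.mode("overwrite").parquet(
    "s3://my-datalake-bucket/aggregates/daily_totals/"
)
```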
Netflix runs large EMR clusters for
Here is a high level overview of the solution we came up with.
These diagrams can all generally be read left-to-right, top-to-bottom. Anything contained within the dashed blue lines are systems running in Nasdaq datacenters, everything else is assumed to be in AWS.
So, we have systems inside Nasdaq which write data into a temp bucket in S3, which is then loaded into Redshift using COPY SQL commands.
For some data, we perform transformations and aggregations inside Redshift, then unload those results back to the temporary S3 bucket.
In all cases, original data or aggregates, we process the CSV data in the temporary bucket to produce Parquet files, which are stored in a separate S3 bucket for long-term storage.
Presto, running in an EMR cluster, is then used to query the data stored in those files in S3.
This transformation into Parquet is currently performed in Nasdaq datacenters, however this is a stop-gap measure until we move our data ingest system into EC2, which we are planning to complete in early 2016.
In both cases, SQL clients access the databases directly through JDBC.
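As an illustration only, a sketch of what the "COPY from the temporary S3 bucket into Redshift" step might look like; the table name, bucket path, IAM role, and connection details are all placeholders, not Nasdaq's actual configuration.

```python
# Illustrative sketch of loading CSV data from a temporary S3 bucket into
# Redshift with a COPY command. All names, paths, and credentials here are
# hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="loader", password="example-password",
)

copy_sql = """
    COPY trades_staging
    FROM 's3://example-temp-bucket/ingest/2015-12-01/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    CSV GZIP;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)   # Redshift pulls the files from S3 in parallel
```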
Summary. Of course, for more specific details let’s talk to the SA about each use case.