Solving Big Data problems on AWS by Rajnish Malik
1. Solving Big Data Problems on AWS
Rajnish Malik
Email: rajnishm@amazon.com
Contact number: 09833311878
2. The World is Producing Ever-Larger Volumes of Big Data
[Chart: data volume scale, from GB through TB, PB, and EB to ZB]
• IT / application server logs: IT infrastructure logs, metering, audit logs, change logs
• Web sites / mobile apps / ads: clickstream, user engagement
• Sensor data: weather, smart grids, wearables
• Social media, user content: 450MM+ tweets/day
3. Big Data: Unconstrained data growth
95% of the 1.2 zettabytes of data in the digital universe is unstructured.
70% of this is user-generated content.
Unstructured data growth is explosive, with estimates of compound annual growth rate (CAGR) at 62% from 2008–2012.
Source: IDC
[Chart: data volume scale, from GB to ZB]
12. Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
[Chart: generated data vs. data available for analysis, 1990–2020, with a growing gap between the two]
13. Elastic and highly scalable + No upfront capital expense + Only pay for what you use + Available on-demand = Remove constraints
15. Big data and AWS Cloud computing
Big data: variety, volume, and velocity requiring new tools
Cloud computing: variety of compute, storage, and networking options
16. Big data and AWS Cloud computing
Big data: potentially massive datasets
Cloud computing: massive, virtually unlimited capacity
17. Big data and AWS Cloud computing
Big data: iterative, experimental style of data manipulation and analysis
Cloud computing: iterative, experimental style of infrastructure deployment/usage
18. Big data and AWS Cloud computing
Big data: frequently not a steady-state workload; peaks and valleys
Cloud computing: at its most efficient with highly variable workloads
19. Big data and AWS Cloud computing
Big data: absolute performance not as critical as "time to results"; shared resources are a bottleneck
Cloud computing: parallel compute projects allow each workgroup to have more autonomy and get faster results
20. Only pay for what you use
No capital investment
Pay as you go
Lower costs
29. Free steak campaign
Disaster recovery
Web site & media sharing
Facebook app
Ground campaign
SAP & SharePoint
Marketing web site
Business line of sight
Consumer social app
IT operations
Mars exploration ops
Interactive TV apps
Media streaming
Consumer social app
Facebook page
Securities Trading Data Archiving
Financial markets analytics
Web and mobile apps
Big data analytics
Digital media
Ticket pricing optimization
Streaming webcasts
Mobile analytics
Consumer social app
Core IT and media
Due to the convergence of cloud, mobile, and social technologies, along with advancements in fields such as genomics, life sciences, and space, the size of the digital universe is growing at an ever-increasing rate.
Customers have also found tremendous value in mining this data: to develop better medicines, tailor purchasing recommendations, detect fraudulent financial transactions in real time, provide on-demand digital content such as movies and songs, improve weather forecasts; the list goes on and on.
We see big data as having a lifecycle with several high-level but distinct phases, from generation to storage to analysis and sharing.
Big data has received a lot of attention over the last few years due to the ever-increasing scale of volume, velocity, and variety, the famous three Vs of big data. Let's talk about the generation of data.
Big data is being used in many different use cases because the cost of generating data keeps falling while aggregate throughput keeps increasing.
Here are a few big data use cases,
which require a lot of metrics,
from many different kinds of sources, including machines and application logs,
in a variety of different formats and timeframes.
These feed into the reason we have big data: to gain knowledge through various types of analysis, building situational awareness, discovering patterns and trends, and making predictions.
The need for and value of data, along with the ease of generating it, puts pressure on the rest of the big data lifecycle.
There is an estimated and growing gap between what data is generated and what is readily available for analysis.
There is one more point to make about the current state of data analysis.
Various analysts have attempted to quantify the gap between data generated by applications and data that makes its way into an analytical environment.
The general trend is that the gap is large and growing; people make decisions about what data to keep and what to leave on the cutting room floor.
However, we feel big data is an asset to an organization, on par with capital and labor. Cloud computing enables you to flip the script: instead of asking questions based on the data you have decided to keep, ask what you should be learning from ALL of your data. You no longer have to let your data model dictate what you keep; keep everything and evolve your data model.
The cloud makes this possible by removing those constraints throughout the big data lifecycle.
Infrastructure that can scale to meet increased demand, with the ability to add or remove resources on demand and without a large upfront capital investment, helps remove those constraints.
When we think of big data, we think of both the proliferation of digital information and the innovations that exploit or extract information from that data: increased sales, greater efficiency, better health, analysis, predictions, recommendations, and innovation.
More specifically, we think cloud computing is a fundamental component to any big data strategy due to its inherent benefits
We will go over several of these storage and compute options
From TBs to PBs, we have the capacity and scale to handle your largest big data workloads
You can start and stop on demand, run big data workloads in parallel as you test out new ideas, allowing you to explore without commitments
With services such as Auto Scaling and elastic load balancing, you can dial up and down the amount of infrastructure you need for your variable or experimental workloads
The total time also includes waiting to get access to those IT resources; with the cloud you can be up and running in minutes, working in parallel.
We provide all of our services with a self-service API. We also provide managed services so you don't have to do the back-end administration, and you can configure your infrastructure with code, with scripts, or by point-and-click from our console, all while maintaining compatibility with your current tools.
However, we don't believe there is one tool that can do everything; rather, by using the right tools you can build a highly configurable big data architecture to meet your specific needs.
While I won't be able to go over all of our big data services, I would like to spend some time introducing several key big data services that are designed for high availability and durability,
offered as managed services where we provision the infrastructure on your behalf,
so you can get significant big data storage and analytics with a few clicks or API calls.
Amazon S3 is fundamental storage at internet scale; it can store any number of objects from 1 byte to 5 TB in size.
It is engineered for 11 nines of durability, replicating your data at least three times across three distinct physical data centers we call Availability Zones.
We have customers such as Dropbox, Spotify, and Pinterest storing billions of objects: photos, videos, songs, or any other type of file.
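As a sketch of how an application might put files into S3 (the bucket and prefix names below are hypothetical, and the actual upload call requires AWS credentials and the boto3 SDK, so the import is kept local to that function):

```python
def object_key(prefix, filename):
    # S3 has a flat namespace; "/" in a key is only a naming convention,
    # but it lets consoles and tools present objects as folders.
    return f"{prefix.rstrip('/')}/{filename}"

def upload(bucket, prefix, local_path):
    # Hypothetical helper: needs AWS credentials and boto3 to run.
    import boto3
    s3 = boto3.client("s3")
    key = object_key(prefix, local_path.rsplit("/", 1)[-1])
    # upload_file transparently switches to multipart upload for large
    # files, which is how objects up to 5 TB in size are stored.
    s3.upload_file(local_path, bucket, key)
    return key
```

For example, upload("media-bucket", "photos/2014", "/tmp/cat.jpg") would store the file under the key photos/2014/cat.jpg.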
DynamoDB is a fast, fully managed NoSQL database service that makes it simple and cost-effective to store and retrieve any amount of data, and serve any level of request traffic. Its guaranteed throughput and single-digit millisecond latency make it a great fit for gaming, ad tech, mobile and many other applications.
It runs on solid-state drives for high-speed performance at scale, and you can provision reads and writes to a table without having to worry about the administration of scaling or sharding; it is all done behind the scenes for you.
For instance, in real-time bidding, three rounds of bidding over which ad to place on a website complete in less than 200 milliseconds while the page loads; that requires single-digit-millisecond latency to determine which ad to place and what price to bid for the impression.
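A minimal sketch of what provisioning that throughput looks like (the table and attribute names are made up; the create_table call needs AWS credentials and boto3, so only the request shape is built and checked here):

```python
def table_spec(name, hash_key, read_capacity, write_capacity):
    # Request shape for DynamoDB CreateTable: you declare the reads and
    # writes per second you need, and the service shards behind the scenes.
    return {
        "TableName": name,
        "KeySchema": [{"AttributeName": hash_key, "KeyType": "HASH"}],
        "AttributeDefinitions": [
            {"AttributeName": hash_key, "AttributeType": "S"}
        ],
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_capacity,
            "WriteCapacityUnits": write_capacity,
        },
    }

def create_table(spec):
    # Sketched, not exercised: requires AWS credentials and the boto3 SDK.
    import boto3
    return boto3.client("dynamodb").create_table(**spec)
```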
When you think of big data these days, Hadoop is always an integral part. When you take the benefits of what the cloud can do along with the computational paradigm of MapReduce, you get Amazon Elastic MapReduce. Customers have launched millions of clusters to run big data workloads.
A key tool in the toolbox to help with "Big Data" challenges
Makes possible analytics processes previously not feasible
Cost effective when leveraged with the EC2 Spot market
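The MapReduce paradigm that EMR runs can be illustrated with the classic word count, sketched here in plain Python; a real Hadoop job would distribute the map and reduce phases across the cluster's nodes:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map phase: emit a (key, value) pair for every word seen.
    for word in line.lower().split():
        yield word, 1

def reducer(pairs):
    # Reduce phase: sum the values for each key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def word_count(lines):
    return reducer(chain.from_iterable(mapper(line) for line in lines))
```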
Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. Amazon Kinesis can collect and process hundreds of terabytes of data per hour from hundreds of thousands of sources.
For instance, instead of having to process log files in batch, you can stream log events into Kinesis and then have workers with the Kinesis Client Library read from the stream, process the information, and drive a real-time dashboard.
Later today, the product manager for Amazon Kinesis, Adi Krishnan, will give a deep dive into the service.
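Kinesis scales through shards: each record carries a partition key, and the service takes the MD5 hash of that key as a 128-bit integer to pick the shard whose hash-key range contains it. Assuming the shards evenly split the hash-key space, the routing reduces to a sketch like:

```python
import hashlib

def shard_for(partition_key, num_shards):
    # MD5 of the partition key, read as a 128-bit integer, mapped into
    # num_shards equal hash-key ranges.
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return h * num_shards // 2**128
```

Records with the same partition key (say, one web server's hostname) always land on the same shard, which preserves their ordering within that shard.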
With Amazon Redshift, you can provision a petabyte-scale cluster to handle complex SQL queries in just a few minutes.
You can choose either an HDD-based cluster or the recently introduced SSD-based cluster, which is smaller in total cluster size but offers higher performance per GB.
This data warehouse solution costs about a tenth of what traditional solutions of comparable size cost.
Redshift can drive business intelligence tools such as Jaspersoft or MicroStrategy because it supports standard SQL and can connect using ODBC or JDBC drivers.
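Because Redshift is reached over standard JDBC/ODBC, pointing a BI tool at a cluster is just a matter of building the usual connection URL; a small sketch (the endpoint name here is hypothetical, and 5439 is Redshift's default port):

```python
def jdbc_url(endpoint, port=5439, database="dev"):
    # Standard JDBC URL shape used by the Redshift JDBC driver.
    return f"jdbc:redshift://{endpoint}:{port}/{database}"
```

For example, jdbc_url("analytics-cluster.example.us-east-1.redshift.amazonaws.com") yields a URL a BI tool can use as its data source.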
We have had many customers, from startups to enterprises, government agencies, and banks, run big data workloads such as analytics powering recommendations of where to eat.
For collection and storage, we have a variety of storage options depending on your requirements.
Direct Connect, Storage Gateway, Import/Export, Glacier, RDS
EMR Integrates with The Hadoop Ecosystem Tools
Kinesis tools
Nutch – web crawler software
Cascading – data processing
HBase – large tables, NoSQL
Cassandra – NoSQL database
Chukwa – data collection system
Pig – create MapReduce programs with easy scripting
Thrift – build services, interfaces
Hive – SQL on MapReduce
HDFS – distributed file system
Avro – compact binary serialization
MapReduce – process large data sets in parallel
Mahout – machine learning
Flume – collect, aggregate, and move large amounts of log data
Sqoop – command-line tool to transfer data between Hadoop and relational databases
In summary, AWS provides you the tools so you can pick the right one at the scale that you need when you need it.
Life technologies
LinkedIn
DropCam
ICRAR
CDC
Channel4
Yelp
Nokia
AWS Marketplace is the AWS Online Software Store
Customers can find, research, and buy software, including a wide variety of big data options and software to help you manage your databases.
With AWS Marketplace, the simple hourly pricing of most products aligns with the EC2 usage model.
You can find, purchase, and 1-Click launch in minutes, making deployment easy.
Marketplace billing is integrated into your AWS account.
1300+ product listings across 25 categories
The 1000 Genomes Project aims to build the most detailed map of human genetic variation, ultimately with data from the genomes of over 2,600 people from 26 populations around the world. The data contained within this release include results from sequencing the DNA of approximately the first 1,700 of the over 2,600 people; the remaining samples are expected to be sequenced in 2012, and the data will be released to researchers as soon as possible. The data presented here, over 200 TB, is intended for use in analysis on Amazon EC2 or Elastic MapReduce, rather than for download.
NASA NEX
Three NASA NEX datasets are now available, including climate projections and satellite images of Earth.
NASA NEX is a collaboration and analytical platform that combines state-of-the-art supercomputing, Earth system modeling, workflow management and NASA remote-sensing data. Through NEX, users can explore and analyze large Earth science data sets, run and share modeling algorithms, collaborate on new or existing projects and exchange workflows and results within and among other science communities.
A corpus of web crawl data composed of over 5 billion web pages. This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use.
541TB
Common Crawl is a non-profit organization dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone.
The most current crawl data sets include three different types of files: Raw Content, Text Only, and Metadata. The data sets from before 2012 contain only Raw Content files.
For more details about the file formats and directory structure please see this blog post.
Common Crawl provides the glue code required to launch Hadoop jobs on Amazon Elastic MapReduce that run against the crawl corpus residing here in the Amazon Public Data Sets. By using Amazon Elastic MapReduce to access the S3-resident data, end users avoid costly network transfer.
To learn more about Amazon Elastic MapReduce please see the product detail page.
Common Crawl's Hadoop classes and other code can be found in its GitHub repository.
Three NASA NEX data sets are now available to all via Amazon S3. One data set, the NEX downscaled climate simulations, provides high-resolution climate change projections for the 48 contiguous U.S. states. The second data set, provided by the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument on NASA's Terra and Aqua satellites, offers a global view of Earth's surface every 1 to 2 days. Finally, the Landsat data record from the U.S. Geological Survey provides the longest existing continuous space-based record of Earth's land.
The data sets are available at:
s3://nasanex/NEX-DCP30
s3://nasanex/MODIS
s3://nasanex/Landsat
You can learn more about the NASA NEX data sets here.
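Since the nasanex bucket is public, it can be browsed without an AWS account; a sketch using anonymous (unsigned) requests with boto3. The listing call needs the boto3/botocore SDK and network access, so only the prefix-building helper is exercised here:

```python
NEX_DATASETS = {"NEX-DCP30", "MODIS", "Landsat"}

def nex_prefix(dataset):
    # The three prefixes under s3://nasanex/ listed above.
    if dataset not in NEX_DATASETS:
        raise ValueError(f"unknown NEX dataset: {dataset}")
    return dataset + "/"

def list_nex(dataset, max_keys=20):
    # Anonymous access to the public bucket via an unsigned S3 client.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    resp = s3.list_objects_v2(
        Bucket="nasanex", Prefix=nex_prefix(dataset), MaxKeys=max_keys
    )
    return [obj["Key"] for obj in resp.get("Contents", [])]
```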