Join us for an in-depth look at the current state of big data at AWS. Learn about the latest big data trends and industry use cases. Hear how other organizations are using the AWS big data platform to innovate and remain competitive. Take a look at some of the most recent AWS big data developments.
3. Big Data on AWS
Immediate Availability. Deploy instantly. No hardware to
procure, no infrastructure to maintain & scale.
Trusted & Secure. Designed to meet the strictest
requirements. Continuously audited, including certifications
such as ISO 27001, FedRAMP, DoD CSM, and PCI DSS.
Broad & Deep Capabilities. Over 100 services and 100s
of features to support virtually any big data application &
workload.
Hundreds of Partners & Solutions. Get help from a
consulting partner or choose from a multitude of tools and
applications across the entire data management stack.
26. Processing & Analytics
Transactional & RDBMS
DynamoDB (NoSQL DB)
Aurora (Relational Database)
BI & Data Visualization
Kinesis Streams & Firehose
Batch
EMR (Hadoop, Spark, Presto)
Redshift (Data Warehouse)
Athena (Query Service)
AWS Batch
Predictive
Real-time
AWS Lambda
Apache Storm on EMR
Apache Flink on EMR
Spark Streaming on EMR
Elasticsearch Service
Kinesis Analytics, Kinesis Streams
ElastiCache, DAX
27. How to be successful in serving your
customers/citizens
E*
BI
RT
ML
Amazon EC2
Amazon ECS
AWS Elastic Beanstalk
Amazon Redshift
Amazon EMR
Amazon QuickSight
Amazon Kinesis
Amazon Elasticsearch
Amazon AI
Spark ML (on Amazon EMR)
42. Start Querying Instantly
Serverless. No ETL.
Pay Per Query
Only pay for data scanned.
Open. Powerful. Standard
Built on Presto. Runs standard SQL.
Fast. Really Fast
Interactive performance
even for large datasets
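To make "only pay for data scanned" concrete, here is a minimal sketch of the billing arithmetic. The function name `query_cost` and the per-terabyte price are placeholders for illustration, not published pricing.

```python
def query_cost(bytes_scanned: int, price_per_tb: float) -> float:
    """Cost of one query when billing is proportional to data scanned."""
    tb = 10 ** 12  # decimal terabyte; an assumption for this sketch
    return bytes_scanned / tb * price_per_tb


# Hypothetical price of 5.0 currency units per TB scanned.
full_scan = query_cost(2 * 10**12, price_per_tb=5.0)

# Scanning one tenth of the bytes (for example after compression or
# column pruning) costs one tenth as much.
pruned_scan = query_cost(2 * 10**11, price_per_tb=5.0)

print(full_scan, pruned_scan)
```

The point of the sketch is that anything that reduces bytes scanned reduces cost in the same proportion.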
44. Let’s build an analytic query - #1
An author is releasing the 8th book in her popular series. How
many should we order for Seattle? What were sales in the first
few days for her prior books?
Let’s get the prior books she’s written.
1 Table
2 Filters
SELECT
P.ASIN,
P.TITLE
FROM
products P
WHERE
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling'
45. Let’s build an analytic query - #2
An author is releasing the 8th book in her popular series. How
many should we order for Seattle? What were sales in the first
few days for her prior books?
Let’s compute the sales of the prior books she’s written in this
series and return the top 20 values
2 Tables (1 S3, 1 local)
2 Filters
1 Join
2 Group By columns
1 Order By
1 Limit
1 Aggregation
SELECT
P.ASIN,
P.TITLE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
products P
WHERE
D.ASIN = P.ASIN AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling'
GROUP BY P.ASIN, P.TITLE
ORDER BY SALES_sum DESC
LIMIT 20;
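A small runnable sketch of the same query shape (one join, two filters, a grouped SUM, ORDER BY, LIMIT) on toy data. It uses Python's built-in SQLite in place of the deck's engine, drops the `s3.` schema prefix, and every row below is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (asin TEXT, title TEXT, author TEXT);
CREATE TABLE d_customer_order_item_details (
    asin TEXT, quantity INTEGER, our_price REAL);

-- Invented rows, not real sales data.
INSERT INTO products VALUES
    ('A1', 'Potter Book One', 'J. K. Rowling'),
    ('A2', 'Potter Book Two', 'J. K. Rowling'),
    ('A3', 'Unrelated Title', 'Someone Else');
INSERT INTO d_customer_order_item_details VALUES
    ('A1', 2, 10.0), ('A1', 1, 10.0), ('A2', 5, 8.0), ('A3', 9, 1.0);
""")

rows = conn.execute("""
SELECT P.asin, P.title, SUM(D.quantity * D.our_price) AS sales_sum
FROM d_customer_order_item_details D, products P
WHERE D.asin = P.asin
  AND P.title LIKE '%Potter%'
  AND P.author = 'J. K. Rowling'
GROUP BY P.asin, P.title
ORDER BY sales_sum DESC
LIMIT 20
""").fetchall()

for row in rows:
    print(row)
```

The unrelated title is filtered out before aggregation, and the two matching books come back ordered by total sales.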
46. Let’s build an analytic query - #3
An author is releasing the 8th book in her popular series. How
many should we order for Seattle? What were sales in the first
few days for her prior books?
Let’s compute the sales of the prior books she’s written in this
series and return the top 20 values, just for the first three days
of sales of first editions
3 Tables (1 S3, 2 local)
5 Filters
2 Joins
3 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
P.ASIN,
P.TITLE,
P.RELEASE_DATE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
asin_attributes A,
products P
WHERE
D.ASIN = P.ASIN AND
P.ASIN = A.ASIN AND
A.EDITION LIKE '%FIRST%' AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling' AND
D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
47. Let’s build an analytic query - #4
An author is releasing the 8th book in her popular series. How
many should we order for Seattle? What were sales in the first
few days for her prior books?
Let’s compute the sales of the prior books she’s written in this
series and return the top 20 values, just for the first three days
of sales of first editions in the city of Seattle, WA, USA
4 Tables (1 S3, 3 local)
8 Filters
3 Joins
4 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
P.ASIN,
P.TITLE,
R.POSTAL_CODE,
P.RELEASE_DATE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
asin_attributes A,
products P,
regions R
WHERE
D.ASIN = P.ASIN AND
P.ASIN = A.ASIN AND
D.REGION_ID = R.REGION_ID AND
A.EDITION LIKE '%FIRST%' AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling' AND
R.COUNTRY_CODE = 'US' AND
R.CITY = 'Seattle' AND
R.STATE = 'WA' AND
D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
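The two ORDER_DAY filters together define a half-open window: the release day is included, and the day three days later is excluded. A minimal sketch of that predicate follows; the function name `in_launch_window` and the dates are hypothetical.

```python
from datetime import date, timedelta


def in_launch_window(order_day: date, release_date: date, days: int = 3) -> bool:
    """True when release_date <= order_day < release_date + days."""
    return release_date <= order_day < release_date + timedelta(days=days)


release = date(2020, 1, 10)  # hypothetical release date

for offset in (-1, 0, 1, 2, 3):
    day = release + timedelta(days=offset)
    print(day, in_launch_window(day, release))
```

With a three-day window, exactly the release day and the two days after it qualify.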
48. Now let’s run that query over an exabyte of data in S3
Roughly 140 TB of customer item order detail
records for each day over the past 20 years.
190 million files across 15,000 partitions in S3.
One partition per day for USA and rest of world.
Need a billion-fold reduction in data processed.
Running this query using a 1000 node Hive cluster
would take over 5 years.*
• Compression: 5X
• Columnar file format: 10X
• Scanning with 2500 nodes: 2500X
• Static partition elimination: 2X
• Dynamic partition elimination: 350X
• Redshift’s query optimizer: 40X
Total reduction: 3.5B X
* Estimated using a 20-node Hive cluster and 1.4 TB of data, assuming linear scaling
* Query used a 20 node DC1.8XLarge Amazon Redshift cluster
* Not actual sales data - generated for this demo based on data
format used by Amazon Retail.
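The headline numbers on this slide can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 EB = 1,000,000 TB) and ignores leap days; the variable names are ours, the figures are the slide's.

```python
from math import prod

# Data volume: roughly 140 TB per day for 20 years.
tb_per_day = 140
total_tb = tb_per_day * 365 * 20
total_eb = total_tb / 1_000_000  # decimal exabytes

# Reduction factors as listed on the slide.
factors = {
    "compression": 5,
    "columnar file format": 10,
    "scanning with 2500 nodes": 2500,
    "static partition elimination": 2,
    "dynamic partition elimination": 350,
    "query optimizer": 40,
}
total_reduction = prod(factors.values())

print(f"{total_eb:.3f} EB")     # about one exabyte
print(f"{total_reduction:,}X")  # 3,500,000,000X
```

The factors multiply rather than add, which is how six modest-looking gains combine into a 3.5 billion-fold reduction.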
49. Is Amazon Redshift Spectrum useful if I don’t have an exabyte?
Your data will get bigger
• On average, data warehousing volumes grow 10x every 5 years
• The average Amazon Redshift customer doubles data each year
Amazon Redshift Spectrum makes data analysis simpler
• Access your data without ETL pipelines
• Teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake
Amazon Redshift Spectrum improves availability and concurrency
• Run multiple Amazon Redshift clusters against common data
• Isolate jobs with tight SLAs from ad hoc analysis
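The two growth figures above are stated over different periods. A short sketch puts them on the same footing, assuming steady compound growth (a simplification of ours).

```python
# "10x every 5 years" expressed as an annual growth factor.
annual_factor_from_10x = 10 ** (1 / 5)

# "Doubles each year" expressed as a five-year growth factor.
five_year_factor_from_doubling = 2 ** 5

print(f"10x per 5 years is about {annual_factor_from_10x:.2f}x per year")
print(f"2x per year is {five_year_factor_from_doubling}x per 5 years")
```

Under that assumption, doubling every year compounds to 32x over five years, noticeably faster than the 10x industry average.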
57. Rekognition: Use Cases
Digital Asset Management
Media and Entertainment
Travel and Hospitality
Influencer Marketing
Systems Integration
Digital Advertising
Consumer Storage
Law Enforcement
Public Safety
eCommerce
Education
60. What does the customer say?
https://aws.amazon.com/solutions/case-studies/analytics/
https://aws.amazon.com/solutions/case-studies/big-data/
61. JustGiving Creates a Big Data Platform on AWS
“Before AWS, [we were]
basing decisions on a
single high-level data
source. Now we can extract
much more granular data
based on millions of
donations…and use that
information to provide a
better platform for our
visitors.”
-Richard Atkinson, CIO
62. UMUC Improves Student Outcomes with Big Data
“Nobody can match
AWS’ product set, scale
and innovation. From an
analytics perspective,
Amazon Redshift is very
disruptive.”
-Darren Catalano, VP of
Analytics
63. FINRA Analyzes Billions of Transactions Daily
To respond to
rapidly changing
market dynamics,
FINRA moved 75% of
its operations to
Amazon Web
Services, using AWS
to analyze 75B
records a day.
66. AWS is Positioned as a Leader in the Gartner Magic
Quadrant for Data Management Solutions for
Analytics
Gartner, Magic Quadrant for Data Management Solutions for Analytics, February 2017
This graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from AWS :
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research
publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of
67. AWS Named as a Leader in The Forrester
Wave™: Big Data Warehouse Q2 2017
http://bit.ly/2w1TAEy
On June 15, Forrester published the Big Data
Warehouse, Q2 2017, in which AWS is
positioned as a Leader. According to Forrester,
“With more than 5,000 deployments, Amazon
Redshift has the largest data warehouse
deployments in the cloud.” AWS received the
highest score possible, 5/5, for customer base,
market awareness, ability to execute, road map,
support, and partners. “AWS’s key strengths lie
in its dynamic scale, automated administration,
flexibility of database offerings, good security,
and high availability (HA) capabilities, which
make it a preferred choice for customers.”
69. • AWS enables you to build sophisticated data strategies and related
analytics applications
• Retrospective, Real-time, Predictive
• You can build incrementally, adding use cases and increasing scale
as you go
• AWS provides a broad range of security and auditing features to
enable you to meet your security requirements
https://aws.amazon.com/big-data/
70. • Prescriptive guidance and rapidly deployable solutions.
• Derive Insights from IoT in Minutes using AWS IoT, Amazon
Kinesis Firehose, Amazon Athena, and Amazon QuickSight
• Deploying a Data Lake on AWS
• Harmonize, Search, and Analyze Loosely Coupled Datasets on
AWS with Glue, Athena and QuickSight
• From Data Lake to Data Warehouse: Enhancing Customer 360
with Amazon Redshift Spectrum
• Implement Continuous Integration and Delivery of Apache
Spark Applications using AWS
http://amzn.to/2vHIwBq
http://amzn.to/2i9gqZn
http://bit.ly/2qipA8h
http://amzn.to/2qpiFaK
http://amzn.to/2lpbc8p
Takeaways
https://aws.amazon.com/blogs/big-data/
https://aws.amazon.com/answers/big-data/
http://amzn.to/2gIJcj8