© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Matt Yanchyshyn, Sr. Manager Solutions Architecture
June 17th, 2015
AWS Deep Dive
Big Data Analytics and Business Intelligence
Analytics and BI on AWS
Services: Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon RDS (Aurora), AWS Lambda, KCL apps, Amazon EMR, Amazon Redshift, Amazon Machine Learning
Stages: Collect → Store → Process → Analyze
[Diagram: data collection and storage, data processing, event processing, and data analysis mapped across the services above]
Batch processing
GBs of logs pushed to Amazon S3 hourly
Daily Amazon EMR cluster using Hive to process data
Input and output stored in Amazon S3
Load subset into Amazon Redshift
Reporting
[Diagram: Amazon S3 log bucket → Amazon EMR (structured log data) → Amazon Redshift → operational reports]
Streaming data processing
TBs of logs sent daily
Logs stored in Amazon Kinesis
Consumers: Amazon Kinesis Client Library, AWS Lambda, Amazon EMR, Amazon EC2
Interactive query
TBs of logs sent daily
Logs stored in Amazon S3
Amazon EMR clusters with a Hive metastore on Amazon EMR
Batch predictions
Structured data in Amazon Redshift
Query for predictions with the Amazon ML batch API
Predictions land in S3
Load predictions into Amazon Redshift, or read prediction results directly from S3
Your application consumes the results
Real-time predictions
Your application → Amazon DynamoDB → trigger event with Lambda → query for predictions with the Amazon ML real-time API
Amazon Machine Learning
Easy-to-use, managed machine learning service built for developers
Create models using data stored in AWS
Deploy models to production in seconds
Powerful machine learning technology
Based on Amazon’s battle-hardened internal systems
Not just the algorithms:
Smart data transformations
Input data and model quality alerts
Built-in industry best practices
Grows with your needs
Train on up to 100 GB of data
Generate billions of predictions
Obtain predictions in batches or real-time
Pay-as-you-go and inexpensive
Data analysis, model training, and evaluation: $0.42 per instance hour
Batch predictions: $0.10 per 1,000
Real-time predictions: $0.10 per 1,000, plus an hourly capacity reservation charge
Building smart applications with Amazon ML
1. Build and train model → 2. Evaluate and optimize → 3. Retrieve predictions
Create a Datasource object
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> ds = ml.create_data_source_from_s3(
        data_source_id='my_datasource',
        data_spec={
            'DataLocationS3': 's3://bucket/input/',
            'DataSchemaLocationS3': 's3://bucket/input/.schema'},
        compute_statistics=True)
Explore and understand your data
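To see what was computed, the same boto connection can fetch the datasource's description once it is ready — a minimal sketch, assuming the 'my_datasource' ID from above and that the response fields shown match the 2015-era API:

>>> import boto
>>> ml = boto.connect_machinelearning()
>>> # verbose=True asks for the full description of the datasource
>>> ds = ml.get_data_source('my_datasource', verbose=True)
>>> ds['Status']                              # PENDING -> INPROGRESS -> COMPLETED
>>> ds['NumberOfFiles'], ds['DataSizeInBytes']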
Train your model
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> model = ml.create_ml_model(
        ml_model_id='my_model',
        ml_model_type='REGRESSION',
        training_data_source_id='my_datasource')
Building smart applications with Amazon ML
1. Build and train model → 2. Evaluate and optimize → 3. Retrieve predictions
Explore model quality
Fine-tune model interpretation
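One hedged way to do this from code (the console shows the same information graphically): create an evaluation against a held-out datasource, read its performance metrics, and, for binary models, adjust the score threshold. The IDs are hypothetical, and the exact metric keys (e.g. RegressionRMSE for regression, BinaryAUC for binary models) are assumptions based on the Amazon ML API of the time:

>>> import boto
>>> ml = boto.connect_machinelearning()
>>> ml.create_evaluation(
        evaluation_id='my_evaluation',
        ml_model_id='my_model',
        evaluation_data_source_id='my_heldout_datasource')  # held-out data, not the training datasource
>>> ev = ml.get_evaluation('my_evaluation')
>>> ev['PerformanceMetrics']['Properties']  # e.g. {'RegressionRMSE': '...'}
>>> # For BINARY models, fine-tune interpretation by moving the score threshold:
>>> ml.update_ml_model(ml_model_id='my_model', score_threshold=0.75)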
Building smart applications with Amazon ML
1. Build and train model → 2. Evaluate and optimize → 3. Retrieve predictions
Batch predictions
Asynchronous, large-volume prediction generation
Request through service console or API
Best for applications that deal with batches of data records
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> model = ml.create_batch_prediction(
        batch_prediction_id='my_batch_prediction',
        batch_prediction_data_source_id='my_datasource',
        ml_model_id='my_model',
        output_uri='s3://examplebucket/output/')
Real-time predictions
Synchronous, low-latency, high-throughput prediction generation
Request through the service API or the server-side and mobile SDKs
Best for interactive applications that deal with individual data records
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> ml.predict(
        ml_model_id='my_model',
        predict_endpoint='example_endpoint',
        record={'key1': 'value1', 'key2': 'value2'})
{
    'Prediction': {
        'predictedValue': 13.284348,
        'details': {
            'Algorithm': 'SGD',
            'PredictiveModelType': 'REGRESSION'
        }
    }
}
Amazon Elastic MapReduce (EMR)
Why Amazon EMR?
Easy to use: launch a cluster in minutes
Low cost: pay an hourly rate
Elastic: easily add or remove capacity
Reliable: spend less time monitoring
Secure: manage firewalls
Flexible: control the cluster
The Hadoop ecosystem can run in Amazon EMR
Try different configurations to find your optimal architecture
Choose your instance types
CPU: c3 family, cc1.4xlarge, cc2.8xlarge
Memory: m2 family, r3 family
Disk/IO: d2 family, i2 family
General: m1 family, m3 family
Workloads to match: batch processing, machine learning, Spark and interactive, large HDFS
Resizable clusters
Easy to add or remove compute capacity on your cluster
Match compute demands with cluster sizing
Easy to use Spot Instances
Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing
On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity
Meet SLA at predictable cost; exceed SLA at lower cost
Amazon S3 as your persistent data store
Separate compute and storage
Resize and shut down Amazon EMR clusters with no data loss
Point multiple Amazon EMR clusters at the same data in Amazon S3
[Diagram: multiple Amazon EMR clusters reading the same data in Amazon S3]
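A minimal boto sketch of the pattern — the bucket, key pair, AMI version, and mapper script are hypothetical, and any number of clusters could be pointed at the same S3 input:

>>> import boto.emr
>>> from boto.emr.step import StreamingStep
>>> conn = boto.emr.connect_to_region('us-east-1')
>>> step = StreamingStep(
        name='Count events in shared S3 logs',
        mapper='s3://examplebucket/scripts/mapper.py',
        reducer='aggregate',                         # built-in Hadoop streaming aggregator
        input='s3://examplebucket/logs/',            # shared, persistent input in S3
        output='s3://examplebucket/output/run1/')    # results also persist in S3
>>> cluster_id = conn.run_jobflow(
        name='log-processing',
        log_uri='s3://examplebucket/emr-logs/',
        ami_version='3.8.0',                         # assumption: pick a current release
        ec2_keyname='my-key',
        master_instance_type='m3.xlarge',
        slave_instance_type='m3.xlarge',
        num_instances=3,
        steps=[step])

Because nothing lives on the cluster's HDFS, the cluster can be resized or terminated after the step completes with no data loss.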
EMRFS makes it easier to use Amazon S3
Read-after-write consistency
Very fast list operations
Error handling options
Support for Amazon S3 encryption
Transparent to applications: s3://
EMRFS client-side encryption
[Diagram: EMRFS clients with Amazon S3 client-side encryption enabled write client-side encrypted objects to Amazon S3, alongside other Amazon S3 encryption clients; keys come from a key vendor (AWS KMS or your custom key vendor)]
HDFS is still there if you need it
Iterative workloads
• If you’re processing the same dataset more than once
Disk I/O intensive workloads
Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing
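For illustration, a hedged sketch of adding an S3DistCp step with boto, reusing the connection and cluster ID from the earlier sketch; the jar path below was the AMI 3.x location and varies by EMR release:

>>> from boto.emr.step import JarStep
>>> copy_step = JarStep(
        name='Copy logs from S3 to HDFS',
        jar='/home/hadoop/lib/emr-s3distcp-1.0.jar',  # assumption: path differs across AMIs
        step_args=['--src', 's3://examplebucket/logs/',
                   '--dest', 'hdfs:///local-logs/'])
>>> conn.add_jobflow_steps(cluster_id, [copy_step])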
Amazon Redshift
Amazon Redshift Architecture
Leader Node
• SQL endpoint
• Stores metadata
• Coordinates query execution
Compute Nodes
• Execute queries in parallel
• Node types to match your workload: Dense Storage (DS2) or Dense Compute (DC1)
• Divided into multiple slices
• Local, columnar storage
[Diagram: JDBC/ODBC clients in the customer VPC connect to the leader node; compute nodes communicate over 10 GigE (HPC) in an internal VPC and handle ingestion, backup, and restore]
Amazon Redshift
Column storage
Data compression
Zone maps
Direct-attached storage
With column storage, you only read the data you need
ID  | Age | State | Amount
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
Amazon Redshift
Column storage
Data compression
Zone maps
Direct-attached storage
• COPY compresses automatically
• You can analyze and override
• More performance, less cost
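Since Redshift speaks the PostgreSQL wire protocol, a standard Python driver such as psycopg2 can issue the COPY. A minimal sketch — the endpoint, credentials, and bucket are placeholders, and COMPUPDATE ON simply makes the automatic encoding analysis explicit:

>>> import psycopg2
>>> conn = psycopg2.connect(
        host='mycluster.example.us-east-1.redshift.amazonaws.com',  # placeholder endpoint
        port=5439, dbname='mydb', user='admin', password='...')
>>> cur = conn.cursor()
>>> cur.execute("""
        COPY listing
        FROM 's3://examplebucket/listing/'
        CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
        DELIMITER '|'
        COMPUPDATE ON;""")
>>> conn.commit()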
Amazon Redshift
Column storage
Data compression
Zone maps
Direct-attached storage
• Track the minimum and maximum value for each block
• Skip over blocks that don’t contain relevant data
[Diagram: three sorted blocks holding values 10…324, 375…623, and 637…959; the zone map records each block’s min and max (10/324, 375/623, 637/959) so blocks outside a query’s range are skipped]
Amazon Redshift
Column storage
Data compression
Zone maps
Direct-attached storage
• Local storage for performance
• High scan rates
• Automatic replication
• Continuous backup and streaming restores to/from Amazon S3
• User snapshots on demand
• Cross-region backups for disaster recovery
Amazon Redshift online resize
Continue querying during resize
New cluster deployed in the background at no extra cost
Data copied in parallel from node to node
Automatic SQL endpoint switchover via DNS
Amazon Redshift works with existing data models
Star and snowflake schemas
Amazon Redshift data distribution
Distribution Key: same key to same location
All: all data on every node
Even: round-robin distribution
[Diagram: two-node, four-slice clusters illustrating each distribution style]
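As a sketch of what each style looks like in DDL (reusing the psycopg2 cursor from the COPY example; table and column names are hypothetical):

>>> cur.execute("""
        CREATE TABLE sales (               -- large fact table: distribute on the join key
            sale_id     INTEGER,
            customer_id INTEGER DISTKEY,
            amount      DECIMAL(10,2));""")
>>> cur.execute("""
        CREATE TABLE calendar (            -- small, slowly changing dimension: copy everywhere
            date_id INTEGER,
            day     DATE)
        DISTSTYLE ALL;""")
>>> cur.execute("""
        CREATE TABLE raw_events (          -- no common join key: round robin
            payload VARCHAR(4096))
        DISTSTYLE EVEN;""")
>>> conn.commit()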
Sorting data in Amazon Redshift
In the slices (on disk), the data is sorted by a sort key
Choose a sort key that is frequently used in your queries
Data in columns is marked with a min/max value, so Redshift can skip blocks not relevant to the query
A good sort key lets Redshift avoid reading entire blocks based on query predicates
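For example (a hedged sketch with hypothetical names), a table sorted on a timestamp lets queries over recent data skip the blocks holding older rows:

>>> cur.execute("""
        CREATE TABLE events (
            event_time TIMESTAMP SORTKEY,  -- frequent range predicate in queries
            user_id    INTEGER,
            action     VARCHAR(64));""")
>>> # The predicate below can be answered by reading only blocks whose
>>> # zone-map range overlaps June 2015:
>>> cur.execute("SELECT COUNT(*) FROM events WHERE event_time >= '2015-06-01';")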
User Defined Functions
Python 2.7
PostgreSQL UDF syntax
System and network calls within UDFs are prohibited
Pandas, NumPy, and SciPy pre-installed
Import your own
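A sketch of what a scalar Python UDF looks like under the announced syntax — hedged: the function and column names are hypothetical, and the feature was still rolling out when this deck was written:

>>> cur.execute("""
        CREATE FUNCTION f_haversine_km (lat1 FLOAT, lon1 FLOAT, lat2 FLOAT, lon2 FLOAT)
        RETURNS FLOAT STABLE AS $$
            # the body is ordinary Python 2.7, executed inside Redshift
            from math import radians, sin, cos, asin, sqrt
            dlat = radians(lat2 - lat1)
            dlon = radians(lon2 - lon1)
            a = sin(dlat/2)**2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon/2)**2
            return 12742.0 * asin(sqrt(a))  # Earth's diameter in km
        $$ LANGUAGE plpythonu;""")
>>> cur.execute("SELECT f_haversine_km(47.6, -122.3, 40.7, -74.0);")  # Seattle to NYC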
Interleaved Multi-Column Sort
Currently supports compound sort keys
• Optimized for applications that filter data by one leading column
Adding support for interleaved sort keys
• Optimized for filtering data by up to eight columns
• No storage overhead, unlike an index
• Lower maintenance penalty compared to indexes
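In DDL terms, the difference is a single keyword — a hedged sketch with hypothetical columns, giving equal zone-map weight to both filter columns instead of privileging the leading one:

>>> cur.execute("""
        CREATE TABLE clickstream (
            user_id    INTEGER,
            event_date DATE,
            url        VARCHAR(256))
        INTERLEAVED SORTKEY (user_id, event_date);""")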
Amazon Redshift works with your existing analysis tools
[Diagram: BI tools connect to Amazon Redshift over JDBC/ODBC]
Questions?
AWS Summit – Chicago: An exciting, free cloud conference designed to educate and inform new customers about the AWS platform, best practices, and new cloud services.
Details
• July 1, 2015
• Chicago, Illinois
• @ McCormick Place
Featuring
• New product launches
• 36+ sessions, labs, and bootcamps
• Executive and partner networking
Registration is now open
• Come and see what AWS and the cloud can do for you.
• Click here to register: http://amzn.to/1RooPPL
Editor’s notes
  1. CloudFront logs arrive out of order.
  2. EMR example #3: EMR for ETL and query engine for investigations which require all raw data
  3. EMR example #3: EMR for ETL and query engine for investigations which require all raw data
  4. Example: my application periodically receives new product descriptions and needs to classify them into categories – for example, assigning a genre to movies based on the movie metadata. Example: my application aggregates user activity over a time period, and then we call the prediction API to decide which users will need follow-up, e.g., business-development follow-up with customers in the free tier.
  5. This is an example of consuming real-time predictions. Can also add an example of consuming batch predictions with EMR.
  6. Today, we have announced Amazon ML, the newest addition to the Amazon Web Services family. Amazon ML is easy to use, and intended for developers – people who are already most connected and familiar with data instrumentation, pipelines, and storage. Amazon ML is based on the same robust ML technology that is already used within Amazon’s internal systems, generating billions of predictions weekly. Amazon ML is built to make it simple and reliable to use the data that you are already storing in the AWS cloud, in products like Amazon S3, Amazon Redshift, and Amazon RDS. And lastly, Amazon ML is built to eliminate the gap between having models and using these models to build smart applications. Production deployment is only a click away – and sometimes you won’t even need that one click.
  7. Next, let’s talk about technology. Amazon ML is based on the same ML technology that has long been deployed within Amazon, and is used to generate tens of billions of predictions weekly. When I say technology, I mean not just the learning algorithm – which is important, but by no means the only part of an ML system. Amazon Machine Learning comes with technology that suggests data transformations, based on your data, that will improve the model’s quality – and you are able to use these transformations as they are, or adjust the transformation instructions without writing any data-transforming code. Amazon ML also includes functionality to provide alerts when known pitfalls are encountered in your input data – for example, when some attributes have many missing values, or when the model was evaluated with data that is significantly different from what was used to train it. This – ensuring that the model is evaluated on a fair dataset – is an example of an industry best practice that we have built into the product, among many others, all around the goal of making the resulting models more powerful. Finally, a word about speed and scale. Amazon ML can create models from up to 100 GB of data. You can use it to generate billions of predictions, and obtain them in batches or in real time. I will get to the prediction interfaces soon.
  8. Create a Datasource object pointing to your data. Explore and understand your data. Transform data and train your model.
  9. Understand model quality. Adjust model interpretation.
  10. Batch predictions. Real-time predictions.
  11. Six main reasons why Amazon EMR
  12. Amazon EMR is more than just MapReduce. Bootstrap actions available on GitHub
  13. In the next few slides, we’ll talk about data persistence models with Amazon EMR. The first pattern is Amazon S3 as HDFS. With this data persistence model, data gets stored on Amazon S3; HDFS does not play any role in storing data. As a matter of fact, HDFS is only there for temporary storage. Another common thing I hear is that storing data on Amazon S3 instead of HDFS slows my job down a lot because data has to get copied to the HDFS/disk first before processing starts. That’s incorrect. If you tell Hadoop that your data is on Amazon S3, Hadoop reads directly from Amazon S3 and streams data to mappers without touching the disk. To be completely accurate, data does touch HDFS when it has to shuffle from mappers to reducers, but as I mentioned, HDFS acts as the temp space and nothing more. EMRFS is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3. EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like Amazon S3 server-side encryption, read-after-write consistency, and list consistency.
  14. And every other feature that comes with Amazon S3 – features such as SSE, lifecycle policies, etc. And again, keep in mind that Amazon S3 as the storage layer is the main reason why we can build elastic clusters where nodes get added and removed dynamically without any data loss.
  15. In the next few slides, we’ll talk about data persistence models with EMR. The first pattern is Amazon S3 as HDFS. With this data persistence model, data gets stored on Amazon S3; HDFS does not play any role in storing data. As a matter of fact, HDFS is only there for temporary storage. Another common thing I hear is that storing data on Amazon S3 instead of HDFS slows my job down a lot because data has to get copied to HDFS/disk first before processing starts. That’s incorrect. If you tell Hadoop that your data is on Amazon S3, Hadoop reads directly from Amazon S3 and streams data to mappers without touching the disk. To be completely accurate, data does touch HDFS when it has to shuffle from mappers to reducers, but as I mentioned, HDFS acts as the temp space and nothing more. EMRFS is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3. EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like Amazon S3 server-side encryption, read-after-write consistency, and list consistency.
  16. Amazon Redshift parallelizes and distributes everything Load in parallel from Amazon S3, DynamoDB, EMR or an SSH host Data automatically distributed and sorted according to DDL Scales linearly with the number of nodes in the cluster
  17. With row storage you do unnecessary I/O. To get the total amount, you have to read everything.
  18. Backups to Amazon S3 are automatic, continuous, and incremental. Configurable system snapshot retention period. Take user snapshots on demand. Cross-region backups for disaster recovery. Streaming restores enable you to resume querying faster.
  19. Choose a distribution style of KEY for large data tables, like a fact table in a star schema, and for large or rapidly changing tables used in joins or aggregations; performance improves even if the key is not used in the join column. Choose a distribution style of ALL for tables that have slowly changing data, are of reasonable size (a few million rows, but not hundreds of millions), and have no common distribution key for frequent joins – the typical use case is a joined dimension table without a common distribution key. Choose a distribution style of EVEN for tables that are not joined and have no aggregate queries.
  20. Redshift stores column data in blocks. For the sort key, the data blocks are “marked” with the min and max value of the column, allowing Redshift to skip reading blocks that are not relevant to the current query. In the slices (on disk), the data is sorted by a sort key; if no sort key exists, Redshift uses the data insertion order. Choose a sort key that is frequently used in your queries, as a query predicate (date, identifier, …) or as a join parameter (it can also be the hash key). The sort key allows Redshift to avoid reading entire blocks based on predicates. For example, a table with a timestamp sort key where only recent data is accessed will skip blocks containing “old” data.
  21. We’re enabling User Defined Functions (UDFs) so you can add your own; scalar and aggregate functions are supported. You’ll be able to write UDFs using Python 2.7. Syntax is largely identical to PostgreSQL UDF syntax. System and network calls within UDFs are prohibited. Comes with Pandas, NumPy, and SciPy pre-installed; you’ll also be able to import your own libraries for even more flexibility.
  22. Records in Redshift are stored in blocks. For this illustration, let’s assume that four records fill a block. With a compound sort key, records with a given cust_id are all in one block; however, records with a given prod_id are spread across four blocks. With an interleaved sort key, records with a given cust_id are spread across two blocks and records with a given prod_id are also spread across two blocks – data is sorted in equal measures for both keys.
  23. Redshift works with the customer’s BI tool of choice through PostgreSQL drivers and JDBC/ODBC connections. A number of partners shown here have certified integration with Redshift, meaning they have done testing to validate/build Redshift integration and make using Redshift easy from a UI perspective. If there are tools customers use that are not shown, we can work with Redshift on getting them integrated. Custom ODBC and JDBC drivers: up to 35% higher performance than open-source drivers. Supported by Informatica, MicroStrategy, Pentaho, Qlik, SAS, Tableau. Will continue to support PostgreSQL open-source drivers. Download drivers from the console.