© 2012 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified or distributed in whole or in part without the express consent of Amazon.com, Inc.
EMR is Hadoop in the Cloud

Hadoop is an open-source framework for processing huge amounts of data
in parallel on a cluster of machines
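
To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases that Hadoop MapReduce distributes across a cluster. It does not use Hadoop at all; the function names and the sample lines are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input; on a real cluster each line could be handled by a different node.
counts = word_count(["big data in the cloud", "the cloud runs hadoop"])
print(counts)
```

On a cluster, the map calls run in parallel on different machines and the framework performs the shuffle over the network; the sketch only shows the data flow.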

• Put the data into Amazon Simple Storage Service (S3)
• Choose: Hadoop distribution, number of nodes, types of nodes, custom configs, Hive/Pig/etc.
• Launch the EMR cluster using the EMR console, CLI, SDK, or APIs
• Get the output from S3
• You can also store everything in HDFS
EMR Cluster

[Diagram: Amazon S3 and the EMR cluster]
You can easily add and remove nodes

[Diagram: Amazon S3 and the EMR cluster]
When processing is complete, you can terminate the cluster (and stop paying)
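
The cost argument comes down to simple arithmetic: nodes times hours times price. The sketch below assumes a flat, made-up hourly rate per node and ignores billing granularity and storage charges; none of the numbers come from the presentation.

```python
def cluster_cost(num_nodes: int, hours_running: float, hourly_rate_per_node: float) -> float:
    """Estimate cost as nodes x hours x rate, assuming a flat hourly rate."""
    return num_nodes * hours_running * hourly_rate_per_node


HYPOTHETICAL_RATE = 0.50  # made-up price per node-hour, for illustration only

# A 5-node cluster left running all day versus one terminated after a 2-hour job.
always_on = cluster_cost(5, 24, HYPOTHETICAL_RATE)
terminated_after_job = cluster_cost(5, 2, HYPOTHETICAL_RATE)
print(always_on, terminated_after_job)
```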

options




Hive
• Data Warehouse for Hadoop
• SQL-like query language (HiveQL)
• Initially developed at Facebook

Pig
• High-level programming language (Pig Latin)
• Supports UDFs
• Ideal for data flow/ETL
HBase
• Column-oriented database
• Runs on top of HDFS
• Ideal for sparse data
• Random, read/write access
• Ideal for very large tables (billions of rows, millions of columns)

Mahout
• Machine learning library
• Supports recommendation mining, clustering, classification, and frequent itemset mining
Ganglia
• Scalable distributed monitoring
• View performance of the cluster and individual nodes
• Open source

R
• Language and software environment for statistical computing and graphics
• Open source

Hadoop


 # Launch a 5-node cluster; --alive keeps it running after its steps finish
 elastic-mapreduce --create --alive \
 --instance-type m1.xlarge \
 --num-instances 5
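
When clusters are launched from a script, it can help to assemble the command from parameters. The sketch below builds the same argument list as the launch command above and renders it as a shell-safe string. It uses only the flags shown on the slide, does not execute anything, and the helper name `build_create_command` is hypothetical.

```python
import shlex
from typing import List


def build_create_command(instance_type: str, num_instances: int, keep_alive: bool = True) -> List[str]:
    """Return the argv for an `elastic-mapreduce --create` call (flags taken from the slide)."""
    argv = ["elastic-mapreduce", "--create"]
    if keep_alive:
        argv.append("--alive")
    argv += ["--instance-type", instance_type, "--num-instances", str(num_instances)]
    return argv


argv = build_create_command("m1.xlarge", 5)
command_line = shlex.join(argv)  # quoting is added only where a token needs it
print(command_line)
```

Passing the list form to a process launcher, rather than a concatenated string, avoids quoting problems with values such as cluster names that contain spaces.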




Hive


 ./elastic-mapreduce --create --alive \
 --name "Test Hive" \
 --hadoop-version 0.20 \
 --num-instances 5 \
 --instance-type m1.large \
 --hive-interactive \
 --hive-versions 0.7.1

HBase


 elastic-mapreduce --create --hbase \
 --name "$USER HBase Cluster" \
 --num-instances 2 \
 --instance-type cc2.8xlarge




bootstrap action



 elastic-mapreduce --create \
 --bootstrap-action s3://s3bucket/installganglia




[Diagram labels: Data Exchange, Data Masking, Data Quality, MDM, Data Transformation, Enterprise Data Integration, Identity Resolution, Connectivity]
HParser UI

Real-world data
- any format
- any complexity
- easily
- in MapReduce

[Diagram: source -> Hadoop (three map tasks "M", one reduce task "R") -> results]
End-to-End Flow

Construction (Windows): binary records (in) -> HParser UI -> text records (out), with a transform definition
Execution (EMR): input -> Map -> Reduce -> output
Real-World Data

[Diagram: the inputs below pass through HParser and come out as records]
• Flat files
• Logs
• XML, JSON
• Industry standards (e.g. FIX, SWIFT, X12, ASN.1)
• Documents (e.g. PDF, Excel)

ASN.1 on EMR Cluster

[Chart: mapper time in minutes (y-axis, 0 to 60) against number of nodes (x-axis: 4, 16, 24, 32), with one series each for 10 GB and 50 GB of input; data points not recoverable]

Notes:
- These are mapper times only. Add 60 sec lead time (Start) and 60 sec tail time (Reducer) for each run
- Amazon XL (Extra Large) instances – 64-bit, 15GB RAM, 1.5TB Storage
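
The first note implies a fixed two-minute overhead on top of every charted value. A small sketch of that adjustment follows; the 20-minute input is a hypothetical reading, not a value taken from the chart.

```python
LEAD_SECONDS = 60  # start-up lead time stated in the notes
TAIL_SECONDS = 60  # reducer tail time stated in the notes


def total_run_minutes(mapper_minutes: float) -> float:
    """Add the fixed lead and tail overhead to a charted mapper time."""
    return mapper_minutes + (LEAD_SECONDS + TAIL_SECONDS) / 60


example_total = total_run_minutes(20)  # hypothetical 20-minute mapper time
print(example_total)
```

Because the overhead is constant, it matters most for the short runs: it is a tenth of a 20-minute run but two thirds of a 3-minute one.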
ASN.1 on EMR Cluster – 72 Nodes

[Chart: mapper time in minutes (y-axis, 0 to 60) against file size (x-axis: 10 GB, 100 GB, 400 GB, 700 GB, 1 TB); data points not recoverable]

Notes:
- These are mapper times only. Add 60 sec lead time (Start) and 60 sec tail time (Reducer) for each run
- Amazon XL (Extra Large) instances – 64-bit, 15GB RAM, 1.5TB Storage

                        Batch processing     Interactive analysis       Stream processing
  Query runtime         Minutes to hours     Milliseconds to minutes    Never-ending
  Data volume           TBs to PBs           GBs to PBs                 Continuous stream
  Programming model     MapReduce            Queries                    DAG
  Users                 Developers           Analysts and developers    Developers
  Google project        MapReduce            Dremel
  Open source project   Hadoop MapReduce                                Storm and S4

Introducing Apache Drill…
Avro IDL
enum Gender {
  MALE, FEMALE
}
record User {
  string name;
  Gender gender;
  long followers;
}

JSON
{
  "name": "Srivas",
  "gender": "Male",
  "followers": 100
}
{
  "name": "Raina",
  "gender": "Female",
  "followers": 200,
  "zip": "94305"
}
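
The slide appears to contrast a declared schema with self-describing records: the second JSON record carries a "zip" field that the User record does not declare. The sketch below, using only the Python standard library, loads both records and reports fields that fall outside the declared set. The helper name `extra_fields` is an assumption, and no Avro library is involved.

```python
import json

# Field names declared in the User record on the slide.
SCHEMA_FIELDS = {"name", "gender", "followers"}

# The two JSON records from the slide, as text.
records_json = [
    '{"name": "Srivas", "gender": "Male", "followers": 100}',
    '{"name": "Raina", "gender": "Female", "followers": 200, "zip": "94305"}',
]


def extra_fields(record: dict) -> set:
    """Return the keys present in a record but absent from the declared schema."""
    return set(record) - SCHEMA_FIELDS


records = [json.loads(text) for text in records_json]
for record in records:
    print(record["name"], sorted(extra_fields(record)))
```

A schema-less reader can accept the extra field as it arrives, whereas a reader bound to the fixed record definition would have to drop or reject it.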
Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
   – Column-based and row-based
   – Schema and schema-less
• Pluggable data sources

Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages

Dependable
• No SPOF
• Instant recovery from crashes

Fast
• C/C++ core with Java support
   – Google C++ style guide
• Min latency and max throughput (limited only by hardware)




No RegionServers            Instant Recovery    High Throughput
No Manual Splits            No Compactions      No Garbage Collection
No Manual Merges            Snapshots           Consistent Low Latency
No Manual Administration    Mirroring           No Practical Scale Limits

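[Diagram: three storage stacks, each listed top to bottom]
• Other Distributions: HBase, JVM, DFS, JVM, ext3, Disks
• Second stack: HBase, JVM, MapR, Disks
• Third stack: Unified, Disks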

50B real-time auctions
#1 in audience reach
“M7 is really taking Hadoop to the next level. It allows us to do new things with our data.” - Jan Gelin, VP of Technical Operations

2M+ subscribers
10B+ records
“I’m really excited about M7 because it will address both the performance and the day-to-day challenges of HBase.” - Melinda Graham, Sr. Hadoop Engineer

Global leader in email intelligence
“M7 is a big win for us. It makes HBase really easy to use. It really helps us make better use of the data we have. It allows us to look at use cases we haven't had the opportunity to in the past.” - Andy Sautins, CTO

aws.amazon.com/elasticmapreduce
• Online Training
   – Videos
   – Articles/tutorials
• Documentation
   – Getting Started Guide
   – Developer Guide
   – API Reference
• FAQs
• Paid Training
   – 3-day Developer Course
     taught by Think Big Analytics
• On-Site Consulting
   – EMR Bootcamp (for companies processing 1+ TB per day)
We are sincerely eager to hear your feedback on this presentation and on re:Invent.

Please fill out an evaluation form when you have a chance.


© 2012 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified or distributed in whole or in part without the express consent of Amazon.com, Inc.

Weitere ähnliche Inhalte

Andere mochten auch

DAT103 Introducing Amazon RedShift - AWS re: Invent 2012
DAT103 Introducing Amazon RedShift - AWS re: Invent 2012DAT103 Introducing Amazon RedShift - AWS re: Invent 2012
DAT103 Introducing Amazon RedShift - AWS re: Invent 2012Amazon Web Services
 
BDT205 Solving Big Problems with Big Data - AWS re: Invent 2012
BDT205 Solving Big Problems with Big Data - AWS re: Invent 2012BDT205 Solving Big Problems with Big Data - AWS re: Invent 2012
BDT205 Solving Big Problems with Big Data - AWS re: Invent 2012Amazon Web Services
 
GMG204 TinyCo’s Best Practices for Developing, Scaling, and Monetizing Games ...
GMG204 TinyCo’s Best Practices for Developing, Scaling, and Monetizing Games ...GMG204 TinyCo’s Best Practices for Developing, Scaling, and Monetizing Games ...
GMG204 TinyCo’s Best Practices for Developing, Scaling, and Monetizing Games ...Amazon Web Services
 
Smartronix - Building Secure Applications on the AWS Cloud
Smartronix - Building Secure Applications on the AWS CloudSmartronix - Building Secure Applications on the AWS Cloud
Smartronix - Building Secure Applications on the AWS CloudAmazon Web Services
 
AWS for Start-ups - Leveraging AWS for the Lean Development Cycle
AWS for Start-ups  - Leveraging AWS for the Lean Development CycleAWS for Start-ups  - Leveraging AWS for the Lean Development Cycle
AWS for Start-ups - Leveraging AWS for the Lean Development CycleAmazon Web Services
 
MED203 Scalable Media Processing - AWS re: Invent 2012
MED203 Scalable Media Processing - AWS re: Invent 2012MED203 Scalable Media Processing - AWS re: Invent 2012
MED203 Scalable Media Processing - AWS re: Invent 2012Amazon Web Services
 
Security and Privacy in the Cloud - Stephen Schmidt - AWS Summit 2012 Australia
Security and Privacy in the Cloud - Stephen Schmidt - AWS Summit 2012 AustraliaSecurity and Privacy in the Cloud - Stephen Schmidt - AWS Summit 2012 Australia
Security and Privacy in the Cloud - Stephen Schmidt - AWS Summit 2012 AustraliaAmazon Web Services
 
Aws for the Retail Industry, Webinar, September 2012
Aws for the Retail Industry, Webinar, September 2012Aws for the Retail Industry, Webinar, September 2012
Aws for the Retail Industry, Webinar, September 2012Amazon Web Services
 
ENT103 Making the Case for Cloud - AWS re: Invent 2012
ENT103 Making the Case for Cloud - AWS re: Invent 2012ENT103 Making the Case for Cloud - AWS re: Invent 2012
ENT103 Making the Case for Cloud - AWS re: Invent 2012Amazon Web Services
 
Journey through the Cloud - Best Practices Getting Started in the AWS Cloud
Journey through the Cloud - Best Practices Getting Started in the AWS CloudJourney through the Cloud - Best Practices Getting Started in the AWS Cloud
Journey through the Cloud - Best Practices Getting Started in the AWS CloudAmazon Web Services
 
Big Data Analytics with AWS and AWS Marketplace Webinar
Big Data Analytics with AWS and AWS Marketplace WebinarBig Data Analytics with AWS and AWS Marketplace Webinar
Big Data Analytics with AWS and AWS Marketplace WebinarAmazon Web Services
 
AWS Customer Presentation: EyeEm.com - Berlin Summit 2012
AWS Customer Presentation: EyeEm.com - Berlin Summit 2012AWS Customer Presentation: EyeEm.com - Berlin Summit 2012
AWS Customer Presentation: EyeEm.com - Berlin Summit 2012Amazon Web Services
 

Andere mochten auch (12)

DAT103 Introducing Amazon RedShift - AWS re: Invent 2012
DAT103 Introducing Amazon RedShift - AWS re: Invent 2012DAT103 Introducing Amazon RedShift - AWS re: Invent 2012
DAT103 Introducing Amazon RedShift - AWS re: Invent 2012
 
BDT205 Solving Big Problems with Big Data - AWS re: Invent 2012
BDT205 Solving Big Problems with Big Data - AWS re: Invent 2012BDT205 Solving Big Problems with Big Data - AWS re: Invent 2012
BDT205 Solving Big Problems with Big Data - AWS re: Invent 2012
 
GMG204 TinyCo’s Best Practices for Developing, Scaling, and Monetizing Games ...
GMG204 TinyCo’s Best Practices for Developing, Scaling, and Monetizing Games ...GMG204 TinyCo’s Best Practices for Developing, Scaling, and Monetizing Games ...
GMG204 TinyCo’s Best Practices for Developing, Scaling, and Monetizing Games ...
 
Smartronix - Building Secure Applications on the AWS Cloud
Smartronix - Building Secure Applications on the AWS CloudSmartronix - Building Secure Applications on the AWS Cloud
Smartronix - Building Secure Applications on the AWS Cloud
 
AWS for Start-ups - Leveraging AWS for the Lean Development Cycle
AWS for Start-ups  - Leveraging AWS for the Lean Development CycleAWS for Start-ups  - Leveraging AWS for the Lean Development Cycle
AWS for Start-ups - Leveraging AWS for the Lean Development Cycle
 
MED203 Scalable Media Processing - AWS re: Invent 2012
MED203 Scalable Media Processing - AWS re: Invent 2012MED203 Scalable Media Processing - AWS re: Invent 2012
MED203 Scalable Media Processing - AWS re: Invent 2012
 
Security and Privacy in the Cloud - Stephen Schmidt - AWS Summit 2012 Australia
Security and Privacy in the Cloud - Stephen Schmidt - AWS Summit 2012 AustraliaSecurity and Privacy in the Cloud - Stephen Schmidt - AWS Summit 2012 Australia
Security and Privacy in the Cloud - Stephen Schmidt - AWS Summit 2012 Australia
 
Aws for the Retail Industry, Webinar, September 2012
Aws for the Retail Industry, Webinar, September 2012Aws for the Retail Industry, Webinar, September 2012
Aws for the Retail Industry, Webinar, September 2012
 
ENT103 Making the Case for Cloud - AWS re: Invent 2012
ENT103 Making the Case for Cloud - AWS re: Invent 2012ENT103 Making the Case for Cloud - AWS re: Invent 2012
ENT103 Making the Case for Cloud - AWS re: Invent 2012
 
Journey through the Cloud - Best Practices Getting Started in the AWS Cloud
Journey through the Cloud - Best Practices Getting Started in the AWS CloudJourney through the Cloud - Best Practices Getting Started in the AWS Cloud
Journey through the Cloud - Best Practices Getting Started in the AWS Cloud
 
Big Data Analytics with AWS and AWS Marketplace Webinar
Big Data Analytics with AWS and AWS Marketplace WebinarBig Data Analytics with AWS and AWS Marketplace Webinar
Big Data Analytics with AWS and AWS Marketplace Webinar
 
AWS Customer Presentation: EyeEm.com - Berlin Summit 2012
AWS Customer Presentation: EyeEm.com - Berlin Summit 2012AWS Customer Presentation: EyeEm.com - Berlin Summit 2012
AWS Customer Presentation: EyeEm.com - Berlin Summit 2012
 

EMR is Hadoop in the Cloud

  • 1. © 2012 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • 2. EMR is Hadoop in the Cloud. Hadoop is an open-source framework for processing huge amounts of data in parallel on a cluster of machines.
  • 5. Put the data into Amazon Simple Storage Service (S3). Choose: Hadoop distribution, number of nodes, types of nodes, custom configs, Hive/Pig/etc. Launch the cluster using the EMR console, CLI, SDK, or APIs. Get the output from S3. You can also store everything in HDFS.
  • 6. You can easily add and remove nodes.
  • 7. When processing is complete, you can terminate the cluster (and stop paying).
  • 8. options
  • 10. Hive: data warehouse for Hadoop; SQL-like query language (HiveQL); initially developed at Facebook. Pig: high-level programming language (Pig Latin); supports UDFs; ideal for data flow/ETL.
  • 11. HBase: column-oriented database; runs on top of HDFS; ideal for sparse data; random read/write access; ideal for very large tables (billions of rows, millions of columns). Mahout: machine learning library; supports recommendation mining, clustering, classification, and frequent itemset mining.
  • 12. Ganglia: scalable distributed monitoring; view performance of the cluster and individual nodes; open source. R: language and software environment for statistical computing and graphics; open source.
  • 13. Hadoop: elastic-mapreduce --create --alive --instance-type m1.xlarge --num-instances 5
  • 14. Hive: ./elastic-mapreduce --create --alive --name "Test Hive" --hadoop-version 0.20 --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions 0.7.1
  • 15. HBase: elastic-mapreduce --create --hbase --name "$USER HBase Cluster" --num-instances 2 --instance-type cc2.8xlarge
  • 16. Bootstrap action: elastic-mapreduce --create --bootstrap-action s3://s3bucket/installganglia
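
The command lines on slides 13 to 16 share a common shape, which can be sketched as a small Python helper that assembles the argument list. This is only an illustration: the function name `emr_create_command` and its parameters are hypothetical, it emits only flags that appear on the slides, and it neither runs the `elastic-mapreduce` CLI nor checks that a flag combination is valid.

```python
import shlex


def emr_create_command(name=None, instance_type=None, num_instances=None,
                       alive=False, extra_flags=()):
    """Assemble an argv list for an `elastic-mapreduce --create` call.

    Hypothetical helper: it only concatenates flags shown on the slides
    and does not execute anything.
    """
    argv = ["elastic-mapreduce", "--create"]
    if alive:
        argv.append("--alive")  # keep the cluster running after launch
    if name is not None:
        argv += ["--name", name]
    if num_instances is not None:
        argv += ["--num-instances", str(num_instances)]
    if instance_type is not None:
        argv += ["--instance-type", instance_type]
    argv += list(extra_flags)  # e.g. Hive or HBase flags from the slides
    return argv


# Slide 13: plain Hadoop cluster.
hadoop_cmd = emr_create_command(alive=True, instance_type="m1.xlarge",
                                num_instances=5)

# Slide 14: interactive Hive cluster.
hive_cmd = emr_create_command(
    name="Test Hive", alive=True, instance_type="m1.large", num_instances=5,
    extra_flags=["--hadoop-version", "0.20",
                 "--hive-interactive", "--hive-versions", "0.7.1"],
)

# shlex.join quotes arguments that contain spaces, such as the cluster name.
print(shlex.join(hadoop_cmd))
print(shlex.join(hive_cmd))
```

Building the command as a list and quoting it only for display avoids shell-quoting mistakes with names that contain spaces, such as "Test Hive".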
  • 18. Hive
  • 19. Hive
  • 20. Hive
  • 21. Hive
  • 22. Hive
  • 25. Data Quality; Data Masking; Data Exchange; MDM; Data Transformation; Enterprise Data Integration; Identity Resolution; Connectivity.
  • 26. HParser UI: any format, any complexity, easily, in Map Reduce. [Diagram: real-world data source, Hadoop mappers (M) and reducer (R), results.]
  • 29. End-to-End Flow. Construction (Windows): HParser UI, transform definition. Execution (EMR): Map Reduce; input: binary records in; output: text records out.
  • 30. Real-World Data (HParser): flat files, logs, records; XML, JSON; industry standards (e.g. FIX, SWIFT, X12, ASN.1); documents (e.g. PDF, Excel).
  • 31. [Chart: ASN.1 on EMR Cluster. Y-axis: minutes (0 to 60); x-axis: nodes (4, 16, 24, 32); series: 10 GB and 50 GB.] Notes: These are mapper times only; add 60 sec lead time (start) and 60 sec tail time (reducer) for each run. Amazon XL (Extra Large) instances: 64-bit, 15 GB RAM, 1.5 TB storage.
  • 32. [Chart: ASN.1 on EMR Cluster, 72 nodes. Y-axis: minutes (0 to 60); x-axis: file size (10 GB, 100 GB, 400 GB, 700 GB, 1 TB).] Notes: These are mapper times only; add 60 sec lead time (start) and 60 sec tail time (reducer) for each run. Amazon XL (Extra Large) instances: 64-bit, 15 GB RAM, 1.5 TB storage.
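
The chart notes say the plotted values are mapper times only and that each run needs 60 seconds of lead time and 60 seconds of tail time on top. A minimal sketch of that adjustment follows; the function name is hypothetical and the 20-minute input is a made-up example, not a value read from the charts.

```python
# Overheads stated in the slide notes, in seconds.
LEAD_SECONDS = 60   # start-up before the mappers run
TAIL_SECONDS = 60   # reducer phase after the mappers finish


def total_run_minutes(mapper_minutes):
    """Convert a mapper-only time into an end-to-end run time in minutes."""
    return mapper_minutes + (LEAD_SECONDS + TAIL_SECONDS) / 60


# Hypothetical mapper time, chosen only to illustrate the arithmetic.
example_mapper_minutes = 20
print(total_run_minutes(example_mapper_minutes))
```

The fixed two-minute overhead matters most for short runs, where it is a larger share of the total.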
  • 37. Batch processing: query runtime minutes to hours; data volume TBs to PBs; programming model MapReduce; users: developers. Interactive analysis: query runtime milliseconds to minutes; data volume GBs to PBs; programming model queries; users: analysts and developers. Stream processing: query runtime never-ending; data volume continuous stream; programming model DAG; users: developers. Google projects: MapReduce (batch), Dremel (interactive). Open source projects: Hadoop MapReduce (batch), Storm and S4 (stream). Introducing Apache Drill…
  • 39. Avro IDL: enum Gender { MALE, FEMALE } record User { string name; Gender gender; long followers; } JSON: { "name": "Srivas", "gender": "Male", "followers": 100 } { "name": "Raina", "gender": "Female", "followers": 200, "zip": "94305" }
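
Slide 39 sets a fixed Avro record next to JSON records, one of which carries a field the record does not declare. A minimal sketch of that contrast in plain Python follows. It is a dictionary comparison, not Avro validation, and `USER_FIELDS` and `check_user` are hypothetical names introduced for illustration.

```python
import json

# Field names transcribed from the Avro IDL `User` record on the slide.
USER_FIELDS = ("name", "gender", "followers")


def check_user(record):
    """Return (missing, extra) field names relative to the User record."""
    missing = sorted(f for f in USER_FIELDS if f not in record)
    extra = sorted(f for f in record if f not in USER_FIELDS)
    return missing, extra


# The two JSON documents shown on the slide.
records = [
    json.loads('{ "name": "Srivas", "gender": "Male", "followers": 100 }'),
    json.loads('{ "name": "Raina", "gender": "Female", "followers": 200, '
               '"zip": "94305" }'),
]

for record in records:
    missing, extra = check_user(record)
    print(record["name"], "missing:", missing, "extra:", extra)
```

The second record has an extra `zip` field, which a schema-less reader can accept as-is while a fixed schema would have to be changed first.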
  • 41. Flexible: pluggable query languages; extensible execution engine; pluggable data formats (column-based and row-based, schema and schema-less); pluggable data sources. Easy: unzip and run; zero configuration; reverse DNS not needed; IP addresses can change; clear and concise log messages. Dependable: no SPOF; instant recovery from crashes. Fast: C/C++ core with Java support; Google C++ style guide; min latency and max throughput (limited only by hardware).
  • 42. No RegionServers; Instant Recovery; High Throughput; No Manual Splits; No Compactions; No Garbage Collection; No Manual Merges; Snapshots; Consistent Low Latency; No Manual Administration; Mirroring; No Practical Scale Limits.
  • 43. [Diagram: layered storage stacks (HBase, JVM, DFS, ext3, disks) for other distributions compared with the MapR unified stack.]
  • 44. 50B real-time auctions; #1 in audience reach: “M7 is really taking Hadoop to the next level. It allows us to do new things with our data.” (Jan Gelin, VP of Technical Operations). 2M+ subscribers; 10B+ records: “I’m really excited about M7 because it will address both the performance and the day-to-day challenges of HBase.” (Melinda Graham, Sr. Hadoop Engineer). Global leader in email intelligence: “M7 is a big win for us. It makes HBase really easy to use. It really helps us make better use of the data we have. It allows us to look at use cases we haven't had the opportunity to in the past.” (Andy Sautins, CTO).
  • 46. aws.amazon.com/elasticmapreduce. Online training: videos, articles/tutorials. Documentation: Getting Started Guide, Developer Guide, API Reference. FAQs. Paid training: 3-day Developer Course taught by Think Big Analytics. On-site consulting: EMR Bootcamp (for companies processing 1+ TB per day).
  • 47. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.