SlideShare ist ein Scribd-Unternehmen logo
1 von 14
Downloaden Sie, um offline zu lesen
The Platform for Big Data
    Amr Awadallah | CTO, Founder, Cloudera, Inc.
    aaa@cloudera.com, twitter: @awadallah




1
The Problems with Current Data Systems

    BI Reports + Interactive Apps             3. Can’t Explore Original High
                                                    Fidelity Raw Data
     RDBMS (aggregated data)

         ETL Compute Grid

                         1. Moving Data To
                       Compute Doesn’t Scale

              Storage Only Grid (original raw data)
                                                                         2. Archiving
              Mostly Append
                                                                         = Premature
                            Collection                                   Data Death

                        Instrumentation


2                          ©2012 Cloudera, Inc. All Rights Reserved.
The Solution: A Combined Storage/Compute Layer

                                                     3. Data Exploration &
      BI Reports + Interactive Apps                   Advanced Analytics
       RDBMS (aggregated data)

                       1. Scalable Throughput
                       For ETL & Aggregation
                          (ETL Acceleration)
                                                                               2. Keep Data
                  Hadoop: Storage + Compute Grid
                                                                              Alive For Ever
                Mostly Append                                                (Active Archive)

                              Collection

                          Instrumentation


3                            ©2012 Cloudera, Inc. All Rights Reserved.
So What is Apache                                            Hadoop ?
•   A scalable fault-tolerant distributed system for data storage and
    processing (open source under the Apache license).

•   Core Hadoop has two main systems:
    • Hadoop Distributed File System: self-healing high-bandwidth clustered
      storage.
    • MapReduce: distributed fault-tolerant resource management and scheduling
      coupled with a scalable data programming abstraction.

•   Key business values:
    • Flexibility – Store any data, Run any analysis.
    • Scalability – Start at 1TB/3-nodes grow to petabytes/1000s of nodes.
    • Economics – Cost per TB at a fraction of traditional options.


4                           ©2012 Cloudera, Inc. All Rights Reserved.
The Hadoop Big Bang



                                                   • Fastest sort of a TB, 62secs
                                                   over 1,460 nodes
                                                   • Sorted a PB in 16.25hours
                                                   over 3,658 nodes




                                                      Hadoop World 2009,
                                                        500 attendees


5      ©2012 Cloudera, Inc. All Rights Reserved.
The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):                                    Schema-on-Read (Hadoop):
    •   Schema must be created before                   •   Data is simply copied to the file store,
        any data can be loaded.                             no transformation is needed.
    •   An explicit load operation has to               •   A SerDe (Serializer/Deserlizer) is
        take place which transforms data                    applied during read time to extract
        to DB internal serialization format.                the required columns (late binding)
    •   New columns must be added                       •   New data can start flowing anytime
        explicitly before new data for such                 and will appear retroactively once the
        columns can be loaded into the                      SerDe is updated to parse it.
        database.

          •   OLAP is Fast                                     •   Load is Fast
                                                Pros
                                                 Pros
          •   Standards/Governance                             •   Flexibility/Agility


6                                    ©2012 Cloudera, Inc. All Rights Reserved.
Scalability: Scalable Software Development

      Grows without requiring developers to
      re-architect their algorithms/application.


                  AUTO SCALE
                   AUTO SCALE




7                  ©2012 Cloudera, Inc. All Rights Reserved.
Economics: Return on Byte
    •   Return on Byte (ROB) = value to be extracted from that
        byte divided by the cost of storing that byte

    •   If ROB is < 1 then it will be buried into tape wasteland,
        thus we need more economical active storage.

                                                                       High ROB


                                                                       Low ROB



8                          ©2012 Cloudera, Inc. All Rights Reserved.
The Big Data Platform: CDH4 – June 2012

                                   Job Workflow             Data Processing Lib                    Data Mining Lib
                                          APACHE OOZIE                       DataFu for Pig             APACHE MAHOUT
     Build/Test: APACHE BIGTOP




                                   Web Console                  Interactive SQL                       Metadata
                                                 HUE                                     Impala    APACHE HIVE MetaStore


                                                   Batch Processing Languages
                                                                                                           Fast
                                    Data                              APACHE PIG, APACHE HIVE
                                                                                                        Read/Write
                                 Integration
                                                                                                          Access
                                 APACHE FLUME,            Hadoop Core Kernel
                                 APACHE SQOOP                                     MapReduce, HDFS       APACHE HBASE


                                 Cloud Deployment                 Connectivity                      Coordination
                                         APACHE WHIRR          ODBC/JDBC/FUSE/HTTPS                  APACHE ZOOKEEPER

                                     Cloudera Manager Free Edition (Installation Wizard)


9                                                      ©2012 Cloudera, Inc. All Rights Reserved.
CDH in the Enterprise Data Stack
                                       ENGINEERS      DATA SCIENTISTS      ANALYSTS          BUSINESS USERS


       DATA            SYSTEM
     ARCHITECTS       OPERATORS
                                                        Modeling
                                                         Modeling            BI / /
                                                                              BI              Enterprise
                                                                                               Enterprise
                                         IDEs
                                          IDEs           Tools
                                                          Tools            Analytics
                                                                            Analytics         Reporting
                                                                                               Reporting

     Meta Data/
      Meta Data/      Cloudera
                       Cloudera
      ETL Tools
       ETL Tools      Manager
                       Manager


                                        ODBC, JDBC,
                                         NFS, HTTP

                                                                                      Enterprise Data
                                                                        Sqoop           Warehouse


                                                                                   Online Serving
                                                                        Sqoop        Systems

              Flume          Flume        Flume              Sqoop
                                                                                                        CUSTOMERS

                                                      Relational
                                                       Relational                       Web/Mobile
                                                                                         Web/Mobile
          Logs
           Logs           Files
                           Files     Web Data
                                     Web Data         Databases                         Applications
                                                       Databases                         Applications


10
                                     ©2012 Cloudera, Inc. All Rights Reserved.
HBase versus HDFS
HDFS:                                                    HBase:
Optimized For:                                            Optimized For:
•    Large Files                                          •   Small Records
•    Sequential Access (Hi Throughput)                    •   Random Access (Lo Latency)
•    Append Only                                          •   Atomic Record Updates

Use For:                                                  Use For:
•    Fact tables that are mostly append only              •   Dimension tables which are updated
     and require sequential full table scans.
                                                              frequently and require random low-
                                                              latency lookups.

                                                          Not Suitable For:
                                                          •   Low Latency Interactive OLAP.


11
                                    ©2012 Cloudera, Inc. All Rights Reserved.
Use Case Examples

 •   Retail: Price Optimization
 •   Media: Content Targeting
 •   Finance: Fraud Detection
 •   Manufacturing: Diagnostics
 •   Info Services: Satellite Imagery
 •   Agriculture: Seed Optimization
 •   Power: Smart Consumption

12             ©2012 Cloudera, Inc. All Rights Reserved.
Core Benefits of the Platform for Big Data

        1. FLEXIBILITY
        STORE ANY DATA
        RUN ANY ANALYSIS
        KEEP’S PACE WITH THE RATE OF CHANGE OF INCOMING DATA



        2. SCALABILITY
        PROVEN GROWTH TO PBS/1,000s OF NODES
        NO NEED TO REWRITE QUERIES, AUTOMATICALLY SCALES
        KEEP’S PACE WITH THE RATE OF GROWTH OF INCOMING DATA



        3. ECONOMICS
        COST PER TB AT A FRACTION OF OTHER OPTIONS
        KEEP ALL OF YOUR DATA ALIVE IN AN ACTIVE ARCHIVE
        POWERING THE DATA BEATS ALGORITHM MOVEMENT



13                        ©2012 Cloudera, Inc. All Rights Reserved.
Thank you!
Amr Awadallah, CTO, Founder, Cloudera, Inc. <aaa@cloudera.com>   @awadallah

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101EMC
 
Apache Hadoop Now Next and Beyond
Apache Hadoop Now Next and BeyondApache Hadoop Now Next and Beyond
Apache Hadoop Now Next and BeyondDataWorks Summit
 
Best Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopBest Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopHortonworks
 
A Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
A Survey of Petabyte Scale Databases and Storage Systems Deployed at FacebookA Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
A Survey of Petabyte Scale Databases and Storage Systems Deployed at FacebookBigDataCloud
 
Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hortonworks
 
Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208Cloudera, Inc.
 
Azure_Business_Opportunity
Azure_Business_OpportunityAzure_Business_Opportunity
Azure_Business_OpportunityNojan Emad
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championAmeet Paranjape
 
Hadoop Summit 2012 | Integrating Hadoop Into the Enterprise
Hadoop Summit 2012 | Integrating Hadoop Into the EnterpriseHadoop Summit 2012 | Integrating Hadoop Into the Enterprise
Hadoop Summit 2012 | Integrating Hadoop Into the EnterpriseCloudera, Inc.
 
Hortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics ApplicationsHortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics Applicationsrussell_jurney
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作James Chen
 
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14iwrigley
 
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in ProductionUpgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in ProductionCloudera, Inc.
 
Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Hortonworks
 
Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMongoDB
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksData Con LA
 

Was ist angesagt? (20)

Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
 
Apache Hadoop Now Next and Beyond
Apache Hadoop Now Next and BeyondApache Hadoop Now Next and Beyond
Apache Hadoop Now Next and Beyond
 
Best Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopBest Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache Hadoop
 
A Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
A Survey of Petabyte Scale Databases and Storage Systems Deployed at FacebookA Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
A Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
 
Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)
 
Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208
 
Azure_Business_Opportunity
Azure_Business_OpportunityAzure_Business_Opportunity
Azure_Business_Opportunity
 
Cloud computing era
Cloud computing eraCloud computing era
Cloud computing era
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a champion
 
Hadoop Summit 2012 | Integrating Hadoop Into the Enterprise
Hadoop Summit 2012 | Integrating Hadoop Into the EnterpriseHadoop Summit 2012 | Integrating Hadoop Into the Enterprise
Hadoop Summit 2012 | Integrating Hadoop Into the Enterprise
 
Hortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics ApplicationsHortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics Applications
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
 
Introduction to h base
Introduction to h baseIntroduction to h base
Introduction to h base
 
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in ProductionUpgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
 
Oracle in Database Hadoop
Oracle in Database HadoopOracle in Database Hadoop
Oracle in Database Hadoop
 
Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks
 
hadoop_module6
hadoop_module6hadoop_module6
hadoop_module6
 
Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDB
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 

Andere mochten auch

Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHanborq Inc.
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
Designing an IT Solution
Designing an IT SolutionDesigning an IT Solution
Designing an IT SolutionPhilippe Julio
 
Big Data
Big DataBig Data
Big DataNGDATA
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

Andere mochten auch (19)

Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed Introduction
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
What is big data?
What is big data?What is big data?
What is big data?
 
Designing an IT Solution
Designing an IT SolutionDesigning an IT Solution
Designing an IT Solution
 
Big Data
Big DataBig Data
Big Data
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 

Ähnlich wie Data Science Day New York: The Platform for Big Data

Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Cloudera, Inc.
 
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopBusiness Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopCloudera, Inc.
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...Amr Awadallah
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Cloudera, Inc.
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopCloudera, Inc.
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Jonathan Seidman
 
Amr Awadallah, unSEXY Presentation
Amr Awadallah, unSEXY PresentationAmr Awadallah, unSEXY Presentation
Amr Awadallah, unSEXY Presentation500 Startups
 
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo Slides
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo SlidesWebinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo Slides
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo SlidesCloudera, Inc.
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopBrock Noland
 
Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01eimhee
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony NguyenThanh Nguyen
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use casesJoey Echeverria
 
Get started with hadoop hive hive ql languages
Get started with hadoop hive hive ql languagesGet started with hadoop hive hive ql languages
Get started with hadoop hive hive ql languagesJanBask Training
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Jonathan Seidman
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 
Big data and hadoop product page
Big data and hadoop product pageBig data and hadoop product page
Big data and hadoop product pageJanu Jahnavi
 
Integrating Hadoop Into the Enterprise
Integrating Hadoop Into the EnterpriseIntegrating Hadoop Into the Enterprise
Integrating Hadoop Into the EnterpriseDataWorks Summit
 

Ähnlich wie Data Science Day New York: The Platform for Big Data (20)

Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
 
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopBusiness Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in Hadoop
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013
 
Amr Awadallah, unSEXY Presentation
Amr Awadallah, unSEXY PresentationAmr Awadallah, unSEXY Presentation
Amr Awadallah, unSEXY Presentation
 
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo Slides
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo SlidesWebinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo Slides
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo Slides
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache Hadoop
 
Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use cases
 
Get started with hadoop hive hive ql languages
Get started with hadoop hive hive ql languagesGet started with hadoop hive hive ql languages
Get started with hadoop hive hive ql languages
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Big data and hadoop product page
Big data and hadoop product pageBig data and hadoop product page
Big data and hadoop product page
 
Integrating Hadoop Into the Enterprise
Integrating Hadoop Into the EnterpriseIntegrating Hadoop Into the Enterprise
Integrating Hadoop Into the Enterprise
 

Mehr von Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Mehr von Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Data Science Day New York: The Platform for Big Data

  • 1. The Platform for Big Data Amr Awadallah | CTO, Founder, Cloudera, Inc. aaa@cloudera.com, twitter: @awadallah 1
  • 2. The Problems with Current Data Systems BI Reports + Interactive Apps 3. Can’t Explore Original High Fidelity Raw Data RDBMS (aggregated data) ETL Compute Grid 1. Moving Data To Compute Doesn’t Scale Storage Only Grid (original raw data) 2. Archiving Mostly Append = Premature Collection Data Death Instrumentation 2 ©2012 Cloudera, Inc. All Rights Reserved.
  • 3. The Solution: A Combined Storage/Compute Layer 3. Data Exploration & BI Reports + Interactive Apps Advanced Analytics RDBMS (aggregated data) 1. Scalable Throughput For ETL & Aggregation (ETL Acceleration) 2. Keep Data Hadoop: Storage + Compute Grid Alive For Ever Mostly Append (Active Archive) Collection Instrumentation 3 ©2012 Cloudera, Inc. All Rights Reserved.
  • 4. So What is Apache Hadoop ? • A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license). • Core Hadoop has two main systems: • Hadoop Distributed File System: self-healing high-bandwidth clustered storage. • MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction. • Key business values: • Flexibility – Store any data, Run any analysis. • Scalability – Start at 1TB/3-nodes grow to petabytes/1000s of nodes. • Economics – Cost per TB at a fraction of traditional options. 4 ©2012 Cloudera, Inc. All Rights Reserved.
  • 5. The Hadoop Big Bang • Fastest sort of a TB, 62secs over 1,460 nodes • Sorted a PB in 16.25hours over 3,658 nodes Hadoop World 2009, 500 attendees 5 ©2012 Cloudera, Inc. All Rights Reserved.
  • 6. The Key Benefit: Agility/Flexibility Schema-on-Write (RDBMS): Schema-on-Read (Hadoop): • Schema must be created before • Data is simply copied to the file store, any data can be loaded. no transformation is needed. • An explicit load operation has to • A SerDe (Serializer/Deserlizer) is take place which transforms data applied during read time to extract to DB internal serialization format. the required columns (late binding) • New columns must be added • New data can start flowing anytime explicitly before new data for such and will appear retroactively once the columns can be loaded into the SerDe is updated to parse it. database. • OLAP is Fast • Load is Fast Pros Pros • Standards/Governance • Flexibility/Agility 6 ©2012 Cloudera, Inc. All Rights Reserved.
  • 7. Scalability: Scalable Software Development Grows without requiring developers to re-architect their algorithms/application. AUTO SCALE AUTO SCALE 7 ©2012 Cloudera, Inc. All Rights Reserved.
  • 8. Economics: Return on Byte • Return on Byte (ROB) = value to be extracted from that byte divided by the cost of storing that byte • If ROB is < 1 then it will be buried into tape wasteland, thus we need more economical active storage. High ROB Low ROB 8 ©2012 Cloudera, Inc. All Rights Reserved.
  • 9. The Big Data Platform: CDH4 – June 2012 Job Workflow Data Processing Lib Data Mining Lib APACHE OOZIE DataFu for Pig APACHE MAHOUT Build/Test: APACHE BIGTOP Web Console Interactive SQL Metadata HUE Impala APACHE HIVE MetaStore Batch Processing Languages Fast Data APACHE PIG, APACHE HIVE Read/Write Integration Access APACHE FLUME, Hadoop Core Kernel APACHE SQOOP MapReduce, HDFS APACHE HBASE Cloud Deployment Connectivity Coordination APACHE WHIRR ODBC/JDBC/FUSE/HTTPS APACHE ZOOKEEPER Cloudera Manager Free Edition (Installation Wizard) 9 ©2012 Cloudera, Inc. All Rights Reserved.
  • 10. CDH in the Enterprise Data Stack ENGINEERS DATA SCIENTISTS ANALYSTS BUSINESS USERS DATA SYSTEM ARCHITECTS OPERATORS Modeling Modeling BI / / BI Enterprise Enterprise IDEs IDEs Tools Tools Analytics Analytics Reporting Reporting Meta Data/ Meta Data/ Cloudera Cloudera ETL Tools ETL Tools Manager Manager ODBC, JDBC, NFS, HTTP Enterprise Data Sqoop Warehouse Online Serving Sqoop Systems Flume Flume Flume Sqoop CUSTOMERS Relational Relational Web/Mobile Web/Mobile Logs Logs Files Files Web Data Web Data Databases Applications Databases Applications 10 ©2012 Cloudera, Inc. All Rights Reserved.
  • 11. HBase versus HDFS HDFS: HBase: Optimized For: Optimized For: • Large Files • Small Records • Sequential Access (Hi Throughput) • Random Access (Lo Latency) • Append Only • Atomic Record Updates Use For: Use For: • Fact tables that are mostly append only • Dimension tables which are updated and require sequential full table scans. frequently and require random low- latency lookups. Not Suitable For: • Low Latency Interactive OLAP. 11 ©2012 Cloudera, Inc. All Rights Reserved.
  • 12. Use Case Examples • Retail: Price Optimization • Media: Content Targeting • Finance: Fraud Detection • Manufacturing: Diagnostics • Info Services: Satellite Imagery • Agriculture: Seed Optimization • Power: Smart Consumption 12 ©2012 Cloudera, Inc. All Rights Reserved.
  • 13. Core Benefits of the Platform for Big Data 1. FLEXIBILITY STORE ANY DATA RUN ANY ANALYSIS KEEP’S PACE WITH THE RATE OF CHANGE OF INCOMING DATA 2. SCALABILITY PROVEN GROWTH TO PBS/1,000s OF NODES NO NEED TO REWRITE QUERIES, AUTOMATICALLY SCALES KEEP’S PACE WITH THE RATE OF GROWTH OF INCOMING DATA 3. ECONOMICS COST PER TB AT A FRACTION OF OTHER OPTIONS KEEP ALL OF YOUR DATA ALIVE IN AN ACTIVE ARCHIVE POWERING THE DATA BEATS ALGORITHM MOVEMENT 13 ©2012 Cloudera, Inc. All Rights Reserved.
  • 14. Thank you! Amr Awadallah, CTO, Founder, Cloudera, Inc. <aaa@cloudera.com> @awadallah

Hinweis der Redaktion

  1. Open Source – 100% Open Source, 100% Apache licensed, 100% Free. Enterprise Ready – Predictable releases, Documentation, Hotfix Patches, Intensive QA. Proven at Scale – Deployed at hundreds of enterprises across many industries. Integrated – All required component versions &amp; dependencies are properly managed. Industry Standard – Existing RDBMS, ETL and BI systems work with it. Many Form Factors – Public Cloud, Private Cloud, RHEL, Ubuntu, 32/64bit, etc.