11/16/2011, Stanford EE380 Computer Systems Colloquium
Introducing Apache Hadoop:
The Modern Data Operating System

Dr. Amr Awadallah | Founder, CTO, VP of Engineering
aaa@cloudera.com, twitter: @awadallah
Limitations of Existing Data Analytics Architecture

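[Diagram: the existing analytics stack, listed top to bottom]
• BI Reports + Interactive Apps
• RDBMS (aggregated data)
• ETL Compute Grid
• Storage Only Grid (original raw data)
• Collection
• Instrumentation

Limitations called out on the diagram:
• Can’t explore original high-fidelity raw data.
• Moving data to compute doesn’t scale.
• Archiving = premature data death.

Other label on the diagram: Mostly Append.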


                            ©2011 Cloudera, Inc. All Rights Reserved.
So What is Apache Hadoop?
• A scalable fault-tolerant distributed system for
  data storage and processing (open source
  under the Apache license).

• Core Hadoop has two main systems:
    – Hadoop Distributed File System: self-healing
      high-bandwidth clustered storage.
    – MapReduce: distributed fault-tolerant resource
      management and scheduling coupled with a
      scalable data programming abstraction.

The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
•   Schema must be created before any data can be loaded.
•   An explicit load operation has to take place which transforms data to DB internal structure.
•   New columns must be added explicitly before new data for such columns can be loaded into the database.
Pros:
•   Read is Fast
•   Standards/Governance

Schema-on-Read (Hadoop):
•   Data is simply copied to the file store, no transformation is needed.
•   A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding).
•   New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
Pros:
•   Load is Fast
•   Flexibility/Agility
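
A minimal sketch of the schema-on-read idea, assuming a comma-separated log format and two hypothetical deserializer functions (`serde_v1`, `serde_v2`) invented for illustration; it is not Hive's SerDe API. The raw lines are stored untouched, and columns are extracted only when a read happens, so updating the deserializer makes a newer field visible for data that was already stored.

```python
# Raw records are kept exactly as they arrived; nothing is parsed at load time.
RAW_LINES = [
    "2011-11-16,alice,42",
    "2011-11-16,bob,17,mobile",  # a fourth field started flowing later
]


def serde_v1(line):
    """Hypothetical deserializer that knows only the first three fields."""
    date, user, clicks = line.split(",")[:3]
    return {"date": date, "user": user, "clicks": int(clicks)}


def serde_v2(line):
    """Updated deserializer that also extracts the optional fourth field."""
    row = serde_v1(line)
    parts = line.split(",")
    row["device"] = parts[3] if len(parts) > 3 else None
    return row


def read_table(lines, serde, columns):
    """Apply the deserializer at read time and project the requested columns."""
    return [tuple(serde(line)[c] for c in columns) for line in lines]


print(read_table(RAW_LINES, serde_v1, ["user", "clicks"]))
print(read_table(RAW_LINES, serde_v2, ["user", "device"]))
```

Nothing in `RAW_LINES` changes between the two reads; only the parsing logic does, which is the late-binding property described above.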


Innovation: Explore Original Raw Data

    Data Committee                                          Data Scientist




Flexibility: Complex Data Processing
1. Java MapReduce: Most flexibility and performance, but tedious
   development cycle (the assembly language of Hadoop).
2. Streaming MapReduce: Allows you to develop in
   any programming language of your choice, but slightly lower
   performance and less flexibility than native Java MapReduce.
   (Hadoop Pipes is a separate C++ interface, not another name for Streaming.)
3. Crunch: A library for multi-stage MapReduce pipelines in Java
   (modeled after Google’s FlumeJava).
4. Pig Latin: A high-level language out of Yahoo, suitable for batch
   data flow workloads.
5. Hive: A SQL interpreter out of Facebook, also includes a metastore
   mapping files to their schemas and associated SerDes.
6. Oozie: A PDL XML workflow engine that enables creating a
   workflow of jobs composed of any of the above.

Scalability: Scalable Software Development

     Grows without requiring developers to
     re-architect their algorithms/application.



                 AUTO SCALE




Scalability: Data Beats Algorithm

      Smarter Algos                                                              More Data




A. Halevy et al, “The Unreasonable Effectiveness of Data”, IEEE Intelligent Systems, March 2009


Scalability: Keep All Data Alive Forever
Archive to Tape and Never See It Again
versus
Extract Value From All Your Data




Use The Right Tool For The Right Job
Relational Databases, use when:
•   Interactive OLAP Analytics (<1sec)
•   Multistep ACID Transactions
•   100% SQL Compliance

Hadoop, use when:
•   Structured or Not (Flexibility)
•   Scalability of Storage/Compute
•   Complex Data Processing


HDFS: Hadoop Distributed File System
A given file is broken down into blocks
(default = 64MB); the blocks are then
replicated across the cluster (default = 3 replicas).
Optimized for:
• Throughput
• Put/Get/Delete
• Appends

Block Replication for:
• Durability
• Availability
• Throughput

Block Replicas are distributed
across servers and racks.
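
The block and replica defaults above can be illustrated with a short sketch. The 64MB block size and replication factor of 3 come from the slide; the `place_replicas` function is a toy round-robin placement written for this example and is not HDFS's actual placement policy. The rack and server names are made up.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # default block size quoted on the slide
REPLICATION = 3                # default replication factor quoted on the slide


def num_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks a file occupies; the last block may be partial."""
    return math.ceil(file_size_bytes / block_size)


def place_replicas(block_id, servers_by_rack, replication=REPLICATION):
    """Toy placement: rotate over racks so replicas land on distinct servers
    spread across racks. Illustrative only, not the real HDFS policy."""
    racks = sorted(servers_by_rack)
    placements = []
    for i in range(replication):
        rack = racks[(block_id + i) % len(racks)]
        servers = servers_by_rack[rack]
        server = servers[(block_id + i // len(racks)) % len(servers)]
        placements.append((rack, server))
    return placements


cluster = {"rack1": ["s1", "s2"], "rack2": ["s3", "s4"]}  # hypothetical cluster
for block_id in range(num_blocks(200 * 1024 * 1024)):
    print(block_id, place_replicas(block_id, cluster))
```

Losing any single server, or a whole rack, still leaves at least one copy of every block in this toy layout, which is the durability and availability argument made above.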

MapReduce: Computational Framework
cat *.txt | mapper.pl | sort | reducer.pl > out.txt

[Diagram: MapReduce data flow]
Split 1..N (docid, text) => Map 1..M (words, counts) => Shuffle (sorted words, counts) => Reduce 1..R (sorted words, sum of counts) => Output File 1..R

Example from the diagram: for input text such as “To Be Or Not To Be?”, the maps emit partial counts (Be, 5), (Be, 12), (Be, 7), (Be, 6), and after the shuffle a single reducer sums them to (Be, 30).

Map(in_key, in_value) => list of (out_key, intermediate_value)
Reduce(out_key, list of intermediate_values) => out_value(s)
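
The Map and Reduce signatures above can be sketched as a single-process word count. This is an in-memory illustration of the programming model only, with no distribution or fault tolerance; the function names `map_fn`, `shuffle`, `reduce_fn` and `word_count` are chosen for this example.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map(in_key, in_value) => list of (out_key, intermediate_value)."""
    words = (w.strip('.,?!"“”').lower() for w in text.split())
    return [(w, 1) for w in words if w]


def shuffle(pairs):
    """Group intermediate values by key and sort the keys (the `sort` step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(word, counts):
    """Reduce(out_key, list of intermediate_values) => out_value."""
    return (word, sum(counts))


def word_count(splits):
    """Run map over every split, shuffle, then reduce each key group."""
    intermediate = []
    for doc_id, text in splits:
        intermediate.extend(map_fn(doc_id, text))
    return [reduce_fn(word, counts) for word, counts in shuffle(intermediate)]


print(word_count([(1, "To Be Or Not"), (2, "To Be?")]))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In a real cluster each split is mapped on a different machine and each reducer receives only its share of the keys; the shape of the two functions stays the same.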



MapReduce: Resource Manager / Scheduler
A given job is broken down into tasks,
then tasks are scheduled to be as
close to data as possible.

Three levels of data locality:
• Same server as data (local disk)
• Same rack as data (rack/leaf switch)
• Wherever there is a free slot (cross rack)

Optimized for:
• Batch Processing
• Failure Recovery
System detects laggard tasks and
speculatively executes parallel tasks
on the same slice of data.
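
The three locality levels above can be expressed as a simple ranking. This is a simplified sketch, assuming free slots and block replicas are both described as hypothetical `(rack, server)` pairs; the real scheduler's heartbeat protocol and speculative execution are not modeled.

```python
def locality_level(free_slot, replica_locations):
    """Rank a free slot: 0 = same server as the data, 1 = same rack,
    2 = anywhere else (cross rack)."""
    slot_rack, slot_server = free_slot
    if any((rack, server) == (slot_rack, slot_server)
           for rack, server in replica_locations):
        return 0
    if any(rack == slot_rack for rack, _ in replica_locations):
        return 1
    return 2


def pick_slot(free_slots, replica_locations):
    """Choose the free slot closest to a replica of the task's input block."""
    return min(free_slots, key=lambda slot: locality_level(slot, replica_locations))


replicas = [("rack1", "s1"), ("rack2", "s3")]  # where the input block lives
print(pick_slot([("rack3", "s9"), ("rack1", "s2"), ("rack2", "s3")], replicas))
# ('rack2', 's3')
```

When no slot is free on a server or rack holding the data, the same function falls back to a cross-rack slot, matching the third level in the list.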

But Networks Are Faster Than Disks!

Yes, however, core and disk density per server
are going up very quickly:

•    1 Hard Disk = 100MB/sec (~1Gbps)
•    Server = 12 Hard Disks = 1.2GB/sec (~12Gbps)
•    Rack = 20 Servers = 24GB/sec (~240Gbps)
•    Avg. Cluster = 6 Racks = 144GB/sec (~1.4Tbps)
•    Large Cluster = 200 Racks = 4.8TB/sec (~48Tbps)
•    Scanning 4.8TB at 100MB/sec takes 13 hours.
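
The arithmetic behind these figures can be checked with a few lines. The constants are the slide's own (100MB/sec per disk, 12 disks per server, 20 servers per rack); decimal units are assumed, so 1TB = 1,000,000MB.

```python
DISK_MB_PER_SEC = 100     # one hard disk
DISKS_PER_SERVER = 12
SERVERS_PER_RACK = 20


def aggregate_mb_per_sec(racks):
    """Combined sequential disk bandwidth of a cluster with `racks` racks."""
    return DISK_MB_PER_SEC * DISKS_PER_SERVER * SERVERS_PER_RACK * racks


def scan_hours(data_mb, mb_per_sec):
    """Hours needed to scan `data_mb` megabytes at `mb_per_sec`."""
    return data_mb / mb_per_sec / 3600


print(aggregate_mb_per_sec(1))              # 24000 MB/sec   = 24GB/sec per rack
print(aggregate_mb_per_sec(6))              # 144000 MB/sec  = 144GB/sec
print(aggregate_mb_per_sec(200))            # 4800000 MB/sec = 4.8TB/sec
print(scan_hours(4_800_000, DISK_MB_PER_SEC))  # about 13.3 hours on one disk
```

Under these assumptions the 200-rack cluster reads the same 4.8TB in about one second, provided every disk reads its local share in parallel rather than shipping the data over the network.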

Hadoop High-Level Architecture
• Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs.
• Name Node: Maintains mapping of file names to blocks to data node slaves.
• Job Tracker: Tracks resources and schedules jobs across task tracker slaves.
• Data Node: Stores and serves blocks of data.
• Task Tracker: Runs tasks (work units) within a job.
• Data Node and Task Tracker share a physical node.
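
The storage half of this architecture can be sketched as two small classes. This is a toy in-memory model, assuming made-up block ids and node names; the class and method names are chosen for this example and are not Hadoop's API. It shows the division of labor: the name node holds only metadata, while data nodes hold the bytes.

```python
class DataNode:
    """Stores and serves blocks of data."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def serve(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Keeps only metadata: file name -> block ids -> data node names."""

    def __init__(self):
        self.file_to_blocks = {}
        self.block_to_nodes = {}

    def register(self, file_name, block_id, node_names):
        self.file_to_blocks.setdefault(file_name, []).append(block_id)
        self.block_to_nodes[block_id] = list(node_names)

    def locate(self, file_name):
        """Return the file's blocks, in order, with the nodes holding each."""
        return [(b, self.block_to_nodes[b]) for b in self.file_to_blocks[file_name]]


def client_read(name_node, data_nodes, file_name):
    """Ask the name node where the blocks live, then fetch them from data nodes."""
    chunks = []
    for block_id, node_names in name_node.locate(file_name):
        chunks.append(data_nodes[node_names[0]].serve(block_id))
    return b"".join(chunks)
```

A client in this sketch never reads file contents from the name node; it only learns which data nodes to contact.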




Changes for Better Availability/Scalability
• Name Node: Federation partitions out the name space; High Availability via an Active Standby.
• Job Tracker: Each job has its own Application Manager; Resource Manager is decoupled from MR.
• Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs.
• Data Node: Stores and serves blocks of data.
• Task Tracker: Runs tasks (work units) within a job.
• Data Node and Task Tracker share a physical node.




CDH: Cloudera’s Distribution Including Apache Hadoop


• Build/Test: Apache Bigtop
• File System Mount: FUSE-DFS
• UI Framework/SDK: Hue
• Data Mining: Apache Mahout
• Workflow: Apache Oozie
• Scheduling: Apache Oozie
• Metadata: Apache Hive
• Languages / Compilers: Apache Pig, Apache Hive
• Data Integration: Apache Flume, Apache Sqoop
• Fast Read/Write Access: Apache HBase
• Coordination: Apache ZooKeeper
• SCM Express (Installation Wizard for CDH)


Books




Conclusion
• The Key Benefits of Apache Hadoop:
   – Agility/Flexibility (Quickest Time to Insight).
   – Complex Data Processing (Any Language, Any Problem).
   – Scalability of Storage/Compute (Freedom to Grow).
   – Economical Storage (Keep All Your Data Alive Forever).

• The Key Systems for Apache Hadoop are:
   – Hadoop Distributed File System: self-healing high-bandwidth
     clustered storage.
   – MapReduce: distributed fault-tolerant resource
     management coupled with scalable data processing.

Appendix




 BACKUP SLIDES



Unstructured Data is Exploding




[Chart with two series: Complex, Unstructured data and Relational data]

• 2,500 exabytes of new information in 2012 with Internet as primary driver
• Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year

Source: IDC White Paper, sponsored by EMC: “As the Economy Contracts, the Digital Universe Expands”, May 2009.
Hadoop Creation History


• Fastest sort of a TB: 62 secs over 1,460 nodes
• Sorted a PB in 16.25 hours over 3,658 nodes




Hadoop in the Enterprise Data Stack

[Diagram: Hadoop in the enterprise data stack. Labels recoverable from the figure:]
• Users: Data Scientists, Analysts, Business Users, System Operators, Data Architects, Customers
• Tools: IDEs (Development Tools); BI, Analytics (Business Intelligence Tools); Enterprise Reporting; Cloudera Mgmt Suite; ETL Tools
• Interfaces: ODBC, JDBC, NFS, Native
• Data movement: Flume (Logs, Files, Web Data); Sqoop (Relational Databases, Enterprise Data Warehouse)
• Downstream: Low-Latency Serving Systems, Web Application


MapReduce Next Gen
Main idea is to split up the JobTracker functions:
• Cluster resource management (for tracking and
  allocating nodes)
• Application life-cycle management (for MapReduce
  scheduling and execution)



Enables:
• High Availability
• Better Scalability
• Efficient Slot Allocation
• Rolling Upgrades
• Non-MapReduce Apps
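
The split described above can be sketched with two toy classes. This is an illustration of the separation of concerns only, assuming a made-up slot model; the class names echo the roles named in the talk, and the methods are invented for this example rather than taken from any Hadoop API.

```python
class ResourceManager:
    """Cluster-wide role: tracks free slots per node and hands them out.
    It knows nothing about MapReduce."""

    def __init__(self, slots_per_node):
        self.free = dict(slots_per_node)

    def allocate(self, count):
        """Grant up to `count` slots, returned as a list of node names."""
        granted = []
        for node in sorted(self.free):
            while self.free[node] > 0 and len(granted) < count:
                self.free[node] -= 1
                granted.append(node)
        return granted

    def release(self, nodes):
        for node in nodes:
            self.free[node] += 1


class ApplicationManager:
    """Per-job role: owns the scheduling and execution of one job's tasks."""

    def __init__(self, job_name, tasks):
        self.job_name = job_name
        self.tasks = list(tasks)

    def run(self, resource_manager):
        """Run tasks in waves, borrowing slots and returning them afterwards."""
        assignments = []
        pending = list(self.tasks)
        while pending:
            nodes = resource_manager.allocate(len(pending))
            if not nodes:
                raise RuntimeError("no free slots in the cluster")
            batch, pending = pending[:len(nodes)], pending[len(nodes):]
            assignments.extend(zip(batch, nodes))
            resource_manager.release(nodes)
        return assignments
```

Because the resource manager in this sketch only deals in slots, a second kind of application manager for a non-MapReduce workload could reuse it unchanged, which is the point of the last bullet above.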




Two Core Use Cases Common Across Many Industries


Industry          Advanced Analytics           Data Processing
--------          ------------------           ---------------
Web               Social Network Analysis      Clickstream Sessionization
Media             Content Optimization         Clickstream Sessionization
Telco             Network Analytics            Mediation
Retail            Loyalty & Promotions         Data Factory
Financial         Fraud Analysis               Trade Reconciliation
Federal           Entity Analysis              SIGINT
Bioinformatics    Sequencing Analysis          Genome Mapping
Manufacturing     Product Quality              Mfg Process Tracking



What is Cloudera Enterprise?
Cloudera Enterprise makes open source Apache Hadoop enterprise-easy:
• Simplify and Accelerate Hadoop Deployment
• Reduce Adoption Costs and Risks
• Lower the Cost of Administration
• Increase the Transparency & Control of Hadoop
• Leverage the Experience of Our Experts

CLOUDERA ENTERPRISE COMPONENTS
• Cloudera Management Suite: Comprehensive Toolset for Hadoop Administration
• Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs

3 of the top 5 telecommunications, mobile services, defense & intelligence, banking, media and retail organizations depend on Cloudera

• EFFECTIVENESS: Ensuring Repeatable Value from Apache Hadoop Deployments
• EFFICIENCY: Enabling Apache Hadoop to be Affordably Run in Production



Hive vs Pig Latin (count distinct values > 0)
• Hive Syntax:
  SELECT COUNT(DISTINCT col1)
  FROM mytable
  WHERE col1 > 0;

• Pig Latin Syntax:
  mytable = LOAD 'myfile' AS (col1, col2, col3);
  mytable = FOREACH mytable GENERATE col1;
  mytable = FILTER mytable BY col1 > 0;
  mytable = DISTINCT mytable;
  mytable = GROUP mytable ALL;
  mytable = FOREACH mytable GENERATE COUNT(mytable);
  DUMP mytable;
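
For readers unfamiliar with either language, the same computation can be sketched in plain Python. This is only an analogy for the logical steps (project, filter, distinct, count), assuming rows are tuples whose first element is `col1`; it says nothing about how Hive or Pig execute the query.

```python
def count_distinct_positive(rows):
    """Count the distinct values of the first column that are greater than 0."""
    col1_values = (row[0] for row in rows)        # project col1
    positive = (v for v in col1_values if v > 0)  # keep col1 > 0
    distinct = set(positive)                      # remove duplicates
    return len(distinct)                          # count what is left


rows = [(1, "a", "x"), (1, "b", "y"), (0, "c", "z"), (-5, "d", "w"), (7, "e", "v")]
print(count_distinct_positive(rows))  # 2
```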


Apache Hive Key Features

•    A subset of SQL covering the most common statements
•    JDBC/ODBC support
•    Agile data types: Array, Map, Struct, and JSON objects
•    Pluggable SerDe system to work on unstructured files directly
•    User Defined Functions and Aggregates
•    Regular Expression support
•    MapReduce support
•    Partitions and Buckets (for performance optimization)
•    MicroStrategy/Tableau Compatibility (through ODBC)
•    In The Works: Indices, Columnar Storage, Views, Explode/Collect
•    More details: http://wiki.apache.org/hadoop/Hive



Hive Agile Data Types

• STRUCTS:
     – SELECT mytable.mycolumn.myfield FROM …
• MAPS (Hashes):
     – SELECT mytable.mycolumn[mykey] FROM …
• ARRAYS:
     – SELECT mytable.mycolumn[5] FROM …
• JSON:
     – SELECT get_json_object(mycolumn, objpath) FROM …




Introducing Apache Hadoop: The Modern Data Operating System  - Stanford EE380

More Related Content

Recently uploaded

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 

Recently uploaded (20)

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 


Introducing Apache Hadoop: The Modern Data Operating System - Stanford EE380

  • 1. 11/16/2011, Stanford EE380 Computer Systems Colloquium Introducing Apache Hadoop: The Modern Data Operating System Dr. Amr Awadallah | Founder, CTO, VP of Engineering aaa@cloudera.com, twitter: @awadallah
  • 2. Limitations of Existing Data Analytics Architecture. Layers, top to bottom: BI Reports + Interactive Apps; RDBMS (aggregated data); ETL Compute Grid; Storage Only Grid (original raw data; Mostly Append); Collection; Instrumentation. Problems called out: Can’t Explore Original High Fidelity Raw Data; Moving Data To Compute Doesn’t Scale; Archiving = Premature Data Death. ©2011 Cloudera, Inc. All Rights Reserved.
  • 3. So What is Apache Hadoop? • A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license). • Core Hadoop has two main systems: – Hadoop Distributed File System: self-healing high-bandwidth clustered storage. – MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.
  • 4. The Key Benefit: Agility/Flexibility. Schema-on-Write (RDBMS): • Schema must be created before any data can be loaded. • An explicit load operation has to take place which transforms data to DB internal structure. • New columns must be added explicitly before new data for such columns can be loaded into the database. • Pros: Read is Fast; Standards/Governance. Schema-on-Read (Hadoop): • Data is simply copied to the file store, no transformation is needed. • A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding). • New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it. • Pros: Load is Fast; Flexibility/Agility.
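The schema-on-read idea from slide 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the JSON encoding, and the names `serde_v1`, `serde_v2`, and `read_table` are hypothetical and are not part of Hadoop or Hive.

```python
import json

# Raw records are stored exactly as they arrived (hypothetical sample data).
RAW_LINES = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": 5, "country": "CA"}',
]

def serde_v1(line):
    """First reader: only knows about the user and clicks columns."""
    rec = json.loads(line)
    return {"user": rec["user"], "clicks": rec["clicks"]}

def serde_v2(line):
    """Updated reader: also extracts country (None when a row lacks it)."""
    rec = json.loads(line)
    return {
        "user": rec["user"],
        "clicks": rec["clicks"],
        "country": rec.get("country"),
    }

def read_table(lines, serde):
    """Apply the schema at read time (late binding); stored data never changes."""
    return [serde(line) for line in lines]
```

Switching from `serde_v1` to `serde_v2` makes the `country` column appear for rows that were written before the reader knew about it, without reloading any data.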
  • 5. Innovation: Explore Original Raw Data (figure labels: Data Committee, Data Scientist).
  • 6. Flexibility: Complex Data Processing. 1. Java MapReduce: Most flexibility and performance, but tedious development cycle (the assembly language of Hadoop). 2. Streaming MapReduce (aka Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility than native Java MapReduce. 3. Crunch: A library for multi-stage MapReduce pipelines in Java (modeled after Google’s FlumeJava). 4. Pig Latin: A high-level language out of Yahoo, suitable for batch data flow workloads. 5. Hive: A SQL interpreter out of Facebook, also includes a metastore mapping files to their schemas and associated SerDes. 6. Oozie: A PDL XML workflow engine that enables creating a workflow of jobs composed of any of the above.
  • 7. Scalability: Scalable Software Development. Grows without requiring developers to re-architect their algorithms/application (AUTO SCALE).
  • 8. Scalability: Data Beats Algorithm (figure labels: Smarter Algos, More Data). A. Halevy et al., “The Unreasonable Effectiveness of Data”, IEEE Intelligent Systems, March 2009.
  • 9. Scalability: Keep All Data Alive Forever. Archive to Tape and Never See It Again vs. Extract Value From All Your Data.
  • 10. Use The Right Tool For The Right Job. Relational Databases: Use when: • Interactive OLAP Analytics (<1sec) • Multistep ACID Transactions • 100% SQL Compliance. Hadoop: Use when: • Structured or Not (Flexibility) • Scalability of Storage/Compute • Complex Data Processing.
  • 11. HDFS: Hadoop Distributed File System. A given file is broken down into blocks (default=64MB), then blocks are replicated across the cluster (default=3). Optimized for: • Throughput • Put/Get/Delete • Appends. Block Replication for: • Durability • Availability • Throughput. Block Replicas are distributed across servers and racks.
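As a rough worked example of the block and replication defaults quoted on slide 11, the sketch below computes how many blocks and replicas a file produces. The function name `hdfs_footprint` is hypothetical, and the calculation assumes the final partial block only occupies the space of the data it actually holds.

```python
import math

BLOCK_SIZE_MB = 64   # default block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block_replicas, raw_storage_mb) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)  # last block may be partial
    block_replicas = blocks * replication             # copies spread over servers/racks
    raw_storage_mb = file_size_mb * replication       # partial block is not padded
    return blocks, block_replicas, raw_storage_mb
```

For a 1000 MB file with the defaults, this gives 16 blocks, 48 block replicas, and 3000 MB of raw storage.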
  • 12. MapReduce: Computational Framework. Analogy: cat *.txt | mapper.pl | sort | reducer.pl > out.txt. Data flow: Split 1..N (docid, text) → Map 1..M (words, counts) → Shuffle (sorted words, counts) → Reduce 1..R (sorted words, sum of counts) → Output File 1..R. Example: for input “To Be Or Not To Be?”, the map outputs Be, 5; Be, 12; Be, 7; Be, 6 are shuffled to one reducer, which emits Be, 30. Map(in_key, in_value) => list of (out_key, intermediate_value). Reduce(out_key, list of intermediate_values) => out_value(s).
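The Map and Reduce signatures on slide 12 can be illustrated with an in-memory word count. This is a single-process sketch of the programming model only, not Hadoop's API; `map_fn`, `shuffle`, `reduce_fn`, and `word_count` are hypothetical names.

```python
from collections import defaultdict

def map_fn(docid, text):
    """Map(in_key, in_value) => list of (out_key, intermediate_value)."""
    for token in text.lower().split():
        word = token.strip('?.,!"')
        if word:
            yield word, 1

def shuffle(pairs):
    """Group intermediate values by key and sort the keys, as the shuffle does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(word, counts):
    """Reduce(out_key, list of intermediate_values) => out_value."""
    return word, sum(counts)

def word_count(docs):
    """Run map, shuffle, and reduce over a dict of {docid: text}."""
    pairs = [kv for docid, text in docs.items() for kv in map_fn(docid, text)]
    return [reduce_fn(word, counts) for word, counts in shuffle(pairs)]
```

Calling `word_count({"doc1": "To Be Or Not To Be?"})` yields one (word, count) pair per distinct word, in sorted key order.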
  • 13. MapReduce: Resource Manager / Scheduler. A given job is broken down into tasks, then tasks are scheduled to be as close to data as possible. Three levels of data locality: • Same server as data (local disk) • Same rack as data (rack/leaf switch) • Wherever there is a free slot (cross rack). Optimized for: • Batch Processing • Failure Recovery. System detects laggard tasks and speculatively executes parallel tasks on the same slice of data.
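The three locality levels on slide 13 can be sketched as a simple ranking over free slots. This is a deliberately simplified model, assuming each slot and each block replica is described by a (host, rack) pair; the names `locality` and `pick_slot` are hypothetical, and the real scheduler is more involved than a single `min` call.

```python
NODE_LOCAL, RACK_LOCAL, OFF_RACK = 0, 1, 2  # lower is closer to the data

def locality(slot, replicas):
    """Classify a free slot against the (host, rack) locations of a block's replicas."""
    host, rack = slot
    if any(host == replica_host for replica_host, _ in replicas):
        return NODE_LOCAL   # same server as data (local disk)
    if any(rack == replica_rack for _, replica_rack in replicas):
        return RACK_LOCAL   # same rack as data (rack/leaf switch)
    return OFF_RACK         # wherever there is a free slot (cross rack)

def pick_slot(free_slots, replicas):
    """Choose the free slot closest to the data; ties keep the first slot listed."""
    return min(free_slots, key=lambda slot: locality(slot, replicas))
```

With replicas on `("n1", "r1")`, `("n5", "r2")`, and `("n6", "r2")`, a free slot on `n5` is preferred over one elsewhere in rack `r1`, which in turn is preferred over a slot in an unrelated rack.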
  • 14. But Networks Are Faster Than Disks! Yes, however, core and disk density per server are going up very quickly: • 1 Hard Disk = 100MB/sec (~1Gbps) • Server = 12 Hard Disks = 1.2GB/sec (~12Gbps) • Rack = 20 Servers = 24GB/sec (~240Gbps) • Avg. Cluster = 6 Racks = 144GB/sec (~1.4Tbps) • Large Cluster = 200 Racks = 4.8TB/sec (~48Tbps) • Scanning 4.8TB on a single disk at 100MB/sec takes about 13 hours.
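The figures on slide 14 follow from multiplying the per-disk rate up through servers and racks. The sketch below reproduces that arithmetic using decimal units (1 GB = 1000 MB); the function names are hypothetical.

```python
DISK_MB_PER_SEC = 100    # one hard disk, as quoted on the slide
DISKS_PER_SERVER = 12
SERVERS_PER_RACK = 20

def aggregate_mb_per_sec(racks):
    """Aggregate sequential disk bandwidth of a cluster with the given rack count."""
    return DISK_MB_PER_SEC * DISKS_PER_SERVER * SERVERS_PER_RACK * racks

def single_disk_scan_hours(total_mb):
    """Hours needed to scan total_mb sequentially from a single disk."""
    return total_mb / DISK_MB_PER_SEC / 3600
```

One rack gives 24,000 MB/sec (24GB/sec), 6 racks give 144GB/sec, and 200 racks give 4.8TB/sec, while `single_disk_scan_hours(4_800_000)` comes out at roughly 13.3 hours for the same 4.8TB.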
  • 15. Hadoop High-Level Architecture. Hadoop Client: contacts Name Node for data or Job Tracker to submit jobs. Name Node: maintains mapping of file names to blocks to data node slaves. Job Tracker: tracks resources and schedules jobs across task tracker slaves. Data Node: stores and serves blocks of data. Task Tracker: runs tasks (work units) within a job. Data Node and Task Tracker share a physical node.
  • 16. Changes for Better Availability/Scalability. Hadoop Client: contacts Name Node for data or Job Tracker to submit jobs. Name Node: Federation partitions out the name space, High Availability via an Active Standby. Job Tracker: each job has its own Application Manager, Resource Manager is decoupled from MR. Data Node: stores and serves blocks of data. Task Tracker: runs tasks (work units) within a job. Data Node and Task Tracker share a physical node.
  • 17. CDH: Cloudera’s Distribution Including Apache Hadoop. Build/Test: APACHE BIGTOP. File System Mount: FUSE-DFS. UI Framework/SDK: HUE. Data Mining: APACHE MAHOUT. Workflow: APACHE OOZIE. Scheduling: APACHE OOZIE. Metadata: APACHE HIVE. Languages / Compilers: APACHE PIG, APACHE HIVE. Data Integration: APACHE FLUME, APACHE SQOOP. Fast Read/Write Access: APACHE HBASE. Coordination: APACHE ZOOKEEPER. SCM Express (Installation Wizard for CDH).
  • 18. Books
  • 19. Conclusion • The Key Benefits of Apache Hadoop: – Agility/Flexibility (Quickest Time to Insight). – Complex Data Processing (Any Language, Any Problem). – Scalability of Storage/Compute (Freedom to Grow). – Economical Storage (Keep All Your Data Alive Forever). • The Key Systems for Apache Hadoop are: – Hadoop Distributed File System: self-healing high-bandwidth clustered storage. – MapReduce: distributed fault-tolerant resource management coupled with scalable data processing.
  • 20. Appendix: BACKUP SLIDES
  • 21. Unstructured Data is Exploding (figure labels: Complex, Unstructured; Relational). • 2,500 exabytes of new information in 2012 with Internet as primary driver • Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year. Source: IDC White Paper - sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009.
  • 22. Hadoop Creation History • Fastest sort of a TB, 62 secs over 1,460 nodes • Sorted a PB in 16.25 hours over 3,658 nodes
  • 23. Hadoop in the Enterprise Data Stack (diagram labels). Users: Data Scientists, Analysts, Business Users, System Operators, Data Architects, Customers. Tools and systems: IDEs (Development Tools); BI, Analytics (Business Intelligence Tools); Enterprise Reporting; Cloudera Mgmt Suite; ETL Tools; Enterprise Data Warehouse; Low-Latency Serving Systems; Web Application. Interfaces and connectors: ODBC, JDBC, NFS, Native; Sqoop; Flume. Data sources: Logs, Files, Web Data, Relational Databases.
  • 24. MapReduce Next Gen. Main idea is to split up the JobTracker functions: • Cluster resource management (for tracking and allocating nodes) • Application life-cycle management (for MapReduce scheduling and execution). Enables: • High Availability • Better Scalability • Efficient Slot Allocation • Rolling Upgrades • Non-MapReduce Apps
  • 25. Two Core Use Cases Common Across Many Industries. Listed per industry as ADVANCED ANALYTICS application / DATA PROCESSING application: • Web: Social Network Analysis / Clickstream Sessionization • Media: Content Optimization / Clickstream Sessionization • Telco: Network Analytics / Mediation • Retail: Loyalty & Promotions / Data Factory • Financial: Fraud Analysis / Trade Reconciliation • Federal: Entity Analysis / SIGINT • Bioinformatics: Sequencing Analysis / Genome Mapping • Manufacturing: Product Quality / Mfg Process Tracking
  • 26. What is Cloudera Enterprise? Cloudera Enterprise makes open source Apache Hadoop enterprise-easy: • Simplify and Accelerate Hadoop Deployment • Reduce Adoption Costs and Risks • Lower the Cost of Administration • Increase the Transparency & Control of Hadoop • Leverage the Experience of Our Experts. CLOUDERA ENTERPRISE COMPONENTS: Cloudera Management Suite (Comprehensive Toolset for Hadoop Administration); Production-Level Support (Our Team of Experts On-Call to Help You Meet Your SLAs). 3 of the top 5 telecommunications, mobile services, defense & intelligence, banking, media and retail organizations depend on Cloudera. EFFECTIVENESS: Ensuring Repeatable Value from Apache Hadoop Deployments. EFFICIENCY: Enabling Apache Hadoop to be Affordably Run in Production.
  • 27. Hive vs Pig Latin (count distinct values > 0) • Hive Syntax: SELECT COUNT(DISTINCT col1) FROM mytable WHERE col1 > 0; • Pig Latin Syntax: mytable = LOAD 'myfile' AS (col1, col2, col3); mytable = FOREACH mytable GENERATE col1; mytable = FILTER mytable BY col1 > 0; mytable = DISTINCT mytable; mytable = GROUP mytable ALL; mytable = FOREACH mytable GENERATE COUNT(mytable); DUMP mytable;
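For readers unfamiliar with either language, the computation that slide 27 expresses (count the distinct values of the first column that are greater than 0) can be sketched stage by stage in Python. The function name and the tuple-per-row representation are assumptions made for illustration.

```python
def count_distinct_positive(rows):
    """Count distinct values of the first column that are greater than 0."""
    col1 = (row[0] for row in rows)                           # project col1
    positive = (v for v in col1 if v is not None and v > 0)   # keep col1 > 0
    distinct = set(positive)                                  # drop duplicates
    return len(distinct)                                      # count what remains
```

Each line corresponds to one step of the data flow: projection, filtering, de-duplication, and a final count over the whole relation.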
  • 28. Apache Hive Key Features • A subset of SQL covering the most common statements • JDBC/ODBC support • Agile data types: Array, Map, Struct, and JSON objects • Pluggable SerDe system to work on unstructured files directly • User Defined Functions and Aggregates • Regular Expression support • MapReduce support • Partitions and Buckets (for performance optimization) • MicroStrategy/Tableau Compatibility (through ODBC) • In The Works: Indices, Columnar Storage, Views, Explode/Collect • More details: http://wiki.apache.org/hadoop/Hive
  • 29. Hive Agile Data Types • STRUCTS: – SELECT mytable.mycolumn.myfield FROM … • MAPS (Hashes): – SELECT mytable.mycolumn[mykey] FROM … • ARRAYS: – SELECT mytable.mycolumn[5] FROM … • JSON: – SELECT get_json_object(mycolumn, objpath) FROM …
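To give a feel for what a path lookup into a JSON column (the last item on slide 29) involves, here is a minimal Python sketch. It is not Hive's implementation: `get_json_path` is a hypothetical helper that assumes paths of the form `$.field.field[index]` only, and it returns Python values rather than strings.

```python
import json
import re

# One path step: either ".name" or "[index]".
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")

def get_json_path(json_text, path):
    """Return the value at a simple '$.a.b[0]' style path, or None if absent."""
    if not path.startswith("$"):
        return None
    try:
        node = json.loads(json_text)
    except ValueError:          # malformed JSON
        return None
    pos = 1                     # skip the leading '$'
    while pos < len(path):
        step = _STEP.match(path, pos)
        if step is None:        # unsupported path syntax
            return None
        key, index = step.group(1), step.group(2)
        try:
            node = node[key] if key is not None else node[int(index)]
        except (KeyError, IndexError, TypeError):
            return None         # missing field, bad index, or wrong container type
        pos = step.end()
    return node
```

For example, with the document `{"store": {"fruit": [{"type": "apple", "weight": 8}]}}`, the path `$.store.fruit[0].weight` resolves to the weight field, while a path to a missing field yields `None`.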