The Hadoop Ecosystem


                       J Singh, DataThinks.org

                                   March 12, 2012
The Hadoop Ecosystem
• Introduction
   – What Hadoop is, and what it’s not
   – Origins and History
   – Hello Hadoop
• The Hadoop Bestiary
• The Hadoop Providers
• Hosted Hadoop Frameworks




© J Singh, 2011
What Hadoop is, and what it’s not
• A Framework for Map Reduce

• A Top-level Apache Project

• Hadoop is
   – A framework, not a “solution”
        • Think Linux or J2EE
   – Scalable
   – Great for pipelining massive amounts of data to achieve the end result
   – Sometimes the only option

• Hadoop is not
   – A painless replacement for SQL
   – Uniformly fast or efficient
   – Great for ad hoc analysis


You are ready for Hadoop when…
• You no longer get enthused by the prospect of more data
   – Rate of data accumulation is increasing
   – The idea of moving data from hither to yon is positively scary
   – A hit man threatens to delete your data in the middle of the night
        • And you want to pay him to do it


• Seriously, you are ready for Hadoop when analysis is the bottleneck
   – Could be because of data size
   – Could be because of the complexity of the data
   – Could be because of the level of analysis required
   – Could be because the analysis requirements are fluid




MapReduce Conceptual Underpinnings
• Based on Functional Programming model
   – From Lisp
        • (map square '(1 2 3 4)) → (1 4 9 16)
        • (reduce plus '(1 4 9 16)) → 30
   – From APL
        • +/ N, where N ← 1 2 3 4 → 10


• Easy to distribute (based on each element of the vector)

• New for Map/Reduce: Nice failure/retry semantics
   – Hundreds and thousands of low-end servers are running at the
     same time
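
The Lisp example above can be restated as a minimal Python sketch. The helper name `square` and the list `numbers` are illustrative choices, not part of the original slides.

```python
from functools import reduce
from operator import add


def square(x):
    """Function applied independently to every element by map."""
    return x * x


numbers = [1, 2, 3, 4]

# map touches each element on its own, so the work could be split
# across machines one element (or one chunk) at a time.
squares = list(map(square, numbers))

# reduce folds the mapped values into one result with an associative operator.
total = reduce(add, squares)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

Because `square` never looks at neighboring elements, a failed chunk can simply be recomputed, which is what makes retry semantics cheap.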



MapReduce Flow

                   Word Count Example




Lines:   foo bar | quux foo | foo labs | quux
MapOut:  foo 1, bar 1, quux 1, foo 1, foo 1, labs 1, quux 1
Result:  foo 3, labs 1, quux 2, bar 1
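
The word count flow can be simulated in a single process. This is a sketch, not Hadoop code: the function names `map_words`, `shuffle` and `reduce_counts` are made up for illustration, and the grouping step stands in for what the framework does between the map and reduce phases.

```python
from collections import defaultdict

lines = ["foo bar", "quux foo", "foo labs", "quux"]


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (done by the framework in a real job)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


map_out = [pair for line in lines for pair in map_words(line)]
result = dict(reduce_counts(w, c) for w, c in shuffle(map_out).items())

print(map_out)
print(result)
```

The printed `result` holds the same counts as the Result column of the slide, although the ordering of entries may differ.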



Hello Hadoop
• Word Count
   – Example with Unstructured Data
   – Load 5 books from Gutenberg.org
     into /tmp/gutenberg
   – Load them into HDFS
   – Run Hadoop
        • Results are put into HDFS
   – Copy results into file system

   – What could be simpler?

   – DIY instructions for Amazon EC2
     available on DataThinks.org blog
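
One plausible shape for the "Run Hadoop" step is a pair of Hadoop Streaming style scripts. The sketch below assumes the streaming convention of tab-separated key/value records, with reducer input already sorted by key. The two sample lines are a toy stand-in for the Gutenberg books, and the local `sorted` call plays the role of the cluster's shuffle.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum counts per word; the input must already be sorted by key."""
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local stand-in for the cluster: map, sort (the shuffle), then reduce.
book = ["It was the best of times", "it was the worst of times"]
for record in reducer(sorted(mapper(book))):
    print(record)
```

On a cluster the same two functions would read from standard input and write to standard output, with HDFS holding the input and the results.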




The Hadoop Ecosystem
• Introduction
• The Hadoop Bestiary
   – Core: Hadoop Map Reduce and Hadoop Distributed File System
   – Data Access: HBase, Pig, Hive
   – Algorithms: Mahout
   – Data Import: Flume, Sqoop and Nutch
• The Hadoop Providers
• Hosted Hadoop Frameworks




The Core: Hadoop and HDFS
• Hadoop
   – One master, n slaves
   – Master
        • Schedules mappers & reducers
        • Connects pipeline stages
        • Handles failure semantics

• Hadoop Distributed File System
   – Robust data storage across machines, insulating against failure
   – Keeps n copies of each file (replicated block by block)
        • Configurable number of copies
        • Distributes copies across racks and locations
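
To make "distributes copies across racks" concrete, here is a toy placement routine. It is an illustration only and not HDFS's actual placement policy: the function `place_replicas`, the rack names and the node names are all hypothetical.

```python
from itertools import cycle


def place_replicas(nodes_by_rack, copies):
    """Pick `copies` distinct nodes, visiting racks in turn so copies spread out."""
    pools = {rack: list(nodes) for rack, nodes in nodes_by_rack.items()}
    total = sum(len(nodes) for nodes in pools.values())
    if copies > total:
        raise ValueError("not enough nodes for the requested number of copies")
    chosen = []
    for rack in cycle(list(pools)):
        if len(chosen) == copies:
            break
        if pools[rack]:
            # Take the next unused node from this rack.
            chosen.append((rack, pools[rack].pop(0)))
    return chosen


cluster = {"rack1": ["n1", "n2"], "rack2": ["n3"], "rack3": ["n4", "n5"]}
print(place_replicas(cluster, 3))
```

With three copies and three racks, each copy lands on a different rack, so losing one rack still leaves two copies readable.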




Hadoop Bestiary (p1a): HBase, Pig
• Database Primitives
   – HBase
        • Wide column data structure built on HDFS

• Processing
   – Pig
        • A high(-ish) level data-flow language and execution framework for parallel computation
        • Accesses HDFS and HBase
        • Batch as well as interactive
        • Integrates UDFs written in Java, Python, JavaScript
        • Compiles to map & reduce functions – not 100% efficiently




In Pig (Latin)

   Users    = load 'users' as (name, age);
   Filtered = filter Users by
                     age >= 18 and age <= 25;
   Pages    = load 'pages' as (user, url);
   Joined   = join Filtered by name, Pages by user;
   Grouped  = group Joined by url;
   Summed   = foreach Grouped generate group,
                      COUNT(Joined) as clicks;
   Sorted   = order Summed by clicks desc;
   Top5     = limit Sorted 5;

   store Top5 into 'top5sites';
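
For readers who do not know Pig Latin, the same data flow can be traced in plain Python on in-memory data. The sample `users` and `pages` rows are invented for illustration, and the set-based join assumes each user name appears only once in `users`.

```python
from collections import Counter

# Hypothetical stand-ins for the 'users' and 'pages' inputs.
users = [("amy", 22), ("bob", 31), ("cat", 19), ("dan", 17)]
pages = [("amy", "a.com"), ("cat", "a.com"), ("amy", "b.com"),
         ("bob", "a.com"), ("cat", "c.com"), ("dan", "b.com")]

# Filtered = filter Users by age >= 18 and age <= 25
filtered = {name for name, age in users if 18 <= age <= 25}

# Joined = join Filtered by name, Pages by user
# (set membership is enough here because user names are unique)
joined = [(user, url) for user, url in pages if user in filtered]

# Grouped / Summed = group by url and count the rows in each group
clicks = Counter(url for _, url in joined)

# Sorted / Top5 = order by clicks desc, then keep the first five
top5 = sorted(clicks.items(), key=lambda kv: kv[1], reverse=True)[:5]

print(top5)
```

Each commented step corresponds to one Pig statement; Pig's contribution is running those same steps as distributed jobs over data too large for one machine.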


Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Pig Translation into Map Reduce


 Data flow                      Pig statement
 Load Users                     Users = load …
 Filter by age                  Fltrd = filter …
 Load Pages                     Pages = load …
 Join on name        (Job 1)    Joined = join …
 Group on url                   Grouped = group …
 Count clicks        (Job 2)    Summed = … COUNT()…
 Order by clicks                Sorted = order …
 Take top 5          (Job 3)    Top5 = limit …


Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Hadoop Bestiary (p1b): HBase, Hive
• Database Primitives
   – HBase
        • Wide column data structure built on HDFS

• Processing
   – Hive
        • Data Warehouse Infrastructure
        • Hive QL, a subset of SQL that supports primitives supportable by Map Reduce
        • Support for custom mappers and reducers for more sophisticated analysis
        • Compiles to map & reduce functions – not 100% efficiently

            Hive Example
        CREATE TABLE page_view(viewTime INT, userid BIGINT,
                         page_url STRING, referrer_url STRING,
                         ip STRING COMMENT 'IP Address of the User')
        :: ::
        STORED AS SEQUENCEFILE;

Hadoop Bestiary (p2): Mahout
• Algorithms
   – Mahout
        • Scalable machine learning and data mining
        • Runs on top of Hadoop
        • Written in Java
        • In active development
            – Algorithms being added

• Examples
   – Clustering Algorithms
        • Canopy Clustering
        • K-Means Clustering
        • …
   – Recommenders / Collaborative Filtering Algorithms
   – Other
        • Regression Algorithms
        • Neural Networks
        • Hidden Markov Models




Hadoop Bestiary (p3): Data Import
• Data Import Mechanisms
   – Sqoop: Structured Data
   – Flume: Streams

• Data Import
   – Sqoop
        • Import from RDBMS to HDFS
        • Export too
   – Flume
        • Import streams
            – Text Files
            – System Logs
   – Nutch
        • Import from Web
        • Note: Nutch is built on Lucene, and Hadoop began as part of Nutch




Hadoop Bestiary (p4): Complete Picture




The Hadoop Ecosystem
• Introduction
• The Hadoop Bestiary
• The Hadoop Providers
   – Apache
   – Cloudera
   – Options when your data lives in a Database
• Hosted Hadoop Frameworks




Apache Distribution
• The Definitive Repository
   – The hub for Code, Documentation, Tutorials

   – Many contributors, for example
        • Pig was a Yahoo! Contribution
        • Hive came from Facebook
        • Sqoop came from Cloudera


• Bare metal install option:
   – Download to your machine(s) from Apache
   – Install and Operate
        • Modify to fit your business better




Cloudera
• Cloudera : Hadoop :: Red Hat : Linux

• Cloudera’s Distribution Including Apache Hadoop (CDH)
   – A packaged set of Hadoop modules that work together
   – Now at CDH3
   – Largest contributor of code to Apache Hadoop


• $76M in Venture funding so far




When the data lives in a Database…

• Objective: keeping Analytics and Data as close as possible


• Options for RDBMS:
   – Sqoop data to/from HDFS
        • Need to move the data
   – In-database analytics
        • Available for Teradata, Greenplum, etc.
        • If you have the need
            – And the $$$

• Options for NoSQL Databases
   – Sqoop-like connectors
        • Need to move the data
        • Can utilize all parts of Hadoop
   – Built-in Map Reduce available for most NoSQL databases
        • Knows about and tuned to the storage mechanism
        • But typically only offers map and reduce
            – No Pig, Hive, …



The Hadoop Ecosystem
• Introduction
• The Hadoop Bestiary
• The Hadoop Providers
• Hadoop Platforms as a Service
   – Amazon Elastic MapReduce
   – Hadoop in Windows Azure
   – Google App Engine
   – Other
        • Infochimps
        • IBM SmartCloud




Amazon Elastic Map Reduce (EMR)
• Hosted Map Reduce
   – CLI on your laptop
        • Control over size of cluster
        • Automatic spin-up/down instances


   – Map & Reduce programs on S3
        • Pig, Hive or
        • Custom in Java, Ruby, Python,
          Perl, PHP, R, C++, Cascading


   – Data In/Out on S3 or
   – Data In/Out on DynamoDB


• Keep in mind:
   – Hadoop on EC2 is also an option

Hadoop in Windows Azure
• Basic Level
   – Hive Add-in for Excel
   – Hive ODBC Driver


• Hadoop-based Distribution for Windows Server and Azure
   – Strategic Partnership with Hortonworks
   – Windows-based CLI on your laptop


• Broadest Level
   – JavaScript framework for Hadoop
   – Hadoop connectors for SQL Server and Parallel Data Warehouse




Google App Engine MapReduce
• Map Reduce as a Service
   – Distinct from Google’s internal Map Reduce
   – Part of Google App Engine


• Works with Google Datastore
   – A Wide Column Store


• A “purely programmatic” environment
   – Write Map and Reduce functions in Python / Java




Map Reduce Use at Google




Takeaways
• There are many flavors of Hadoop.
   – The important part is Functional Programming and Map Reduce

   – Don’t let the proliferation of choices stump you.

   – Experiment with it!




Thank you
• J Singh
   – President, Early Stage IT
        • Technology Services and Strategy for Startups


• DataThinks.org is a new service of Early Stage IT
   – “Big Data” analytics solutions





The Hadoop Ecosystem

  • 1. The Hadoop Ecosystem J Singh, DataThinks.org March 12, 2012
  • 2. The Hadoop Ecosystem • Introduction – What Hadoop is, and what it’s not – Origins and History – Hello Hadoop • The Hadoop Bestiary • The Hadoop Providers • Hosted Hadoop Frameworks © J Singh, 2011 2 2
  • 3. What Hadoop is, and what it’s not • A Framework for Map Reduce • A Top-level Apache Project • Hadoop is: a framework, not a “solution” (think Linux or J2EE); scalable; great for pipelining massive amounts of data to achieve the end result; sometimes the only option • Hadoop is not: a painless replacement for SQL; uniformly fast or efficient; great for ad hoc analysis
  • 4. You are ready for Hadoop when… • You no longer get enthused by the prospect of more data – Rate of data accumulation is increasing – The idea of moving data from hither to yon is positively scary – A hit man threatens to delete your data in the middle of the night • And you want to pay him to do it • Seriously, you are ready for Hadoop when analysis is the bottleneck – Could be because of data size – Could be because of the complexity of the data – Could be because of the level of analysis required – Could be because the analysis requirements are fluid
  • 5. MapReduce Conceptual Underpinnings • Based on Functional Programming model – From Lisp • (map square '(1 2 3 4)) ⇒ (1 4 9 16) • (reduce plus '(1 4 9 16)) ⇒ 30 – From APL • +/ N, where N ← 1 2 3 4 • Easy to distribute (based on each element of the vector) • New for Map/Reduce: Nice failure/retry semantics – Hundreds and thousands of low-end servers are running at the same time
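The Lisp examples on slide 5 can be sketched in Python as an illustration. The names `square`, `numbers`, and `chunks` are assumptions made for this sketch and are not from the deck. The second half hints at why the model is easy to distribute: each element is mapped independently, and because addition is associative, partial sums computed on separate chunks can be combined afterwards.

```python
from functools import reduce
from operator import add


def square(x):
    """Map function: applied to each element independently."""
    return x * x


numbers = [1, 2, 3, 4]

# (map square '(1 2 3 4))
squares = list(map(square, numbers))

# (reduce plus '(1 4 9 16))
total = reduce(add, squares)

# Hypothetical split across two workers: each worker maps and reduces
# its own chunk, then the partial results are reduced once more.
chunks = [numbers[:2], numbers[2:]]
partial_sums = [reduce(add, map(square, chunk)) for chunk in chunks]
distributed_total = reduce(add, partial_sums)

print(squares)
print(total, distributed_total)
```

The chunked version gives the same total as the single-pass version only because the reduce function is associative; a non-associative reducer would not split this cleanly.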
  • 6. MapReduce Flow: Word Count Example [Diagram: input Lines containing the words foo, bar, quux and labs → MapOut, one (word, 1) pair per occurrence → Result: foo 3, quux 2, bar 1, labs 1]
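The word count flow on slide 6 can be simulated in a single process as a rough sketch. The exact input lines are not recoverable from the slide, so the `lines` list below is an assumed arrangement chosen to produce the same result counts. The function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical and are not Hadoop APIs.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit one (word, 1) pair per word occurrence."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the list of ones collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


# Assumed input; only the set of words and the final counts come from the slide.
lines = ["foo bar", "quux foo", "foo labs quux"]

map_out = list(map_phase(lines))
result = reduce_phase(shuffle(map_out))
print(map_out)
print(result)
```

In a real cluster the map calls run on different machines and the shuffle moves data over the network; the sketch only shows the data shapes at each stage.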
  • 7. Hello Hadoop • Word Count – Example with Unstructured Data – Load 5 books from Gutenberg.org into /tmp/gutenberg – Load them into HDFS – Run Hadoop • Results are put into HDFS – Copy results into file system – What could be simpler? – DIY instructions for Amazon EC2 available on DataThinks.org blog
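The deck does not say which API its word count job uses. As one possibility, assuming the job is written for Hadoop Streaming (where mapper and reducer exchange tab-separated key/value text lines, and the reducer sees its input sorted by key), the two steps might look like the sketch below. The names `mapper`, `reducer`, and the sample `text` are assumptions; the `sorted` call stands in for the framework's sort between the phases.

```python
from itertools import groupby


def mapper(lines):
    """Turn each input line into 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts for each run of identical keys in key-sorted input."""
    parsed = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Stand-in for a few lines of a downloaded book.
text = ["the quick brown fox", "the lazy dog", "the end"]

shuffled = sorted(mapper(text))  # the framework would do this sort
counts = list(reducer(shuffled))
print("\n".join(counts))
```

In an actual streaming job each function would live in its own script, reading `sys.stdin` and printing to standard output.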
  • 8. The Hadoop Ecosystem • Introduction • The Hadoop Bestiary – Core: Hadoop Map Reduce and Hadoop Distributed File System – Data Access: HBase, Pig, Hive – Algorithms: Mahout – Data Import: Flume, Sqoop and Nutch • The Hadoop Providers • Hosted Hadoop Frameworks
  • 9. The Core: Hadoop and HDFS • Hadoop – One master, n slaves – Master • Schedules mappers & reducers • Connects pipeline stages • Handles failure semantics • Hadoop Distributed File System – Robust data storage across machines, insulating against failure – Keeps n copies of each file • Configurable number of copies • Distributes copies across racks and locations
  • 10. Hadoop Bestiary (p1a): HBase, Pig • Database Primitives – HBase • Wide column data structure built on HDFS • Processing – Pig • A high(-ish) level data-flow language and execution framework for parallel computation • Accesses HDFS and HBase • Batch as well as Interactive • Integrates UDFs written in Java, Python, JavaScript • Compiles to map & reduce functions – not 100% efficiently
  • 11. In Pig (Latin)
      Users = load 'users' as (name, age);
      Filtered = filter Users by age >= 18 and age <= 25;
      Pages = load 'pages' as (user, url);
      Joined = join Filtered by name, Pages by user;
      Grouped = group Joined by url;
      Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
      Sorted = order Summed by clicks desc;
      Top5 = limit Sorted 5;
      store Top5 into 'top5sites';
      Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
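To make the data flow of the Pig script on slide 11 concrete, here is a plain-Python sketch of the same steps on a tiny in-memory data set. The sample `users` and `pages` records are invented for illustration, and the code runs in one process rather than as MapReduce jobs. It also assumes user names are unique, which lets a set membership test stand in for the join.

```python
from collections import Counter

# Hypothetical sample data: (name, age) and (user, url).
users = [("alice", 22), ("bob", 17), ("carol", 25), ("dave", 40)]
pages = [
    ("alice", "a.com"),
    ("carol", "a.com"),
    ("alice", "b.com"),
    ("bob", "b.com"),
    ("dave", "c.com"),
    ("carol", "a.com"),
]

# Filtered = filter Users by age >= 18 and age <= 25
filtered = {name for name, age in users if 18 <= age <= 25}

# Joined = join Filtered by name, Pages by user
joined = [(user, url) for user, url in pages if user in filtered]

# Grouped / Summed: group by url and count the clicks in each group
clicks = Counter(url for _, url in joined)

# Sorted / Top5: order by clicks descending, keep at most five
top5 = clicks.most_common(5)
print(top5)
```

Each commented step corresponds to one statement in the Pig script; Pig's value is that it plans these steps as distributed jobs instead of in-memory loops.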
  • 12. Pig Translation into Map Reduce • Job 1: Load Users, Load Pages, Filter by age, Join on name (Users = load …; Fltrd = filter …; Pages = load …; Joined = join …) • Job 2: Group on url, Count clicks (Grouped = group …; Summed = … count()…) • Job 3: Order by clicks, Take top 5 (Sorted = order …; Top5 = limit …) Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
  • 13. Hadoop Bestiary (p1b): HBase, Hive • Database Primitives – HBase • Wide column data structure built on HDFS • Processing – Hive • Data Warehouse Infrastructure • QL, a subset of SQL that supports primitives supportable by Map Reduce • Support for custom mappers and reducers for more sophisticated analysis • Compiles to map & reduce functions – not 100% efficiently • Hive Example: CREATE TABLE page_view(viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ip STRING COMMENT 'IP Address of the User') :: :: STORED AS SEQUENCEFILE;
  • 14. Hadoop Bestiary (p2): Mahout • Algorithms – Mahout • Scalable machine learning and data mining • Runs on top of Hadoop • Written in Java • In active development – Algorithms being added • Examples – Clustering Algorithms • Canopy Clustering • K-Means Clustering • … – Recommenders / Collaborative Filtering Algorithms – Other • Regression Algorithms • Neural Networks • Hidden Markov Models
  • 15. Hadoop Bestiary (p3): Data Import • Data Import Mechanisms – Sqoop: Structured Data – Flume: Streams • Data Import – Sqoop • Import from RDBMS to HDFS • Export too – Flume • Import streams – Text Files – System Logs – Nutch • Import from Web • Note: Nutch + Hadoop = Lucene
  • 16. Hadoop Bestiary (p4): Complete Picture
  • 17. The Hadoop Ecosystem • Introduction • The Hadoop Bestiary • The Hadoop Providers – Apache – Cloudera – Options when your data lives in a Database • Hosted Hadoop Frameworks
  • 18. Apache Distribution • The Definitive Repository – The hub for Code, Documentation, Tutorials – Many contributors, for example • Pig was a Yahoo! Contribution • Hive came from Facebook • Sqoop came from Cloudera • Bare metal install option: – Download to your machine(s) from Apache – Install and Operate • Modify to fit your business better
  • 19. Cloudera • Cloudera : Hadoop :: Red Hat : Linux • Cloudera’s Distribution Including Apache Hadoop (CDH) – A packaged set of Hadoop modules that work together – Now at CDH3 – Largest contributor of code to Apache Hadoop • $76M in Venture funding so far
  • 20. When the data lives in a Database… • Objective: keeping Analytics and Data as close as possible • Options for RDBMS: – Sqoop data to/from HDFS • Need to move the data • Can utilize all parts of Hadoop – In-database analytics • Available for Teradata, Greenplum, etc. • If you have the need – And the $$$ • Options for NoSQL Databases: – Sqoop-like connectors • Need to move the data – Built-in Map Reduce available for most NoSQL databases • Knows about and tuned to the storage mechanism • But typically only offers map and reduce – No Pig, Hive, …
  • 21. The Hadoop Ecosystem • Introduction • The Hadoop Bestiary • The Hadoop Providers • Hadoop Platforms as a Service – Amazon Elastic MapReduce – Hadoop in Windows Azure – Google App Engine – Other • Infochimps • IBM SmartCloud
  • 22. Amazon Elastic Map Reduce (EMR) • Hosted Map Reduce – CLI on your laptop • Control over size of cluster • Automatic spin-up/down instances – Map & Reduce programs on S3 • Pig, Hive or • Custom in Java, Ruby, Python, Perl, PHP, R, C++, Cascading – Data In/Out on S3 or – Data In/Out on DynamoDB • Keep in mind: – Hadoop on EC2 is also an option
  • 23. Hadoop in Windows Azure • Basic Level – Hive Add-in for Excel – Hive ODBC Driver • Hadoop-based Distribution for Windows Server and Azure – Strategic Partnership with Hortonworks – Windows-based CLI on your laptop • Broadest Level – JavaScript framework for Hadoop – Hadoop connectors for SQL Server and Parallel Data Warehouse
  • 24. Google App Engine MapReduce • Map Reduce as a Service – Distinct from Google’s internal Map Reduce – Part of Google App Engine • Works with Google Datastore – A Wide Column Store • A “purely programmatic” environment – Write Map and Reduce functions in Python / Java
  • 25. Map Reduce Use at Google
  • 26. Take Aways • There are many flavors of Hadoop. – The important part is Functional Programming and Map Reduce – Don’t let the proliferation of choices stump you. – Experiment with it!
  • 27. Thank you • J Singh – President, Early Stage IT • Technology Services and Strategy for Startups • DataThinks.org is a new service of Early Stage IT – “Big Data” analytics solutions

Editor's notes

  1. Sources: Top 5 Reasons Not to Use Hadoop for Analytics; The Dark Side of Hadoop; Hadoop Don’ts: What not to do to harvest Hadoop’s full potential
  2. Get started with Hadoop
  3. http://pig.apache.org/docs/r0.9.2/index.html; Apache Hadoop; Cascading
  4. http://pig.apache.org/docs/r0.9.2/index.html
  5. Flume Users Guide; Thrift Paper
  6. Missing components: Cascading