The Family of Hadoop
            Nham Xuan Nam
     nhamxuannam [at] gmail.com
     http://namnham.blogspot.com




    Barcamp Saigon, December 13 2009
Content
   History
   Sub-projects
   HDFS
   Map Reduce
   HBase
   Hive
History
   Hadoop was created by Doug Cutting, the creator of
    Lucene.
   Lucene: open source index & search library.
   Nutch: Lucene-based web crawler.
   Jun 2003, there was a successful 100
    million page Nutch demo system.
   Nutch problem: its architecture could not
    scale to billions of pages.
History
   Oct 2003, Google published the paper
    “The Google File System”.
   In 2004, the Nutch team wrote an open source implementation
    of GFS, called Nutch Distributed File System (NDFS).
   Dec 2004, Google published the paper “MapReduce:
    Simplified Data Processing on Large Clusters”.
   In 2005, the Nutch team implemented MapReduce in Nutch.
   Mid 2005, all the major Nutch algorithms had been ported
    to run using MapReduce and NDFS.
History
   Feb 2006, NDFS and the MapReduce implementation
    moved out of Nutch to form the Hadoop project.
   Doug Cutting joined Yahoo!.
   Jan 2008, Hadoop became an Apache top-level
    project.
   Feb 2008, Yahoo!'s production search index
    was generated by a 10,000-core Hadoop
    cluster.
History




Source: http://wiki.apache.org/hadoop/PoweredBy
Sub-projects
Architecture
Data Model
   Files are stored as blocks (default size: 64 MB)
   Reliability through replication
    – Each block is replicated to several datanodes
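
The block and replication idea can be sketched in a few lines of Python. This is only an illustration: the 64 MB block size comes from the slide, while the replication factor of 3, the datanode names, and the round-robin placement are assumptions (real HDFS uses a rack-aware placement policy).

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size, as stated on the slide
REPLICATION = 3                # assumed value; the slide only says "several"


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return an (offset, length) pair for every block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement; NOT the policy HDFS actually uses."""
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


# A hypothetical 200 MB file on a hypothetical four-node cluster.
blocks = split_into_blocks(200 * 1024 * 1024)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

The last block is shorter than the others because a file only occupies as many bytes as it needs; losing one datanode still leaves two copies of every block in this sketch.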
Namenode & Datanodes
   Namenode (master)
    – manages the filesystem namespace
    – maintains the filesystem tree and metadata for all the
      files and directories in the tree.

   Datanodes (slaves)
    – store data in the local file system
    – Periodically report back to the namenode with lists of all
      existing blocks

   Clients communicate with both namenode and datanodes.
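
A minimal sketch of this division of labour, assuming nothing about the real wire protocol: the namenode keeps only metadata, the datanodes keep bytes and send block reports, and the client asks the namenode where blocks live before reading from datanodes. All class names, method names, and block ids below are hypothetical.

```python
class Namenode:
    """Holds metadata only: the namespace and the block locations."""

    def __init__(self):
        self.namespace = {}        # path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of datanode names

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        # Forget the previous report of this datanode, then record the new one.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.namespace[path]
        ]


class Datanode:
    """Stores block contents locally and reports what it holds."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def report_to(self, namenode):
        namenode.receive_block_report(self.name, list(self.blocks))


def read_file(namenode, datanodes, path):
    """Client side: metadata from the namenode, data from the datanodes."""
    data = b""
    for block_id, holders in namenode.locate(path):
        data += datanodes[holders[0]].blocks[block_id]
    return data


namenode = Namenode()
datanodes = {"dn1": Datanode("dn1"), "dn2": Datanode("dn2")}
namenode.add_file("/barcamp/hello.txt", ["blk_1", "blk_2"])
datanodes["dn1"].store("blk_1", b"Hello ")
datanodes["dn2"].store("blk_1", b"Hello ")
datanodes["dn2"].store("blk_2", b"Barcamp")
for datanode in datanodes.values():
    datanode.report_to(namenode)
content = read_file(namenode, datanodes, "/barcamp/hello.txt")
```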
Data Flow
Accessibility
   FileSystem Java API
    – org.apache.hadoop.fs.*

   Web Interface

   Commands for HDFS users
$ hadoop dfs -mkdir /barcamp

$ hadoop dfs -ls /barcamp

   Commands for HDFS admins
$ hadoop dfsadmin -report

$ hadoop dfsadmin -refreshNodes
Programming Model
   Data is a stream of keys and values
   Map

    – Input: <key1,value1> pairs from data source

    – Output: intermediate <key2,value2> pairs

   Reduce
    – Called once per key, in sorted order
       Input: <key2, list of value2>

       Output: <key3,value3> pairs
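
The model above can be simulated on a single machine. The sketch below is not Hadoop's API: `run_map_reduce`, the mapper, the reducer, and the sample records are all made up for illustration. It only shows the three steps (map, group by key, reduce once per key in sorted order).

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Single-process simulation of map -> group by key -> reduce."""
    # Map: every <key1, value1> pair may emit any number of <key2, value2> pairs.
    intermediate = []
    for key1, value1 in records:
        intermediate.extend(mapper(key1, value1))

    # Group: collect all values that share the same key2.
    grouped = defaultdict(list)
    for key2, value2 in intermediate:
        grouped[key2].append(value2)

    # Reduce: called once per key, keys visited in sorted order.
    output = []
    for key2 in sorted(grouped):
        output.extend(reducer(key2, grouped[key2]))
    return output


def reading_mapper(_line_number, line):
    """Hypothetical input format: '<city> <temperature>' per line."""
    city, temperature = line.split()
    return [(city, int(temperature))]


def max_reducer(city, temperatures):
    return [(city, max(temperatures))]


records = [(0, "saigon 31"), (1, "hanoi 24"), (2, "saigon 34"), (3, "hanoi 27")]
result = run_map_reduce(records, reading_mapper, max_reducer)
# result == [("hanoi", 27), ("saigon", 34)]
```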
Data Flow
WordCount Example
 File01:                                  File02:
 Hello Barcamp Hello Everyone             Hello Hadoop Hello Everyone

 Map input:
<_, Hello Barcamp Hello Everyone>       <_, Hello Hadoop Hello Everyone>

 Map output (counts combined per file):
         <Hello,     2>                          <Hello,     2>
         <Barcamp,   1>                          <Hadoop,    1>
         <Everyone,  1>                          <Everyone,  1>

 Reduce input (grouped and sorted by key):
                          <Barcamp,    [1]>
                          <Everyone,   [1,1]>
                          <Hadoop,     [1]>
                          <Hello,      [2,2]>

 Reduce output:
                            <Barcamp,   1>
                            <Everyone,  2>
                            <Hadoop,    1>
                            <Hello,     4>
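
The per-file counts in this example (such as <Hello, 2>) suggest that each map output is summed locally before it is sent to the reducer, which is what a combiner does in Hadoop. The following sketch reproduces the numbers of the example under that assumption; it is plain Python, not Hadoop code.

```python
from collections import Counter, defaultdict

files = {
    "File01": "Hello Barcamp Hello Everyone",
    "File02": "Hello Hadoop Hello Everyone",
}


def map_with_combiner(text):
    """Emit (word, 1) per word, then sum locally, as a combiner would."""
    return dict(Counter(text.split()))


# One map task per file, each producing locally combined counts.
map_outputs = {name: map_with_combiner(text) for name, text in files.items()}

# Shuffle: gather the partial counts of each word from all map outputs.
shuffled = defaultdict(list)
for name in sorted(map_outputs):
    for word, count in map_outputs[name].items():
        shuffled[word].append(count)

# Reduce: sum the partial counts, visiting the words in sorted order.
totals = {word: sum(counts) for word, counts in sorted(shuffled.items())}
```

Combining on the map side does not change the final totals; it only reduces how many pairs have to travel between the map and reduce steps.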
MapReduce in Hadoop
   JobTracker (master)
    – handles all jobs.
    – schedules tasks on the slaves.
    – monitors tasks and re-executes failed ones.

   TaskTrackers (slaves)
    – execute the tasks.

   Task
    – runs an individual map or reduce operation.
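
As a rough illustration of scheduling and re-execution, the toy loop below hands tasks to trackers in turn and puts a failed task back in the queue. The function, the task and tracker names, and the way failures are injected are all hypothetical; Hadoop's real scheduler also considers data locality and free slots, which this sketch ignores.

```python
from collections import deque


def run_job(tasks, trackers, fails_first_attempt=()):
    """Assign tasks round-robin and re-queue any task whose attempt fails.

    fails_first_attempt is a made-up knob: ids of tasks whose first attempt
    is treated as failed, so the re-execution path can be observed.
    """
    pending = deque((task, 1) for task in tasks)
    log = []
    slot = 0
    while pending:
        task, attempt = pending.popleft()
        tracker = trackers[slot % len(trackers)]
        slot += 1
        succeeded = not (task in fails_first_attempt and attempt == 1)
        log.append((task, attempt, tracker, succeeded))
        if not succeeded:
            # The master notices the failure and schedules another attempt.
            pending.append((task, attempt + 1))
    return log


log = run_job(
    ["map-0", "map-1", "map-2"],
    ["tt1", "tt2"],
    fails_first_attempt={"map-1"},
)
```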
Introduction
   Nov 2006, Google released the paper “Bigtable: A
    Distributed Storage System for Structured Data”
   Bigtable: distributed, column-oriented store, built on top of
    Google File System.
   HBase: open source implementation of Bigtable, built on
    top of HDFS.
Data Model
   Data are stored in tables of rows and columns.
   Cells are “versioned”
→ Data are addressed by row/column/version key.
   Table rows are sorted by row key, the table's primary key.
   Columns are grouped into column families.
→ A column name has the form “<family>:<label>”
   Tables are stored in regions.
   Region: a row range [start-key : end-key)
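
The addressing scheme can be mimicked with nested dictionaries. This is a sketch of the data model only, not of the HBase client API; the `Table` class, its methods, and the sample rows are invented for illustration, and timestamps are plain integers.

```python
class Table:
    """Cells addressed by row key / 'family:label' column / version."""

    def __init__(self):
        # row key -> {column name -> [(timestamp, value), ...], newest first}
        self.rows = {}

    def put(self, row, column, value, timestamp):
        versions = self.rows.setdefault(row, {}).setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0], reverse=True)

    def get(self, row, column, timestamp=None):
        """Newest value, or the newest one at or before `timestamp`."""
        for version_ts, value in self.rows.get(row, {}).get(column, []):
            if timestamp is None or version_ts <= timestamp:
                return value
        return None

    def scan(self, start_key, end_key):
        """Row keys in the half-open range [start_key, end_key), sorted."""
        return [key for key in sorted(self.rows) if start_key <= key < end_key]


table = Table()
table.put("com.example/index", "contents:html", "<v1>", timestamp=1)
table.put("com.example/index", "contents:html", "<v2>", timestamp=2)
table.put("org.apache/hadoop", "anchor:text", "Hadoop", timestamp=1)

latest = table.get("com.example/index", "contents:html")
older = table.get("com.example/index", "contents:html", timestamp=1)
in_range = table.scan("com", "org")
```

Because rows are kept in key order, a region is simply one contiguous slice of that order, which is what `scan` returns here.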
Architecture
   Master Server
    – assigns regions to regionservers
    – monitors the health of regionservers
    – handles administrative functions

   RegionServers
     – contain regions and handle client read/write requests

   Catalog Tables (-ROOT- and .META.)
     – maintain the current list, state, recent history, and
       location of all regions.
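
Conceptually, finding the regionserver for a row is a search over the sorted region start keys recorded in the catalog. The sketch below assumes a hand-written catalog with three regions and two servers; the names and key ranges are hypothetical, and the real catalog tables store more than this.

```python
import bisect

# Hypothetical catalog content: (start key, end key, regionserver).
# "" means "from the first row", None means "no upper bound".
REGIONS = [
    ("", "g", "rs1"),
    ("g", "p", "rs2"),
    ("p", None, "rs1"),
]
START_KEYS = [start for start, _end, _server in REGIONS]


def locate_region(row_key):
    """Return the region [start, end) containing row_key and its server."""
    index = bisect.bisect_right(START_KEYS, row_key) - 1
    return REGIONS[index]


first = locate_region("barcamp")
boundary = locate_region("g")
last = locate_region("zebra")
```

A row key equal to a region's start key belongs to that region, matching the half-open [start-key : end-key) convention.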
Accessibility
   Client API
    – org.apache.hadoop.hbase.client.*

   HBase Shell
$ bin/hbase shell
hbase> 

   Web Interface
Introduction
   started at Facebook
   an open source data warehousing solution
    built on top of Hadoop
   for managing and querying structured data
   Hive QL: SQL-like query language
    – compiled into map-reduce jobs
   used for log processing, data mining, ...
Data Model
   Tables
    – analogous to tables in RDBMS
    – rows are organized into typed columns
    – all the data is stored in a directory in HDFS

   Partitions
    – determine the distribution of data within sub-directories
      of the table directory

   Buckets
    – based on the hash of a column in the table
    – Each bucket is stored as a file in the partition directory
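
The layout described above can be sketched as a path computation. Several details here are assumptions made for illustration: the warehouse root, the table and column names, the bucket file naming, and the use of CRC32 as the hash (Hive uses its own hash function).

```python
import zlib

WAREHOUSE = "/user/hive/warehouse"  # assumed warehouse root directory


def bucket_for(value, num_buckets):
    """Stable stand-in hash; Hive's own hash function is different."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def row_location(table, partition, bucket_value, num_buckets):
    """Directory per partition value, one file per bucket inside it."""
    partition_dirs = "/".join(f"{key}={value}" for key, value in partition)
    bucket = bucket_for(bucket_value, num_buckets)
    # The bucket file name format below is hypothetical.
    return f"{WAREHOUSE}/{table}/{partition_dirs}/bucket_{bucket:05d}"


path = row_location(
    table="page_views",
    partition=[("ds", "2009-12-13"), ("country", "VN")],
    bucket_value="user_42",
    num_buckets=32,
)
```

Rows with the same partition values land in the same sub-directory, and rows with the same value in the bucketing column always land in the same file within it.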
Architecture
   Metastore
    – contains metadata about data stored in Hive.
    – stored in any SQL backend or an embedded Derby database.
    – Database: a namespace for tables
    – Table metadata: column types, physical layout,...
    – Partition metadata

   Compiler

   Execution Engine

   Shell
Hive Query Language
   Data Definition (DDL) statements
    – CREATE/DROP/ALTER TABLE
    – SHOW TABLES/PARTITIONS

   Data Manipulation (DML) statements
    – LOAD DATA
    – INSERT
    – SELECT

   User Defined functions: UDF/UDAF
Hive @ Facebook
The End




Thank you!
