The Family of Hadoop
            Nham Xuan Nam
     nhamxuannam [at] gmail.com
     http://namnham.blogspot.com




    Barcamp Saigon, December 13 2009
Content
   History
   Sub-projects
   HDFS
   Map Reduce
   HBase
   Hive
History
   Hadoop was created by Doug Cutting, the creator of
    Lucene.
   Lucene: open source index & search library.
   Nutch: Lucene-based web crawler.
   Jun 2003, there was a successful 100
    million page Nutch demo system.
   Nutch problem: its architecture could not
    scale to billions of pages.
History
   Oct 2003, Google published the paper
    “The Google File System”.
   In 2004, the Nutch team wrote an open source implementation
    of GFS, called the Nutch Distributed File System (NDFS).
   Dec 2004, Google published the paper “MapReduce:
    Simplified Data Processing on Large Clusters”.
   In 2005, the Nutch team implemented MapReduce in Nutch.
   By mid 2005, all the major Nutch algorithms had been ported
    to run using MapReduce and NDFS.
History
   Feb 2006, Nutch's NDFS and MapReduce
    implementation formed the Hadoop project.
   Doug Cutting joined Yahoo!.
   Jan 2008, Hadoop became an Apache top-level
    project.
   Feb 2008, Yahoo!'s production search index
    was generated by a 10,000-core Hadoop
    cluster.
History




Source: http://wiki.apache.org/hadoop/PoweredBy
Sub-projects
Architecture
Data Model
   Files are stored as blocks (default size: 64 MB)
   Reliability through replication
    – Each block is replicated to several datanodes
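
The block and replication idea above can be sketched in a few lines of Python. This is only an illustration: the 64 MB block size comes from the slide, while the replication factor of 3, the datanode names, and the round-robin placement are assumptions made for the example (real HDFS placement also considers racks and free space).

```python
BLOCK_SIZE = 64 * 1024 * 1024  # default block size from the slide (64 MB)
REPLICATION = 3                # assumption: the slide only says "several"


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct datanodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + i) % len(datanodes)] for i in range(replication)
        ]
    return placement


# A 150 MB file becomes two full blocks plus one 22 MB block.
blocks = split_into_blocks(150 * 1024 * 1024)
# Hypothetical datanode names, used only for this sketch.
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Losing any single datanode in this sketch still leaves two copies of every block, which is the point of the replication bullet.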
Namenode & Datanodes
   Namenode (master)
    – manages the filesystem namespace
    – maintains the filesystem tree and metadata for all the
      files and directories in the tree.

   Datanodes (slaves)
    – store data in the local file system
    – Periodically report back to the namenode with lists of all
      existing blocks

   Clients communicate with both namenode and datanodes.
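
A toy model may help to see how the roles fit together. The class below is a hypothetical sketch, not Hadoop code: it keeps the namespace (path to block ids) on the namenode, learns block locations only from datanode reports, and returns locations so that a client can then read from the datanodes directly. All names (`ToyNameNode`, `blk_1`, `dn1`) are invented for illustration.

```python
class ToyNameNode:
    """Minimal stand-in for the namenode's bookkeeping."""

    def __init__(self):
        self.namespace = {}        # path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of datanode names

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """Record the full list of blocks a datanode says it currently holds."""
        # Forget what this datanode reported earlier, then apply the new list.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Return [(block_id, datanodes)] for a file, in block order."""
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.namespace[path]
        ]


nn = ToyNameNode()
nn.add_file("/barcamp/slides.txt", ["blk_1", "blk_2"])
nn.block_report("dn1", ["blk_1", "blk_2"])
nn.block_report("dn2", ["blk_1"])
locations = nn.locate("/barcamp/slides.txt")
```

The client asks the namenode only for metadata; the block contents themselves would be fetched from the listed datanodes.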
Data Flow
Data Flow
Accessibility
   FileSystem Java API
    – org.apache.hadoop.fs.*

   Web Interface

   Commands for HDFS users
$ hadoop dfs -mkdir /barcamp

$ hadoop dfs -ls /barcamp

   Commands for HDFS admins
$ hadoop dfsadmin -report

$ hadoop dfsadmin -refreshNodes
Programming Model
Programming Model
   Data is a stream of keys and values
   Map

    – Input: <key1,value1> pairs from data source

    – Output: intermediate <key2,value2> pairs

   Reduce
    – Called once per key, in sorted key order
    – Input: <key2, list of value2>

    – Output: <key3,value3> pairs
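
The model above can be sketched as a single-process Python function. This is a teaching sketch under obvious assumptions (everything fits in memory, no distribution, no fault tolerance); `run_mapreduce`, `word_mapper`, and `sum_reducer` are names made up for the example and are not part of the Hadoop API.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, group-by-key, and reduce over an iterable of (key1, value1)."""
    # Map: each <key1, value1> yields zero or more <key2, value2> pairs.
    intermediate = []
    for key1, value1 in records:
        intermediate.extend(mapper(key1, value1))

    # Shuffle: collect all values that share the same key2.
    groups = defaultdict(list)
    for key2, value2 in intermediate:
        groups[key2].append(value2)

    # Reduce: called once per key, with keys visited in sorted order.
    output = []
    for key2 in sorted(groups):
        output.extend(reducer(key2, groups[key2]))
    return output


def word_mapper(_offset, line):
    """Emit <word, 1> for every word in the line; the input key is ignored."""
    return [(word, 1) for word in line.split()]


def sum_reducer(word, counts):
    """Emit one <word, total> pair."""
    return [(word, sum(counts))]


result = run_mapreduce(
    [(0, "Hello Barcamp Hello Everyone")], word_mapper, sum_reducer
)
# result == [("Barcamp", 1), ("Everyone", 1), ("Hello", 2)]
```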
Data Flow
WordCount Example
 File01:                                  File02:
 Hello Barcamp Hello Everyone             Hello Hadoop Hello Everyone

<_, Hello Barcamp Hello Everyone>       <_, Hello Hadoop Hello Everyone>




         <Hello,     2>                          <Hello,     2>
         <Barcamp, 1>                            <Hadoop,    1>
         <Everyone,  1>                          <Everyone,  1>

                          <Barcamp,    [1]>
                          <Hadoop,     [1]>
                          <Hello,      [2,2]>
                          <Everyone,   [1,1]>




                            <Barcamp, 1>
                            <Hadoop,    1>
                            <Hello,     4>
                            <Everyone,  2>
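
The numbers in this example can be reproduced with a short Python sketch. Note that the per-file output already shows <Hello, 2>, which suggests that repeated words are summed locally on the map side before the shuffle (a combine step); the code below assumes that reading. The function and variable names are illustrative only.

```python
from collections import Counter, defaultdict

files = {
    "File01": "Hello Barcamp Hello Everyone",
    "File02": "Hello Hadoop Hello Everyone",
}


def map_with_combine(text):
    """Map one file to <word, count> pairs, summing repeats locally."""
    return dict(Counter(text.split()))


# One map output per input file, as in the second row of the slide.
map_outputs = {name: map_with_combine(text) for name, text in files.items()}

# Shuffle: gather the partial counts for each word from all map outputs.
grouped = defaultdict(list)
for name in sorted(map_outputs):
    for word, count in map_outputs[name].items():
        grouped[word].append(count)

# Reduce: add up the partial counts for each word.
totals = {word: sum(counts) for word, counts in grouped.items()}
```

Without the local summing, the mappers would emit <Hello, 1> twice per file and the reducer would see <Hello, [1,1,1,1]>; the final totals would be the same, only more data would cross the network.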
MapReduce in Hadoop
   JobTracker (master)
    – handles all jobs.
    – schedules tasks on the slaves.
    – monitors tasks and re-executes failed ones.

   TaskTrackers (slaves)
    – execute the tasks.

   Task
    – runs an individual map or reduce.
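
The scheduling and re-execution role of the master can be illustrated with a small Python loop. This is a deliberately simplified sketch: the round-robin assignment, the `max_attempts` limit, and the tracker and task names are all assumptions for the example, and a real JobTracker does far more (heartbeats, data locality, and so on).

```python
from collections import deque


def run_job(tasks, trackers, run_task, max_attempts=3):
    """Assign tasks to trackers round-robin; re-queue a task when it fails."""
    pending = deque((task, 1) for task in tasks)
    results = {}
    turn = 0
    while pending:
        task, attempt = pending.popleft()
        tracker = trackers[turn % len(trackers)]
        turn += 1
        try:
            results[task] = run_task(tracker, task)
        except RuntimeError:
            if attempt >= max_attempts:
                raise
            # Re-execute later, most likely on a different tracker.
            pending.append((task, attempt + 1))
    return results


def flaky_run(tracker, task):
    """Hypothetical task runner in which tracker 'tt2' always fails."""
    if tracker == "tt2":
        raise RuntimeError(f"{task} failed on {tracker}")
    return f"{task} done on {tracker}"


results = run_job(["map-0", "map-1", "reduce-0"], ["tt1", "tt2"], flaky_run)
```

In this run `map-1` fails twice on `tt2` and only completes once its turn falls on `tt1`, which mirrors the "monitoring & re-executing" bullet.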
MapReduce in Hadoop
Introduction
   Nov 2006, Google released the paper “Bigtable: A
    Distributed Storage System for Structured Data”
   Bigtable: distributed, column-oriented store, built on top of
    the Google File System.
   HBase: open source implementation of Bigtable, built on
    top of HDFS.
Data Model
   Data are stored in tables of rows and columns.
   Cells are “versioned”
→ Data are addressed by row/column/version key.
   Table rows are sorted by row key, the table's primary key.
   Columns are grouped into column families.
→ A column name has the form “<family>:<label>”
   Tables are stored in regions.
   Region: a row range [start-key : end-key)
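
A small in-memory Python class can make this data model concrete. It is only a sketch of the addressing scheme described above (row, `<family>:<label>` column, version) and of a half-open row range; `ToyTable` and its methods are invented names, not the HBase client API.

```python
import bisect


class ToyTable:
    """Cells addressed by row key, 'family:label' column, and version."""

    def __init__(self):
        self.rows = {}  # row key -> {column -> {version -> value}}

    def put(self, row, column, version, value):
        family, sep, _label = column.partition(":")
        if not (family and sep):
            raise ValueError("column must look like '<family>:<label>'")
        self.rows.setdefault(row, {}).setdefault(column, {})[version] = value

    def get(self, row, column, version=None):
        """Return the value at a version, or the newest one if none is given."""
        versions = self.rows[row][column]
        if version is None:
            version = max(versions)
        return versions[version]

    def scan(self, start_key, end_key):
        """Return the row keys in [start_key, end_key), in sorted order."""
        keys = sorted(self.rows)
        lo = bisect.bisect_left(keys, start_key)
        hi = bisect.bisect_left(keys, end_key)
        return keys[lo:hi]


table = ToyTable()
table.put("row-a", "info:title", 1, "draft")
table.put("row-a", "info:title", 2, "final")
table.put("row-b", "info:title", 1, "hello")
table.put("row-c", "info:title", 1, "world")
```

Here `table.get("row-a", "info:title")` returns the newest version, and `table.scan("row-a", "row-c")` leaves out `row-c` because the end key is exclusive, just like a region's row range.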
Data Model
Architecture
Architecture
   Master Server
    – assigns regions to regionservers
    – monitors the health of regionservers
    – handles administrative functions

   RegionServers
     – contain regions and handle client read/write requests

   Catalog Tables (ROOT and META)
     – maintain the current list, state, recent history, and
       location of all regions.
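
How a client could use such a catalog to find the regionserver for a row can be sketched as follows. The catalog contents, the server names, and the tuple layout are hypothetical; the sketch only assumes what the slides state, namely that each region covers a row range [start-key : end-key) and that the catalog records where each region lives.

```python
import bisect

# Hypothetical catalog entries: (start_key, end_key, regionserver),
# sorted by start_key. An end_key of None means "no upper bound".
CATALOG = [
    ("", "g", "rs1"),
    ("g", "p", "rs2"),
    ("p", None, "rs1"),
]


def locate_region(catalog, row_key):
    """Return the catalog entry whose [start, end) range contains row_key."""
    starts = [start for start, _end, _server in catalog]
    # The rightmost region whose start key is <= row_key.
    index = bisect.bisect_right(starts, row_key) - 1
    start, end, server = catalog[index]
    if end is not None and row_key >= end:
        raise LookupError(f"no region covers {row_key!r}")
    return start, end, server


region = locate_region(CATALOG, "hadoop")
# region == ("g", "p", "rs2")
```

Because rows are sorted by key, a binary search over the region start keys is enough; the client then sends its read or write to the regionserver named in the entry.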
Accessibility
   Client API
org.apache.hadoop.hbase.client.*

   HBase Shell
$ bin/hbase shell
hbase> 

   Web Interface
Introduction
   started at Facebook
   an open source data warehousing solution
    built on top of Hadoop
   for managing and querying structured data
   Hive QL: SQL-like query language
    – compiled into map-reduce jobs
   used for log processing, data mining, ...
Data Model
   Tables
    – analogous to tables in RDBMS
    – rows are organized into typed columns
    – all the data is stored in a directory in HDFS

   Partitions
    – determine the distribution of data within sub-directories
      of the table directory

   Buckets
    – based on the hash of a column in the table
    – Each bucket is stored as a file in the partition directory
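
The directory layout described above can be sketched as a path computation. Several details here are assumptions made for illustration: the warehouse root, the `bucket_NNNNN` file name, and the use of CRC32 as the hash (Hive uses its own hash function and file naming). Only the nesting of table directory, partition sub-directories, and bucket file follows the slide.

```python
import zlib

WAREHOUSE = "/user/hive/warehouse"  # assumed warehouse root for this sketch


def bucket_for(value, num_buckets):
    """Pick a bucket from a hash of the column value (CRC32 is a stand-in)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def data_path(table, partition, bucket_value, num_buckets):
    """Build <table dir>/<partition dirs>/<bucket file> for one row."""
    parts = "/".join(f"{col}={val}" for col, val in partition)
    bucket = bucket_for(bucket_value, num_buckets)
    # The bucket file name is hypothetical, chosen only for readability.
    return f"{WAREHOUSE}/{table}/{parts}/bucket_{bucket:05d}"


path = data_path(
    "page_views",
    [("dt", "2009-12-13"), ("country", "VN")],
    bucket_value="user42",
    num_buckets=32,
)
```

A query that filters on `dt` or `country` only has to read the matching sub-directories, and rows with the same bucketing column value always land in the same file within a partition.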
Architecture
Architecture
   Metastore
    – contains metadata about data stored in Hive.
    – stored in any SQL backend or an embedded Derby database.
    – Database: a namespace for tables
    – Table metadata: column types, physical layout,...
    – Partition metadata

   Compiler

   Execution Engine

   Shell
Hive Query Language
   Data Definition (DDL) statements
    – CREATE/DROP/ALTER TABLE
    – SHOW TABLES/PARTITIONS

   Data Manipulation (DML) statements
    – LOAD DATA
    – INSERT
    – SELECT

   User-defined functions: UDF/UDAF
Hive @ Facebook
The End




Thank you!
