Apache Hadoop

Presented by:
Darpan Dekivadiya (09BCE008)
What is Hadoop?
• A framework for storing and processing big data on
  lots of commodity machines.
   o Up to 4,000 machines in a cluster
   o Up to 20 PB in a cluster

• Open Source Apache project
• High reliability implemented in software
   o Automated fail-over for data and computation

• Implemented in Java



Hadoop development
• Hadoop was created by Doug Cutting.
• It is named after his son's toy elephant.
• It was originally developed to support the Nutch search engine project.
• Since then, many companies have adopted it and contributed to the project.
Hadoop Ecosystem
• Apache Hadoop is a collection of open-source software
  for reliable, scalable, distributed computing.
• Hadoop Common: The common utilities that support the
  other Hadoop subprojects.
• HDFS: A distributed file system that provides high
  throughput access to application data.
• MapReduce: A software framework for distributed
  processing of large data sets on compute clusters.
• Pig: A high-level data-flow language and execution
  framework for parallel computation.
• HBase: A scalable, distributed database that supports
  structured data storage for large tables.

Hadoop, Why?
• Need to process multi-petabyte datasets.
• Expensive to build reliability into each application.
• Nodes fail every day:
  o Failure is expected, rather than exceptional.
  o The number of nodes in a cluster is not constant.
• Need common infrastructure:
  o Efficient, reliable, Open Source (Apache License).
• The goals above are the same as Condor's, but:
  o Workloads are IO bound, not CPU bound.
Hadoop History
• Dec 2004 – Google GFS paper published
• July 2005 – Nutch (search engine) uses MapReduce
• Feb 2006 – Starts as a Lucene subproject
• Apr 2007 – Yahoo! runs it on a 1000-node cluster
• Jan 2008 – Becomes an Apache Top Level Project
• May 2009 – Hadoop sorts a petabyte in 17 hours
• Aug 2010 – World's largest Hadoop cluster at:
  o Facebook
  o 2900 nodes, 30+ petabytes
Who uses Hadoop?
•   Amazon/A9
•   Facebook
•   Google
•   IBM
•   Joost
•   Last.fm
•   New York Times
•   PowerSet
•   Veoh
•   Yahoo!

Applications of Hadoop
• Search
  o Yahoo, Amazon, Zvents

• Log processing
  o Facebook, Yahoo, ContextWeb, Joost, Last.fm

• Recommendation Systems
  o Facebook

• Data Warehouse
  o Facebook, AOL

• Video and Image Analysis
  o New York Times, Eyealike




Who generates the data?
• Lots of data is generated on Facebook
   o 500+ million active users
   o 30 billion pieces of content shared every month (news stories, photos,
     blogs, etc)


• Lots of data is generated for the Yahoo! search engine.
• Lots of data is generated at the Amazon S3 cloud service.




Data usage
• Statistics per day:
  o 20 TB of compressed new data added per day
  o 3 PB of compressed data scanned per day
  o 20K jobs on the production cluster per day
  o 480K compute hours per day
• Barrier to entry is significantly reduced:
  o New engineers go through a Hadoop/Hive training session
  o 300+ people run jobs on Hadoop
  o Analysts (non-engineers) use Hadoop through Hive
HDFS
Hadoop Distributed File System




Based on Google File System




Redundant storage




Commodity Hardware
• Typically a 2-level architecture:
  o Nodes are commodity PCs
  o 20-40 nodes per rack
  o The default block size in Apache Hadoop is 64 MB.
  o Relational databases, by comparison, typically store data blocks in sizes ranging from 4 KB to 32 KB.
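As an aside, a client can ask for a different block size when it creates a file. Below is a minimal, hypothetical Java sketch using Hadoop's FileSystem.create() overload that accepts a block size; the path, buffer size, and replication factor are illustrative values, not from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up core-site.xml etc.
    FileSystem fs = FileSystem.get(conf);
    // Create a file with a 128 MB block size instead of the 64 MB default
    // (path, buffer size, and replication factor are made-up example values).
    FSDataOutputStream out = fs.create(
        new Path("/user/demo/big.dat"),
        true,                 // overwrite if the file exists
        4096,                 // I/O buffer size in bytes
        (short) 3,            // replication factor
        128L * 1024 * 1024);  // block size in bytes
    out.writeUTF("hello");
    out.close();
  }
}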
How does HDFS maintain everything?
• Two types of nodes:
  o A single NameNode and a number of DataNodes

• NameNode
  o Holds file names, permissions, modified flags, etc.
  o Exposes data locations so that computations can be moved to the data.

• DataNode
  o Stores and retrieves blocks when told to.
  o HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software.
How does HDFS work?
• The NameNode executes file system namespace
  operations like opening, closing, and renaming files
  and directories. It also determines the mapping of
  blocks to DataNodes.
• The DataNodes are responsible for serving read and
  write requests from the file system's clients.




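To make this division of labor concrete, here is a minimal, hypothetical sketch of a client reading a file through Hadoop's Java FileSystem API; the path is illustrative. The open() call consults the NameNode for metadata and block locations, and the returned stream reads the actual bytes from the DataNodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);      // handle to the configured file system
    // open() asks the NameNode where the blocks live; reading the stream
    // pulls the data from DataNodes ("/user/demo/input.txt" is an example path).
    try (FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"))) {
      BufferedReader reader = new BufferedReader(new InputStreamReader(in));
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}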
MapReduce
Google's MapReduce Technique




MapReduce Overview
• Provides a clean abstraction for programmers to write
  distributed applications.
• Factors out many reliability concerns from
  application logic
• A batch data processing system
• Automatic parallelization & distribution
• Fault-tolerance
• Status and monitoring tools




Programming Model
• The programmer has to implement an interface of two functions:

  map (in_key, in_value) ->
      (out_key, intermediate_value) list

  reduce (out_key, intermediate_value list) ->
      out_value list
MapReduce Flow




Mapper (indexing example)
• Input is the line number and the actual line.

• Input 1: (“100”, “I Love India”)
• Output 1: (“I”, “100”), (“Love”, “100”), (“India”, “100”)

• Input 2: (“101”, “I Love eBay”)
• Output 2: (“I”, “101”), (“Love”, “101”), (“eBay”, “101”)
Reducer (indexing example)
• Input is a word and the line numbers it appears on.

• Input 1: (“I”, “100”, “101”)
• Input 2: (“Love”, “100”, “101”)
• Input 3: (“India”, “100”)
• Input 4: (“eBay”, “101”)

• Output: each word is stored along with its line numbers.
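As a concrete sketch (not from the original slides), the indexing example could be written against Hadoop's Java MapReduce API roughly as below. One caveat: with the standard TextInputFormat the map key is a byte offset rather than a line number, so offsets stand in for line numbers here; the class names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class LineIndex {
  // Emits (word, line-id) for every word on the input line. The LongWritable
  // key is the line's byte offset, used here as its identifier.
  public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      Text lineId = new Text(key.toString());
      for (String word : value.toString().split("\\s+")) {
        if (!word.isEmpty()) {
          context.write(new Text(word), lineId);
        }
      }
    }
  }

  // Concatenates all line ids seen for a word into one index entry,
  // e.g. ("Love", "100,101").
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> lineIds, Context context)
        throws IOException, InterruptedException {
      StringBuilder joined = new StringBuilder();
      for (Text id : lineIds) {
        if (joined.length() > 0) joined.append(",");
        joined.append(id.toString());
      }
      context.write(word, new Text(joined.toString()));
    }
  }
}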
Google PageRank example
• Mapper
  o Input is a link and the HTML content.
  o Output is a list of outgoing links and the pagerank of this page.

• Reducer
  o Input is a link and a list of pageranks of pages linking to this page.
  o Output is the pagerank of this page, which is the weighted average of all input pageranks.
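A deliberately simplified reducer sketch that mirrors the slide's description (real PageRank additionally applies a damping factor and divides each contribution by the linking page's out-degree); the class name is illustrative:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Averages the ranks flowing in from pages that link to this page, as the
// slide describes. This is a teaching sketch, not the full PageRank formula.
public class RankReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
  @Override
  protected void reduce(Text page, Iterable<DoubleWritable> inRanks, Context context)
      throws IOException, InterruptedException {
    double sum = 0;
    int count = 0;
    for (DoubleWritable rank : inRanks) {
      sum += rank.get();
      count++;
    }
    context.write(page, new DoubleWritable(count == 0 ? 0 : sum / count));
  }
}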
Contd.
• Limited atomicity and transaction support:
  o HBase supports batched mutations of single rows only.
  o Data is unstructured and untyped.
• Not accessed or manipulated via SQL:
  o Programmatic access via Java, REST, or Thrift APIs.
  o Scripting via JRuby.
Introduction of HBase
OVERVIEW
• HBase is an Apache open source project
  whose goal is to provide storage for the
  Hadoop Distributed Computing
  Environment.
• Data is logically organized into tables, rows
  and columns.




Outline
• Data Model

• Architecture and Implementation

• Examples & Tests




Conceptual View
• A data row has a sortable row key and an arbitrary number of columns.
• A Time Stamp is assigned automatically if not set manually.
• Columns are addressed as <family>:<label>.

Row key          | Time Stamp | Column “contents:” | Column “anchor:”
“com.apache.www” | t12        | “<html>…”          |
                 | t11        | “<html>…”          |
                 | t10        |                    | “anchor:apache.com” = “APACHE”
“com.cnn.www”    | t15        |                    | “anchor:cnnsi.com” = “CNN”
                 | t13        |                    | “anchor:my.look.ca” = “CNN.com”
                 | t6         | “<html>…”          |
                 | t5         | “<html>…”          |
                 | t3         | “<html>…”          |
HStore: Physical Storage View
• Physically, tables are stored on a per-column-family basis.
• Empty cells are not stored in a column-oriented storage format.
• Each column family is managed by an HStore, which holds a Memcache, a Data MapFile (key/values), and an Index MapFile (index keys).

HStore for “contents:”:
Row key          | TS  | Column “contents:”
“com.apache.www” | t12 | “<html>…”
                 | t11 | “<html>…”
“com.cnn.www”    | t6  | “<html>…”
                 | t5  | “<html>…”
                 | t3  | “<html>…”

HStore for “anchor:”:
Row key          | TS  | Column “anchor:”
“com.apache.www” | t10 | “anchor:apache.com” = “APACHE”
“com.cnn.www”    | t9  | “anchor:cnnsi.com” = “CNN”
                 | t8  | “anchor:my.look.ca” = “CNN.com”
Row Ranges: Regions
• Rows are sorted by row key and column ascending, timestamp descending.
• Physically, tables are broken into row ranges, called regions, that contain the rows from a start-key to an end-key.

[Diagram: a region holding row keys aaaa through aaae, each with columns under “contents:”/“anchor:” and timestamps descending from t15 to t3]
Outline
• Data Model
• Architecture and Implementation
• Examples & Tests
Three major components
• The HBaseMaster

• The HRegionServer

• The HBase client
Master: HBaseMaster
• Assigns regions to HRegionServers:
  1. The ROOT region locates all the META regions.
  2. A META region maps a number of user regions.
  3. User regions are assigned to the HRegionServers.
• Enables/disables tables and changes the table schema.
• Monitors the health of each HRegionServer.

[Diagram: one ROOT region locating META regions, which in turn map USER regions spread across the servers]
HBase Client
• To locate a row, the client follows the chain ROOT Region → META Region → User Region.
• The location information is cached by the client, so subsequent requests can go to the right HRegionServer directly.

[Diagram: HBase clients resolving ROOT Region → META Region → User Region, with the resolved locations cached]
Outline
• Data Model
• Architecture and Implementation
• Examples & Tests
Create MyTable
Target schema: Row Key | Timestamp | columnFamily1: | columnFamily2:

// Describe the two column families and the table, then create it.
HBaseAdmin admin = new HBaseAdmin(config);
HColumnDescriptor[] column = new HColumnDescriptor[2];
column[0] = new HColumnDescriptor("columnFamily1:");
column[1] = new HColumnDescriptor("columnFamily2:");
HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
desc.addFamily(column[0]);  // attach each family to the table descriptor
desc.addFamily(column[1]);
admin.createTable(desc);    // ask the master to create the table
Insert Values

// Batch two cell updates into row "myRow" and commit them together.
BatchUpdate batchUpdate = new BatchUpdate("myRow", timestamp);
batchUpdate.put("columnFamily1:labela", Bytes.toBytes("labela value"));
batchUpdate.put("columnFamily1:labelb", Bytes.toBytes("labelb value"));
table.commit(batchUpdate);

Resulting row:
Row Key | Timestamp | columnFamily1:
myRow   | ts1       | labela = “labela value”
myRow   | ts2       | labelb = “labelb value”
Search

Select value from table where key=‘com.apache.www’ AND label=‘anchor:apache.com’

Row key          | Time Stamp | Column “anchor:”
“com.apache.www” | t12        |
                 | t11        |
                 | t10        | “anchor:apache.com” = “APACHE”
“com.cnn.www”    | t9         | “anchor:cnnsi.com” = “CNN”
                 | t8         | “anchor:my.look.ca” = “CNN.com”
                 | t6         |
                 | t5         |
                 | t3         |
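For reference, a hedged sketch of this single-row lookup using the HBase Java client's Get API (a newer interface than the BatchUpdate style shown earlier); the table name "webtable" is assumed for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SearchExample {
  public static void main(String[] args) throws Exception {
    Configuration config = HBaseConfiguration.create();
    HTable table = new HTable(config, "webtable");  // assumed table name
    // Fetch one cell: row "com.apache.www", family "anchor", qualifier "apache.com".
    Get get = new Get(Bytes.toBytes("com.apache.www"));
    get.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("apache.com"));
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("apache.com"));
    System.out.println(Bytes.toString(value));      // expected: "APACHE"
    table.close();
  }
}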
Scanner

Select value from table where anchor=‘cnnsi.com’

Row key          | Time Stamp | Column “anchor:”
“com.apache.www” | t12        |
                 | t11        |
                 | t10        | “anchor:apache.com” = “APACHE”
“com.cnn.www”    | t9         | “anchor:cnnsi.com” = “CNN”
                 | t8         | “anchor:my.look.ca” = “CNN.com”
                 | t6         |
                 | t5         |
                 | t3         |
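A comparable hedged sketch of the scan with the newer client API (again assuming a table named "webtable"); only rows that actually have the "anchor:cnnsi.com" column come back:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScannerExample {
  public static void main(String[] args) throws Exception {
    Configuration config = HBaseConfiguration.create();
    HTable table = new HTable(config, "webtable");  // assumed table name
    // Restrict the scan to family "anchor", qualifier "cnnsi.com".
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"));
    ResultScanner scanner = table.getScanner(scan);
    for (Result row : scanner) {
      byte[] value = row.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"));
      // Expected output here: com.cnn.www -> CNN
      System.out.println(Bytes.toString(row.getRow()) + " -> " + Bytes.toString(value));
    }
    scanner.close();
    table.close();
  }
}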
PIG
A Programming Language for the Hadoop Framework




Introduction
• Pig was initially developed at Yahoo!
• The Pig programming language is designed to handle any kind of data – hence the name!
• Pig is made of two components:
  o The language itself, which is called Pig Latin.
  o The runtime environment, where Pig Latin programs are executed.
Why Pig Latin?
• MapReduce is very powerful, but:
  o It requires a Java programmer.
  o Users have to re-invent common functionality (join, filter, etc.).
• Pig Latin was introduced for non-Java programmers.
• Pig Latin is a data flow language, rather than a procedural or declarative one.
• User code and existing binaries can be included almost anywhere.
• Metadata is not required, but is used when available.
• Support for nested types.
• Operates on files in HDFS.
Pig Latin Overview
• Pig provides a higher-level language, Pig Latin, that:
  o Increases productivity: in one test, 10 lines of Pig Latin ≈ 200 lines of Java, and what took 4 hours to write in Java took 15 minutes in Pig Latin.
  o Opens the system to non-Java programmers.
  o Provides common operations like join, group, filter, and sort.
Load Data
• The objects that Hadoop works on are stored in HDFS.
• To access this data, the program must first tell Pig what file (or files) it will use.
• That's done through the LOAD ‘data_file’ command.
• If the data is stored in a file format that is not natively accessible to Pig, add the USING clause to the LOAD statement to specify a user-defined function that can read in and interpret the data.
Transform Data
• The transform logic is where all the data manipulation happens.
• For example:
  o FILTER out rows that are not of interest.
  o JOIN two sets of data files.
  o GROUP data to build aggregations.
  o ORDER results.
Example of a Pig Program
• This program loads a file composed of Twitter feeds, selects only those tweets with the en (English) iso_language code, groups them by the user who is tweeting, and displays the sum of the retweets of that user's tweets.

  L = LOAD 'hdfs://node/tweet_data';
  FL = FILTER L BY iso_language_code EQ 'en';
  G = GROUP FL BY from_user;
  RT = FOREACH G GENERATE group, SUM(retweets);
DUMP and STORE
• The DUMP or STORE command generates the results of a Pig program.
• The DUMP command sends the output to the screen, which is useful while debugging Pig programs.
• The DUMP command can be used anywhere in a program to dump intermediate result sets to the screen.
• The STORE command stores results from running programs in a file for further processing and analysis.
Pig Runtime Environment
• The Pig runtime is used when a Pig program needs to run in the Hadoop environment.
• There are three ways to run a Pig program (the Java option is sketched below):
  o Embedded in a script.
  o Embedded in a Java program.
  o From the Pig command line, called Grunt.
• The Pig runtime environment translates the program into a set of map and reduce tasks and runs them.
• This greatly simplifies the work associated with the analysis of large amounts of data.
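A minimal, hypothetical sketch of the "embedded in a Java program" option using Pig's PigServer API, re-running the Twitter example from the earlier slide; the path, the schema in the AS clause, and the output directory name are all assumptions for illustration:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPig {
  public static void main(String[] args) throws Exception {
    // Run against the Hadoop cluster (ExecType.LOCAL would run locally instead).
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    // The field names and types below are an assumed schema for the tweet file.
    pig.registerQuery("L = LOAD 'hdfs://node/tweet_data' AS "
        + "(from_user:chararray, iso_language_code:chararray, retweets:long);");
    pig.registerQuery("FL = FILTER L BY iso_language_code == 'en';");
    pig.registerQuery("G = GROUP FL BY from_user;");
    pig.registerQuery("RT = FOREACH G GENERATE group, SUM(FL.retweets);");
    pig.store("RT", "retweet_counts");  // like STORE in a script
    pig.shutdown();
  }
}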
What is Pig used for?
• Web log processing.
• Data processing for web search platforms.
• Ad hoc queries across large data sets.
• Rapid prototyping of algorithms for processing large data sets.
Hadoop@BIG
Statistics of Hadoop use at giant organizations
Hadoop@Facebook
• Production cluster
   o   4800 cores, 600 machines, 16GB per machine – April 2009
   o   8000 cores, 1000 machines, 32 GB per machine – July 2009
   o   4 SATA disks of 1 TB each per machine
   o   2 level network hierarchy, 40 machines per rack
   o   Total cluster size is 2 PB, projected to be 12 PB in Q3 2009

• Test cluster
   o 800 cores, 16 GB each




Hadoop@Yahoo
• World's largest Hadoop production application.
• The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores.
• Biggest contributor to Hadoop.
• Converting all its batch processing to Hadoop.
Hadoop@Amazon
• Hadoop can be run on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
• The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw TIFF image data (stored in S3) into 11 million finished PDFs in the space of 24 hours, at a computation cost of about $240.
• Amazon Elastic MapReduce is a new web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework.
Thank You



