Anatomy of distributed computing with Hadoop
What is Hadoop?
   Hadoop started out as a subproject of Nutch, created by
    Doug Cutting

   Hadoop boosted Nutch’s scalability

   Enhanced by Yahoo!, it became an Apache top-level
    project

   System for distributed big data processing
       Big data means terabytes, petabytes and more

       Exabyte- and zettabyte-scale datasets?
Why would anyone need Hadoop?
Hadoop use cases
Hadoop basics
 Implements the MapReduce model from Google’s whitepaper:
   http://research.google.com/archive/mapreduce.html



 Hadoop is a combination of:
       HDFS         Storage
       MapReduce    Computation
HDFS
Hadoop Distributed File System
   It’s a file system
    bin/hadoop dfs <command> <options>



                   <command>
cat              expunge         put
chgrp            get             rm
chmod            getmerge        rmr
chown            ls              setrep
copyFromLocal    lsr             stat
copyToLocal      mkdir           tail
cp               moveFromLocal   test
du               moveToLocal     text
dus              mv              touchz
Hadoop Distributed File System
   It’s accessible
Hadoop Distributed File System
   It’s distributed
   It employs a master/slave architecture
Hadoop Distributed File System
   Name Node:
    Stores file system metadata

   Secondary Name Node(s):
    Periodically merges the file system image with the edit log

   Data Node(s):
    Stores actual data (blocks)
    Allows data to be replicated
MapReduce
      A programming model for distributed data
       processing

      Its data processing primitives are functions:
             Mappers and Reducers
MapReduce

    To decompose a problem for MapReduce, think of the data in
    terms of keys and values:

<key, value>
<user id, user profile>
<timestamp, apache log entry>
<tag, list of tagged images>
MapReduce
 Mapper
 Function that takes a key and a value and emits
 zero or more new key-value pairs

 Reducer
 Function that takes a key and all of its “mapped”
 values and emits zero or more new key-value
 pairs
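
The Mapper and Reducer contract can be modelled in a single process, without Hadoop. The sketch below is an illustration only: MiniMapReduce, run and wordCount are hypothetical names, not Hadoop classes, and the reducer here returns exactly one value per key, whereas a real Reducer may emit zero or more pairs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Hypothetical single-process model of the MapReduce contract; not the Hadoop API.
class MiniMapReduce {

    // Applies the mapper to every input pair, groups the emitted values by key,
    // then applies the reducer once per key to that key's values.
    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, V3> run(
            List<Map.Entry<K1, V1>> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> mapper,
            BiFunction<K2, List<V2>, V3> reducer) {
        Map<K2, List<V2>> grouped = new TreeMap<>(); // TreeMap keeps keys sorted
        for (Map.Entry<K1, V1> record : input) {
            for (Map.Entry<K2, V2> pair : mapper.apply(record.getKey(), record.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<K2, V3> output = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> group : grouped.entrySet()) {
            output.put(group.getKey(), reducer.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    // Example job: count how often each word occurs across all lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<Integer, String>> input = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            input.add(new SimpleEntry<>(i, lines.get(i))); // key = line number
        }
        BiFunction<Integer, String, List<Map.Entry<String, Integer>>> mapper = (lineNo, line) -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(new SimpleEntry<>(word, 1)); // emit <word, 1>
                }
            }
            return out;
        };
        BiFunction<String, List<Integer>, Integer> reducer = (word, ones) -> {
            int sum = 0;
            for (int one : ones) {
                sum += one;
            }
            return sum;
        };
        return run(input, mapper, reducer);
    }
}
```

Calling MiniMapReduce.wordCount with a list of lines returns a map from each word to its count. Everything runs in one JVM here, so the sketch shows the data flow but none of the distribution.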
MapReduce example
 “Hello World” for Hadoop:
       http://wiki.apache.org/hadoop/WordCount


 “Tag Cloud” example for Hadoop:

 tag1 tag2 tag3
 tag1 tag3        weight(tagi)
 tag3
 tag4 tag5 tag6
Tag Cloud example
   Input is taggable content (images, posts,
    videos) with space-separated tags:
    <posti, “tag1 tag2 ... tagn”>

   Output is each tagi with its count, plus the total number of tags:
    <tagi, tag count>
    <total tags, total tags count>

   Results:
    weight(tagi)=tagi count/total tags
    font(tagi)=fn(weight(tagi))
Tag Cloud Mapper
    Mapper extends class:
    org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

   Mapper input:
       <post1, “tag1 tag3”>
       <post2, “tag3”>
       <post3, “tag2 tag3 tag4”>
       <post4, “tag1 tag2 tag3”>

                   simplify model & make line number a key

       <line1, “tag1 tag3”>
       <line2, “tag3”>
       <line3, “tag2 tag3 tag4”>
       <line4, “tag1 tag2 tag3”>

                    write raw tags to input file
Tag Cloud Mapper
   Mapper input:
   <0, “tag1 tag3”>
   <1, “tag3”>
   <2, “tag2 tag3 tag4”>
   <3, “tag1 tag2 tag3”>

   Read values (tags) from the file; the line number is the key.
   Each value is a string of space-separated tags, e.g. “tag1 tag3”:

   String line = value.toString();
   StringTokenizer tokenizer = new StringTokenizer(line, " ");
   // emit the number of tags on this line under the shared "total tags" key
   context.write(TOTAL_TAGS_KEY, new IntWritable(tokenizer.countTokens()));
   while (tokenizer.hasMoreTokens()) {
       Text tag = new Text(tokenizer.nextToken());
       context.write(tag, new IntWritable(1)); // emit <tag, 1> as intermediate map output
   }

   Mapper output (via context.write()):
   <“total tags”, 2>
   <“tag1”, 1>
   <“tag3”, 1>

   <“total tags”, 1>
   <“tag3”, 1>

   <“total tags”, 3>
   <“tag2”, 1>
   <“tag3”, 1>
   <“tag4”, 1>

   <“total tags”, 3>
   <“tag1”, 1>
   <“tag2”, 1>
   <“tag3”, 1>
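
To see what the map step emits without a cluster, the same tokenizing logic can be run as plain Java. This is a sketch under assumptions: TagCloudMapSketch and its map method are hypothetical stand-ins, and String and Integer replace Hadoop's Text and IntWritable.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java stand-in for the Tag Cloud map step; names are hypothetical.
class TagCloudMapSketch {
    static final String TOTAL_TAGS_KEY = "total tags";

    // Emits <"total tags", n> for the line, then <tag, 1> for each of its n tags.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line, " ");
        // countTokens() must be read before the tokens are consumed.
        emitted.add(new SimpleEntry<>(TOTAL_TAGS_KEY, tokenizer.countTokens()));
        while (tokenizer.hasMoreTokens()) {
            emitted.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return emitted;
    }

    public static void main(String[] args) {
        // The four input lines used in the slides.
        String[] lines = {"tag1 tag3", "tag3", "tag2 tag3 tag4", "tag1 tag2 tag3"};
        for (String line : lines) {
            System.out.println(map(line));
        }
    }
}
```

Each call to map handles one input line, which is how the framework invokes a Mapper: once per record, with no state shared between lines.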
Reducer phases
   1. Shuffle or Copy phase:
    Copies the output of each Mapper to the Reducer’s local file system

   2. Sort phase:
    Sorts the Mapper output by key. This becomes the Reducer input

           Mapper output:                          Reducer input (after shuffle & sort by key):
           <“total tags”, 2>                       <“tag1”, 1>
           <“tag1”, 1>                             <“tag1”, 1>
           <“tag3”, 1>
                                                   <“tag2”, 1>
           <“total tags”, 1>                       <“tag2”, 1>
           <“tag3”, 1>
                                                   <“tag3”, 1>
           <“total tags”, 3>                       <“tag3”, 1>
           <“tag2”, 1>                             <“tag3”, 1>
           <“tag3”, 1>                             <“tag3”, 1>
           <“tag4”, 1>
                                                   <“tag4”, 1>
           <“total tags”, 3>
           <“tag1”, 1>                             <“total tags”, 2>
           <“tag2”, 1>                             <“total tags”, 1>
           <“tag3”, 1>                             <“total tags”, 3>
                                                   <“total tags”, 3>

   3. Reduce or Emit phase:
    Calls reduce() once for each key, passing that key’s group of sorted values
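
The effect of shuffle and sort can be imitated in memory by grouping pairs under a sorted map. The sketch below is an assumption-laden illustration: ShuffleSortSketch is a hypothetical name, the pairs are passed as two parallel arrays for brevity, and a real job moves this data between machines rather than inside one JVM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of shuffle and sort; not a Hadoop class.
class ShuffleSortSketch {

    // Groups the values of all mapper outputs by key, with keys in sorted order.
    static Map<String, List<Integer>> shuffleAndSort(String[] keys, int[] values) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            grouped.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Mapper output for the four input lines, in emission order.
        String[] keys = {
            "total tags", "tag1", "tag3",
            "total tags", "tag3",
            "total tags", "tag2", "tag3", "tag4",
            "total tags", "tag1", "tag2", "tag3"};
        int[] values = {2, 1, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1};
        System.out.println(shuffleAndSort(keys, values));
    }
}
```

Keys come out in sorted order because of the TreeMap. The values within a group stay in emission order here, which Hadoop itself does not promise.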
Tag Cloud Reduce phase
  Reducer extends class:
org.apache.hadoop.mapreduce.Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

  Reducer input (pairs grouped by tagi):
  <“tag1”, 1>
  <“tag1”, 1>

  <“tag2”, 1>
  <“tag2”, 1>

  <“tag3”, 1>
  <“tag3”, 1>
  <“tag3”, 1>
  <“tag3”, 1>

  <“tag4”, 1>

  <“total tags”, 2>
  <“total tags”, 1>
  <“total tags”, 3>
  <“total tags”, 3>

  reduce() receives one group at a time, e.g. [<“tag1”, 1>, <“tag1”, 1>]:

  int tagsCount = 0;
  for (IntWritable value : values) {
      tagsCount += value.get(); // sum the counts emitted for this key
  }
  context.write(key, new IntWritable(tagsCount));

  Reducer output (via context.write()):
  <tag1, 2>
  <tag2, 2>
  <tag3, 4>
  <tag4, 1>
  <total tags, 9>
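
The summing reduce step can be checked by hand with plain collections. In this sketch TagCloudReduceSketch, reduce and reduceAll are hypothetical names, and Integer stands in for IntWritable. It only mirrors the loop in the reduce() body and assumes the input is already grouped by key.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the Tag Cloud reduce step; not a Hadoop class.
class TagCloudReduceSketch {

    // Sums all values that share one key.
    static int reduce(String key, Iterable<Integer> values) {
        int tagsCount = 0;
        for (int value : values) {
            tagsCount += value;
        }
        return tagsCount;
    }

    // Calls reduce() once per group, keeping the order of the grouped input.
    static Map<String, Integer> reduceAll(Map<String, List<Integer>> grouped) {
        Map<String, Integer> output = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        grouped.put("tag1", Arrays.asList(1, 1));
        grouped.put("tag2", Arrays.asList(1, 1));
        grouped.put("tag3", Arrays.asList(1, 1, 1, 1));
        grouped.put("tag4", Arrays.asList(1));
        grouped.put("total tags", Arrays.asList(2, 1, 3, 3));
        System.out.println(reduceAll(grouped));
    }
}
```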
Tag Cloud Output
    Reducer output is a list of counts:
    <tag1, 2>
    <tag2, 2>
    <tag3, 4>
    <tag4, 1>
    <total tags, 9>

   Tag’s weight:
    weight(tagi)=tagi count/total tags

    <weight(tag1), 2/9>
    <weight(tag2), 2/9>
    <weight(tag3), 4/9>
    <weight(tag4), 1/9>

   Size of font:
    font(tagi)=fn(weight(tagi))
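
The weight formula translates directly into code. The slides leave fn unspecified, so the linear scaling in fontSizePx below is only one possible choice, and the pixel bounds are made-up parameters. TagWeightSketch and its method names are likewise hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Post-processing of the reducer output; names and the font formula are assumptions.
class TagWeightSketch {
    static final String TOTAL_TAGS_KEY = "total tags";

    // weight(tag) = tag count / total tags, for every key except the total itself.
    // Assumes the "total tags" entry is present in the reducer output.
    static Map<String, Double> weights(Map<String, Integer> reducerOutput) {
        int total = reducerOutput.get(TOTAL_TAGS_KEY);
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : reducerOutput.entrySet()) {
            if (!entry.getKey().equals(TOTAL_TAGS_KEY)) {
                result.put(entry.getKey(), entry.getValue() / (double) total);
            }
        }
        return result;
    }

    // One possible fn: scale linearly from minPx (weight 0) to maxPx (weight 1).
    static long fontSizePx(double weight, int minPx, int maxPx) {
        return Math.round(minPx + weight * (maxPx - minPx));
    }
}
```

With the counts above, weights gives tag3 a weight of 4/9. A real tag cloud would probably scale relative to the smallest and largest weights actually present, not to the fixed 0 to 1 range used here.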
Between Map and Reduce
   Combiner:
     extends class
    org.apache.hadoop.mapreduce.Reducer
     works as an in-memory Reducer on a single Mapper’s output
     serves as an additional optimization (less data to shuffle)

     Mapper output:
     <“total tags”, 2>
     <“total tags”, 1>
     <“tag1”, 1>
     <“tag1”, 1>
     <“tag3”, 1>

     Combiner output (in-memory combine):
     <“total tags”, 3>
     <“tag1”, 2>
     <“tag3”, 1>

   Partitioner:
     extends class
    org.apache.hadoop.mapreduce.Partitioner
     assigns each intermediate <key, value> pair from the
    Mapper to its designated Reducer partition
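
Both helpers can be sketched in a few lines of plain Java. CombinePartitionSketch is a hypothetical name. The combiner shown is the summing kind used in this example. The partition formula follows the one used by Hadoop's default HashPartitioner, applied here to String hash codes as a stand-in for Writable keys.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of a summing combiner and hash partitioning; not Hadoop classes.
class CombinePartitionSketch {

    // Combiner: pre-sums the values for each key within one mapper's output,
    // so fewer pairs have to be copied to the reducers.
    static Map<String, Integer> combine(String[] keys, int[] values) {
        Map<String, Integer> combined = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            combined.merge(keys[i], values[i], Integer::sum);
        }
        return combined;
    }

    // Partitioner: maps a key to one of numReduceTasks partitions.
    // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the partition depends only on the key, every pair with the same key reaches the same reducer. Summing is safe to apply early in a combiner because addition is associative and commutative; not every reduce function has that property.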
Time for a Workshop
                                     Standalone mode
   Build “Tag Cloud” project jar:
cd $TAG_CLOUD_HOME
mvn clean install

  Check input directory:
$HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/input/

  Check input file:
$HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/input/tags01

  Submit TagCloudJob to Hadoop:
$HADOOP_HOME/bin/hadoop jar $TAG_CLOUD_HOME/target/tagcloud-1.0.jar \
    com.altoros.rnd.hadoop.tagcloud.TagCloudJob $TAG_CLOUD_HOME/input \
    $TAG_CLOUD_HOME/output

  Check output directory:
$HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/output/

  Check output file:
$HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/output/part-r-00000
Apache Pig
   Higher-level data processing layer on top
    of Hadoop
   Data-flow oriented language (pig scripts)
   Data types include sets, associative
    arrays, tuples
   Developed at Yahoo!
Apache Hive
   Feature set is similar to Pig
   SQL-like data warehouse infrastructure
   Language is closer to standard SQL than Pig’s
   Supports SELECT, JOIN, GROUP BY, etc.
   Developed at Facebook
Apache HBase
    Column-store database (after Google
     BigTable model)
    HDFS is an underlying file system
    Holds extremely large datasets (multiple terabytes)
    Constrained access model
Apache Mahout
     Scalable machine learning algorithms on
      top of Hadoop:
     – filtering,
     – recommendations,
     – classifiers,
     – clustering
Apache ZooKeeper
     Common services for distributed
      applications:
      - group services,
      - configuration management,
      - naming services,
      - synchronization
Oozie
   Workflow engine for Hadoop
   Orchestrates dependencies between
    jobs running on Hadoop (including HDFS,
    Pig and MapReduce)
   Another query processing API
   Developed at Yahoo!
Apache Chukwa
    System for reliable large-scale log
     collection
    Tools for displaying, monitoring and analyzing results
    Built on top of the Hadoop Distributed File
     System (HDFS) and Map/Reduce
    Incubated at apache.org
Questions

             links:
http://www.slideshare.net/tazija/anatomy-of-distributed-computing-with-hadoop
https://github.com/tazija/TagCloud

             skype: siarhei_bushyk
             mailto: tazija@gmail.com
             mailto: sergey.bushik@altoros.com

Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Anatomy of distributed computing with Hadoop

  • 2. What is Hadoop?
      - Hadoop started out as a subproject of Nutch, by Doug Cutting
      - Hadoop boosted Nutch’s scalability
      - Enhanced by Yahoo!, it became an Apache top-level project
      - A system for distributed big data processing
      - Big data means terabytes, petabytes, and more
      - Exabyte and zettabyte datasets?
  • 7. Hadoop basics
      - Implements Google’s whitepaper: http://research.google.com/archive/mapreduce.html
      - Hadoop is a combination of:
          HDFS: storage
          MapReduce: computation
  • 8. HDFS: Hadoop Distributed File System
      - It’s a file system:
          bin/hadoop dfs <command> <options>
      - <command> is one of: cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, moveToLocal, mv, put, rm, rmr, setrep, stat, tail, test, text, touchz
  • 9. Hadoop Distributed File System
      - It’s accessible
  • 10. Hadoop Distributed File System
      - It’s distributed
      - It employs a master/slave architecture
  • 11. Hadoop Distributed File System
      - Name Node: stores file system metadata
      - Secondary Name Node(s): periodically merge the file system image
      - Data Node(s): store the actual data (blocks) and allow data to be replicated
  • 12. MapReduce
      - A programming model for distributed data processing
      - Its data processing primitives are functions: Mappers and Reducers
  • 13. MapReduce
      - To decompose a problem for MapReduce, think of data in terms of keys and values:
          <key, value>
          <user id, user profile>
          <timestamp, apache log entry>
          <tag, list of tagged images>
  • 14. MapReduce
      - Mapper: a function that takes a key and a value and emits zero or more keys and values
      - Reducer: a function that takes a key and all “mapped” values for that key and emits zero or more new keys and values
  • 15. MapReduce example
      - “Hello World” for Hadoop: http://wiki.apache.org/hadoop/WordCount
      - “Tag Cloud” example for Hadoop
          [Figure: a tag cloud of tag1 to tag6, each tag sized by weight(tag_i)]
  • 16. Tag Cloud example
      - Input is taggable content (images, posts, videos) with space-separated tags:
          <post_i, “tag1 tag2 … tagn”>
      - Output is each tag_i with its count, plus the total number of tags:
          <tag_i, tag count>
          <total tags, total tags count>
      - Results:
          weight(tag_i) = count(tag_i) / total tags
          font(tag_i) = fn(weight(tag_i))
  • 17. Tag Cloud Mapper
      - The Mapper extends the class org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
      - Mapper input:
          <post1, “tag1 tag3”>
          <post2, “tag3”>
          <post3, “tag2 tag3 tag4”>
          <post4, “tag1 tag2 tag3”>
      - Simplify the model and make the line number the key:
          <line1, “tag1 tag3”>
          <line2, “tag3”>
          <line3, “tag2 tag3 tag4”>
          <line4, “tag1 tag2 tag3”>
      - Write the raw tags to the input file
  • 18. Tag Cloud Mapper
      - Mapper input (values are tags read from the file; the line number is the key):
          <0, “tag1 tag3”>
          <1, “tag3”>
          <2, “tag2 tag3 tag4”>
          <3, “tag1 tag2 tag3”>
      - Mapper code:
          // value holds one input line of space-separated tags, e.g. "tag1 tag3"
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line, " ");
          // emit the number of tags on this line under the shared "total tags" key
          context.write(TOTAL_TAGS_KEY, new IntWritable(tokenizer.countTokens()));
          while (tokenizer.hasMoreTokens()) {
              Text tag = new Text(tokenizer.nextToken());
              // emit <tag, 1> as intermediate map output
              context.write(tag, new IntWritable(1));
          }
      - Mapper output (written with context.write()):
          <“total tags”, 2> <“tag1”, 1> <“tag3”, 1>
          <“total tags”, 1> <“tag3”, 1>
          <“total tags”, 3> <“tag2”, 1> <“tag3”, 1> <“tag4”, 1>
          <“total tags”, 3> <“tag1”, 1> <“tag2”, 1> <“tag3”, 1>
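The mapper logic on this slide can be tried without a Hadoop installation. The following is a minimal plain-Java sketch, not the TagCloud project's actual code: the class name `TagCloudMapSketch` and the use of `Map.Entry<String, Integer>` in place of Hadoop's `Text`/`IntWritable` pairs are assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for the Tag Cloud mapper; no Hadoop classes involved.
final class TagCloudMapSketch {
    static final String TOTAL_TAGS_KEY = "total tags";

    // Takes one input line and returns the <key, count> pairs a mapper would emit.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line, " ");
        // First pair: how many tags this line contributes to the grand total.
        out.add(new SimpleEntry<>(TOTAL_TAGS_KEY, tokenizer.countTokens()));
        // Then one <tag, 1> pair per tag on the line.
        while (tokenizer.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : Arrays.asList("tag1 tag3", "tag3", "tag2 tag3 tag4", "tag1 tag2 tag3")) {
            System.out.println(map(line));
        }
    }
}
```

Returning a list instead of calling `context.write()` keeps the sketch self-contained; in a real job the framework collects the emitted pairs.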
  • 19. Reducer phases
      - 1. Shuffle (copy) phase: copies Mapper output to the Reducer’s local file system
      - 2. Sort phase: sorts Mapper output by key; the sorted output becomes Reducer input
          Mapper output:
            <“total tags”, 2> <“tag1”, 1> <“tag3”, 1>
            <“total tags”, 1> <“tag3”, 1>
            <“total tags”, 3> <“tag2”, 1> <“tag3”, 1> <“tag4”, 1>
            <“total tags”, 3> <“tag1”, 1> <“tag2”, 1> <“tag3”, 1>
          Reducer input (after shuffle & sort by key):
            <“tag1”, 1> <“tag1”, 1>
            <“tag2”, 1> <“tag2”, 1>
            <“tag3”, 1> <“tag3”, 1> <“tag3”, 1> <“tag3”, 1>
            <“tag4”, 1>
            <“total tags”, 2> <“total tags”, 1> <“total tags”, 3> <“total tags”, 3>
      - 3. Reduce (emit) phase: calls reduce() once for each group of sorted <key, values> input
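The effect of the shuffle and sort phases can be sketched in memory as grouping pairs by key in sorted key order. This is only an illustration under stated assumptions: the class `ShuffleSortSketch` and its helper `pair` are hypothetical, a `TreeMap` stands in for the framework's sort, and no data is actually copied between nodes.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory imitation of "shuffle & sort by key".
final class ShuffleSortSketch {
    // Groups all values under their key; TreeMap keeps keys in sorted order.
    static SortedMap<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> mapperOutput) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Small helper to build a <key, value> pair.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        // The mapper output listed on the slide, in emission order.
        List<Map.Entry<String, Integer>> mapperOutput = Arrays.asList(
            pair("total tags", 2), pair("tag1", 1), pair("tag3", 1),
            pair("total tags", 1), pair("tag3", 1),
            pair("total tags", 3), pair("tag2", 1), pair("tag3", 1), pair("tag4", 1),
            pair("total tags", 3), pair("tag1", 1), pair("tag2", 1), pair("tag3", 1));
        // Each map entry corresponds to one reduce() call: a key plus all of its values.
        System.out.println(shuffleAndSort(mapperOutput));
    }
}
```

Plain string ordering places "tag1" through "tag4" before "total tags", which matches the Reducer input order shown on the slide.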
  • 20. Tag Cloud Reduce phase
      - The Reducer extends the class org.apache.hadoop.mapreduce.Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
      - Reducer input (pairs grouped by tag_i, e.g. [<“tag1”, 1>, <“tag1”, 1>]):
          <“tag1”, 1> <“tag1”, 1>
          <“tag2”, 1> <“tag2”, 1>
          <“tag3”, 1> <“tag3”, 1> <“tag3”, 1> <“tag3”, 1>
          <“tag4”, 1>
          <“total tags”, 2> <“total tags”, 1> <“total tags”, 3> <“total tags”, 3>
      - Reducer code:
          // sum all counts that arrived for this key
          int tagsCount = 0;
          for (IntWritable value : values) {
              tagsCount += value.get();
          }
          // emit <key, total count>
          context.write(key, new IntWritable(tagsCount));
      - Reducer output (written with context.write()):
          <tag1, 2>
          <tag2, 2>
          <tag3, 4>
          <tag4, 1>
          <total tags, 9>
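The reduce step is a per-key sum, which the following plain-Java sketch reproduces on the grouped input from this slide. The class name `TagCloudReduceSketch` and the use of `Integer` lists instead of `IntWritable` iterables are assumptions for illustration; the real Reducer writes through `context.write()` rather than returning a value.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Tag Cloud reducer; no Hadoop classes involved.
final class TagCloudReduceSketch {
    // Sums every count received for one key. The key is unused here and
    // is kept only to mirror the shape of a reduce(key, values) call.
    static int reduce(String key, Iterable<Integer> values) {
        int tagsCount = 0;
        for (int value : values) {
            tagsCount += value;
        }
        return tagsCount;
    }

    public static void main(String[] args) {
        // Grouped reducer input as listed on the slide.
        Map<String, List<Integer>> reducerInput = new LinkedHashMap<>();
        reducerInput.put("tag1", Arrays.asList(1, 1));
        reducerInput.put("tag2", Arrays.asList(1, 1));
        reducerInput.put("tag3", Arrays.asList(1, 1, 1, 1));
        reducerInput.put("tag4", Arrays.asList(1));
        reducerInput.put("total tags", Arrays.asList(2, 1, 3, 3));
        for (Map.Entry<String, List<Integer>> group : reducerInput.entrySet()) {
            System.out.println("<" + group.getKey() + ", " + reduce(group.getKey(), group.getValue()) + ">");
        }
    }
}
```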
  • 21. Tag Cloud Output
      - Reducer output is a weighted list:
          <tag1, 2>
          <tag2, 2>
          <tag3, 4>
          <tag4, 1>
          <total tags, 9>
      - Tag’s weight: weight(tag_i) = count(tag_i) / total tags
          <weight(tag1), 2/9>
          <weight(tag2), 2/9>
          <weight(tag3), 4/9>
          <weight(tag4), 1/9>
      - Size of font: font(tag_i) = fn(weight(tag_i))
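The weight formula can be applied to the reducer output in a few lines. In this sketch the class `TagWeightSketch` is hypothetical, and so is the font function: the slides leave fn unspecified, so a linear mapping between an assumed minimum and maximum pixel size is used purely as an example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical post-processing of the reducer output into tag weights.
final class TagWeightSketch {
    static final String TOTAL_TAGS_KEY = "total tags";

    // weight(tag) = count(tag) / total tags; the total itself is left out of the result.
    // Assumes the "total tags" entry is present and non-zero.
    static Map<String, Double> weights(Map<String, Integer> reducerOutput) {
        int total = reducerOutput.get(TOTAL_TAGS_KEY);
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : reducerOutput.entrySet()) {
            if (!e.getKey().equals(TOTAL_TAGS_KEY)) {
                result.put(e.getKey(), e.getValue() / (double) total);
            }
        }
        return result;
    }

    // One possible fn (an assumption): linear interpolation between minPx and maxPx.
    static long fontSizePx(double weight, int minPx, int maxPx) {
        return Math.round(minPx + weight * (maxPx - minPx));
    }

    public static void main(String[] args) {
        Map<String, Integer> reducerOutput = new LinkedHashMap<>();
        reducerOutput.put("tag1", 2);
        reducerOutput.put("tag2", 2);
        reducerOutput.put("tag3", 4);
        reducerOutput.put("tag4", 1);
        reducerOutput.put(TOTAL_TAGS_KEY, 9);
        for (Map.Entry<String, Double> e : weights(reducerOutput).entrySet()) {
            System.out.println(e.getKey() + " weight=" + e.getValue()
                + " font=" + fontSizePx(e.getValue(), 10, 40) + "px");
        }
    }
}
```

Because the weights sum to 1, a linear fn keeps font sizes proportional to tag frequency; a logarithmic fn would be a reasonable alternative when a few tags dominate.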
  • 22. Between Map and Reduce
      - Combiner:
          - extends the class org.apache.hadoop.mapreduce.Reducer
          - works as an in-memory Reducer on the Mapper’s output
          - serves as an additional optimization
          Mapper output: <“total tags”, 2> <“tag1”, 1> <“tag1”, 1> <“tag3”, 1>
          Combiner output (in-memory combine): <“total tags”, 3> <“tag1”, 2> <“tag3”, 1>
      - Partitioner:
          - extends the class org.apache.hadoop.mapreduce.Partitioner
          - assigns each intermediate <key, value> pair from the Mapper to its designated Reducer partition
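Both steps can be imitated in plain Java. This is a sketch under assumptions, not Hadoop's code: the class `CombinePartitionSketch` is hypothetical, the combiner is modeled as a local per-key sum over a single mapper's output, and the partition function uses `String.hashCode()` with a non-negative modulo, which is the general shape of hash partitioning (Hadoop's own key types compute their hashes differently).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical imitation of the combine and partition steps.
final class CombinePartitionSketch {
    // Combiner: pre-aggregates one mapper's output so fewer pairs cross the network.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> oneMapperOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : oneMapperOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    // Partitioner: maps a key to a reducer index in [0, numReduceTasks).
    // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Illustrative output of one mapper for the line "tag1 tag1 tag3".
        List<Map.Entry<String, Integer>> mapperOutput = Arrays.asList(
            new SimpleEntry<>("total tags", 3),
            new SimpleEntry<>("tag1", 1),
            new SimpleEntry<>("tag1", 1),
            new SimpleEntry<>("tag3", 1));
        Map<String, Integer> combined = combine(mapperOutput);
        System.out.println(combined);
        for (String key : combined.keySet()) {
            System.out.println(key + " -> reducer " + partition(key, 2));
        }
    }
}
```

Summation is safe to use as a combiner because it is associative and commutative, so combining partial sums on the map side does not change the Reducer's final result. Equal keys always land in the same partition, which is what lets a single reduce() call see every value for its key.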
  • 23. Time for a Workshop (standalone mode)
      - Build the “Tag Cloud” project jar:
          cd $TAG_CLOUD_HOME
          mvn clean install
      - Check the input directory:
          $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/input/
      - Check the input file:
          $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/input/tags01
      - Submit TagCloudJob to Hadoop:
          $HADOOP_HOME/bin/hadoop jar $TAG_CLOUD_HOME/target/tagcloud-1.0.jar com.altoros.rnd.hadoop.tagcloud.TagCloudJob $TAG_CLOUD_HOME/input $TAG_CLOUD_HOME/output
      - Check the output directory:
          $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/output/
      - Check the output file:
          $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/output/part-r-00000
  • 24. Apache Pig
      - Higher-level data processing layer on top of Hadoop
      - Data-flow oriented language (Pig scripts)
      - Data types include sets, associative arrays, tuples
      - Developed at Yahoo!
  • 25. Apache Hive
      - Feature set is similar to Pig
      - SQL-like data warehouse infrastructure
      - Language is more strictly SQL
      - Supports SELECT, JOIN, GROUP BY, etc.
      - Developed at Facebook
  • 26. Apache HBase
      - Column-store database (after the Google BigTable model)
      - HDFS is the underlying file system
      - Holds extremely large datasets (multi-TB)
      - Constrained access model
  • 27. Apache Mahout
      - Scalable machine learning algorithms on top of Hadoop: filtering, recommendations, classifiers, clustering
  • 28. Apache ZooKeeper
      - Common services for distributed applications: group services, configuration management, naming services, synchronization
  • 29. Oozie
      - Workflow engine for Hadoop
      - Orchestrates dependencies between jobs running on Hadoop (including HDFS, Pig and MapReduce)
      - Another query processing API
      - Developed at Yahoo!
  • 30. Apache Chukwa
      - System for reliable large-scale log collection
      - Displaying, monitoring and analyzing results
      - Built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce
      - Incubated at apache.org
  • 31. Questions
      - links:
          http://www.slideshare.net/tazija/anatomy-of-distributed-computing-with-hadoop
          https://github.com/tazija/TagCloud
      - skype: siarhei_bushyk
      - mailto: tazija@gmail.com
      - mailto: sergey.bushik@altoros.com

Editor's notes

  1. DataNodes are constantly reporting to the NameNode. Blocks are stored on the Data Nodes.
  2. Standalone operation mode:
       1. export TAG_CLOUD_HOME=/Users/tazija/Projects/hadoop/tagcloud
       2. export HADOOP_HOME=/Users/tazija/Programs/apache-hadoop-0.23.0
       3. cd $TAG_CLOUD_HOME
       4. mvn clean install
       5. $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/input
          The input directory is $TAG_CLOUD_HOME/input.
       6. $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/input/tags01
          We use the InputFormat for plain text files. Files are broken into lines. Either a linefeed or a carriage return signals the end of a line. Keys are the position in the file, and values are the line of text.
       7. $HADOOP_HOME/bin/hadoop jar $TAG_CLOUD_HOME/target/tagcloud-1.0.jar com.altoros.rnd.hadoop.tagcloud.TagCloudJob $TAG_CLOUD_HOME/input $TAG_CLOUD_HOME/output
     Distributed mode:
       1. /etc/hadoop/hdfs-site.xml
          <configuration>
            <property>
              <name>dfs.namenode.name.dir</name>
              <value>file:/Users/tazija/Programs/apache-hadoop-0.23.0/data/hdfs/namenode</value>
            </property>
            <property>
              <name>dfs.replication</name>
              <value>1</value>
            </property>
          </configuration>
       2. Format the file system:
          bin/hadoop namenode -format
       3. Start the daemons:
          ./sbin/hadoop-daemon.sh start namenode
          ./sbin/hadoop-daemon.sh start datanode
          ./sbin/hadoop-daemon.sh start secondarynamenode
       4. Check HDFS status:
          http://localhost:50070/