SlideShare ist ein Scribd-Unternehmen logo
1 von 13
Downloaden Sie, um offline zu lesen
Weekly meeting
         Week 5


    Le Duc Tung

 PRL group, NII, Tokyo


    Nov 21, 2012




 Le Duc Tung   Weekly meeting
Last week




     Described Quicksort using MapReduce, however,
         Required many MapReduce tasks: (log n) to n steps.




                           Le Duc Tung   Weekly meeting
Terasort using Hadoop



     Written by Owen O’Malley and Arun Murthy at Yahoo Inc.
     Won the annual general purpose terabyte sort benchmark in 2008
     and 2009.
         In 2008, 910 nodes x (4 dual-core processors, 4 disks, 8 GB
         memory). Sorting 1TB data in 3.48 minutes.
         In 2009, 3452 nodes x (2 Quadcore Xeons, 8 GB memory, 4 SATA).
         Sorting 100TB data in 173 minutes.
     Terasort includes 3 MapReduce applications:
         Teragen: generates the data.
         Terasort: samples the input data and uses them with MapReduce to
         sort the data.
         Teravalidate: validates the output is sorted.




                           Le Duc Tung   Weekly meeting
A closer look at MapReduce’s implementation model




  ”source: http://developer.yahoo.com/hadoop/tutorial/module4.html”
                           Le Duc Tung   Weekly meeting
Teragen




     Input: The number of rows and the output directory.
     Output format:




                           Le Duc Tung   Weekly meeting
MapReduce for Teragen


     Preparation of data for Map task.
                                    InputSplit
         The number of rows            →         splits of (startid, num of rows).
                                      RecordReader
         (startid, num of rows)            →         key-value pairs of (rowid, null)
     Example:
         We need generate data of 100 rows
         We use 2 mappers
         We generate 2 splits for 2 mappers.
                Split 0: consists of two values: startid = 0, number of rows = 50.
                Split 1: consists of two values: startid = 50, number of rows = 50.
         Key-value pairs generated from split 0:
                (0, null), (1, null), ..., (49, null)
         Key-value pairs generated from split 1:
                (50, null), (51, null), ..., (99, null)




                                  Le Duc Tung      Weekly meeting
Map and Reduce task


     Map
           Input is a pair of (rowid, null)
           Output is a pair of (key, value), in which
                Step 1: Create a random generator based on ”rowid”.
                Step 2: Create a key that is a text of 10 random bytes.
                Step 3: The value is a text of: ”rowid + 7 blocks of 10 characters +
                1 block of 8 characters”
                Characters are got from [’A’ .. ’Z’]
     Reduce task
           Does nothing.




                               Le Duc Tung   Weekly meeting
Terasort



     Step 1: (Before submitting job) Generate the sample keys by
     sampling the input. It is a sorted list of sampled keys.
     Step 2: (Using Partitioner) Distribute keys to the correspondent
     reducers using sample keys.
     Example: If we have N reducers.
           Generate (N-1) sample keys.
           All keys such that sample[i − 1] ≤ key ≤ sample[i] are sent to
           reduce i
           The above guarantees that the output of reduce i are all less than
           the output of reduce (i + 1)




                              Le Duc Tung   Weekly meeting
Generate sample keys

     From the input, create an array R containing all keys or the
     maximum of 100.000 keys (1MB). However, we can specify how
     many elements the array has.
     Using Quicksort to sort keys in the array.
     Create an array of (N-1) sample keys from the array R, in which N is
     the number of reducers, such that,
         (i − 1)-th element in the new array is (i ∗ stepSize)-th element in the
         original array.
         stepSize is of (No. of elements of R)/(No. of reducers)
     List of sampled keys is written into HDFS.




                             Le Duc Tung   Weekly meeting
Partitioner

  public interface Partitioner<K2, V2> extends JobConfigurable {
    /**
     * Get the partition number for a given key (hence record)
     * given the total number of partitions
     * i.e. number of reduce-tasks for the job.
     *
     * <p>Typically a hash function on a all or a subset of
     * the key.</p>
     *
     * @param key the key to be partitioned.
     * @param value the entry value.
     * @param numPartitions the total number of partitions.
     * @return the partition number for the <code>key</code>.
     */
    int getPartition(K2 key, V2 value, int numPartitions);
  }


                        Le Duc Tung   Weekly meeting
Partitioner and Trie tree




    getPartition() function will
    return the position of k in the
    trie tree.                              ”source:
                                            http://en.wikipedia.org/wiki/Trie”


                              Le Duc Tung   Weekly meeting
Trie for sample keys
      Assume that we have three sample keys:
           :x=P[ Q%D¡, in which, ASCII code of : and x is 58 and 120
           respectively.
           LsS8)—.ZLD, in which, ASCII code of L and s is 76 and 115
           respectively.
           le5awB.$sm, in which, ASCII code of l and e is 108 and 101
           respectively.
      Using the 2 first bytes of keys for building a 2-level trie tree.




                              Le Duc Tung   Weekly meeting
Default sort




    By default, data
    will be sorted
    before passing to
    reducers.
    Reduce task
    does nothing, it
    just outputs the
    (K, V) pairs to
    output files.




                        Le Duc Tung   Weekly meeting

Weitere ähnliche Inhalte

Was ist angesagt?

Multi core-architecture
Multi core-architectureMulti core-architecture
Multi core-architecturePiyush Mittal
 
05. performance-concepts-26-slides
05. performance-concepts-26-slides05. performance-concepts-26-slides
05. performance-concepts-26-slidesMuhammad Ahad
 
IP tables and Filtering
IP tables and FilteringIP tables and Filtering
IP tables and FilteringAisha Talat
 
introduction to system administration
introduction to system administrationintroduction to system administration
introduction to system administrationgamme123
 
Linux Memory Analysis with Volatility
Linux Memory Analysis with VolatilityLinux Memory Analysis with Volatility
Linux Memory Analysis with VolatilityAndrew Case
 
FUNCTION DEPENDENCY AND TYPES & EXAMPLE
FUNCTION DEPENDENCY  AND TYPES & EXAMPLEFUNCTION DEPENDENCY  AND TYPES & EXAMPLE
FUNCTION DEPENDENCY AND TYPES & EXAMPLEVraj Patel
 
Network Hardware And Software
Network Hardware And SoftwareNetwork Hardware And Software
Network Hardware And SoftwareSteven Cahill
 
Guided Transmission Media
Guided Transmission MediaGuided Transmission Media
Guided Transmission Mediaasrabatool
 
IP tables,Filtering.pptx
IP tables,Filtering.pptxIP tables,Filtering.pptx
IP tables,Filtering.pptxAyeCS11
 
Linux User Management
Linux User ManagementLinux User Management
Linux User ManagementGaurav Mishra
 
Transport services
Transport servicesTransport services
Transport servicesNavin Kumar
 
Types Of Keys in DBMS
Types Of Keys in DBMSTypes Of Keys in DBMS
Types Of Keys in DBMSPadamNepal1
 
Data Communications and Computer Networks
Data Communications and Computer Networks Data Communications and Computer Networks
Data Communications and Computer Networks Jubayer Alam Shoikat
 
04. availability-concepts
04. availability-concepts04. availability-concepts
04. availability-conceptsMuhammad Ahad
 
Fundamentals of Servers, server storage and server security.
Fundamentals of Servers, server storage and server security.Fundamentals of Servers, server storage and server security.
Fundamentals of Servers, server storage and server security.Aakash Panchal
 

Was ist angesagt? (20)

Multi core-architecture
Multi core-architectureMulti core-architecture
Multi core-architecture
 
05. performance-concepts-26-slides
05. performance-concepts-26-slides05. performance-concepts-26-slides
05. performance-concepts-26-slides
 
IP tables and Filtering
IP tables and FilteringIP tables and Filtering
IP tables and Filtering
 
introduction to system administration
introduction to system administrationintroduction to system administration
introduction to system administration
 
Filepermissions in linux
Filepermissions in linuxFilepermissions in linux
Filepermissions in linux
 
Linux Memory Analysis with Volatility
Linux Memory Analysis with VolatilityLinux Memory Analysis with Volatility
Linux Memory Analysis with Volatility
 
FUNCTION DEPENDENCY AND TYPES & EXAMPLE
FUNCTION DEPENDENCY  AND TYPES & EXAMPLEFUNCTION DEPENDENCY  AND TYPES & EXAMPLE
FUNCTION DEPENDENCY AND TYPES & EXAMPLE
 
Network management ppt
Network management pptNetwork management ppt
Network management ppt
 
Network Hardware And Software
Network Hardware And SoftwareNetwork Hardware And Software
Network Hardware And Software
 
Guided Transmission Media
Guided Transmission MediaGuided Transmission Media
Guided Transmission Media
 
IP tables,Filtering.pptx
IP tables,Filtering.pptxIP tables,Filtering.pptx
IP tables,Filtering.pptx
 
Multiprocessing operating systems
Multiprocessing operating systemsMultiprocessing operating systems
Multiprocessing operating systems
 
System calls
System callsSystem calls
System calls
 
Computer network
Computer networkComputer network
Computer network
 
Linux User Management
Linux User ManagementLinux User Management
Linux User Management
 
Transport services
Transport servicesTransport services
Transport services
 
Types Of Keys in DBMS
Types Of Keys in DBMSTypes Of Keys in DBMS
Types Of Keys in DBMS
 
Data Communications and Computer Networks
Data Communications and Computer Networks Data Communications and Computer Networks
Data Communications and Computer Networks
 
04. availability-concepts
04. availability-concepts04. availability-concepts
04. availability-concepts
 
Fundamentals of Servers, server storage and server security.
Fundamentals of Servers, server storage and server security.Fundamentals of Servers, server storage and server security.
Fundamentals of Servers, server storage and server security.
 

Ähnlich wie TeraSort

Advanced data structure
Advanced data structureAdvanced data structure
Advanced data structureShakil Ahmed
 
Python Interview Questions | Python Interview Questions And Answers | Python ...
Python Interview Questions | Python Interview Questions And Answers | Python ...Python Interview Questions | Python Interview Questions And Answers | Python ...
Python Interview Questions | Python Interview Questions And Answers | Python ...Simplilearn
 
Algorithms with-java-advanced-1.0
Algorithms with-java-advanced-1.0Algorithms with-java-advanced-1.0
Algorithms with-java-advanced-1.0BG Java EE Course
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data ManagementAlbert Bifet
 
Itroroduction to R language
Itroroduction to R languageItroroduction to R language
Itroroduction to R languagechhabria-nitesh
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docxLiterary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docxjeremylockett77
 
data structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysoredata structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysoreambikavenkatesh2
 
Basic concept of MATLAB.ppt
Basic concept of MATLAB.pptBasic concept of MATLAB.ppt
Basic concept of MATLAB.pptaliraza2732
 
Ida python intro
Ida python introIda python intro
Ida python intro小静 安
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimizationg3_nittala
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.pptssuserd64918
 
Time and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptxTime and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptxdudelover
 
Julia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, StrongerJulia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, StrongerKenta Sato
 

Ähnlich wie TeraSort (20)

Advanced data structure
Advanced data structureAdvanced data structure
Advanced data structure
 
Python Interview Questions | Python Interview Questions And Answers | Python ...
Python Interview Questions | Python Interview Questions And Answers | Python ...Python Interview Questions | Python Interview Questions And Answers | Python ...
Python Interview Questions | Python Interview Questions And Answers | Python ...
 
Using Parallel Computing Platform - NHDNUG
Using Parallel Computing Platform - NHDNUGUsing Parallel Computing Platform - NHDNUG
Using Parallel Computing Platform - NHDNUG
 
Algorithms with-java-advanced-1.0
Algorithms with-java-advanced-1.0Algorithms with-java-advanced-1.0
Algorithms with-java-advanced-1.0
 
Clojure intro
Clojure introClojure intro
Clojure intro
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
 
Itroroduction to R language
Itroroduction to R languageItroroduction to R language
Itroroduction to R language
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Python lecture 05
Python lecture 05Python lecture 05
Python lecture 05
 
Introduction2R
Introduction2RIntroduction2R
Introduction2R
 
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docxLiterary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
 
data structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysoredata structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysore
 
Basic concept of MATLAB.ppt
Basic concept of MATLAB.pptBasic concept of MATLAB.ppt
Basic concept of MATLAB.ppt
 
Ida python intro
Ida python introIda python intro
Ida python intro
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimization
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.ppt
 
Time and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptxTime and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptx
 
Julia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, StrongerJulia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, Stronger
 
Python Day1
Python Day1Python Day1
Python Day1
 
1.2 matlab numerical data
1.2  matlab numerical data1.2  matlab numerical data
1.2 matlab numerical data
 

Kürzlich hochgeladen

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Kürzlich hochgeladen (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

TeraSort

  • 1. Weekly meeting Week 5 Le Duc Tung PRL group, NII, Tokyo Nov 21, 2012 Le Duc Tung Weekly meeting
  • 2. Last week Described Quicksort using MapReduce, however, Required many MapReduce tasks: (log n) to n steps. Le Duc Tung Weekly meeting
  • 3. Terasort using Hadoop Written by Owen O’Malley and Arun Murthy at Yahoo Inc. Won the annual general purpose terabyte sort benchmark in 2008 and 2009. In 2008, 910 nodes x (4 dual-core processors, 4 disks, 8 GB memory). Sorting 1TB data in 3.48 minutes. In 2009, 3452 nodes x (2 Quadcore Xeons, 8 GB memory, 4 SATA). Sorting 100TB data in 173 minutes. Terasort includes 3 MapReduce applications: Teragen: generates the data. Terasort: samples the input data and uses them with MapReduce to sort the data. Teravalidate: validates the output is sorted. Le Duc Tung Weekly meeting
  • 4. A closer look at MapReduce’s implementation model ”source: http://developer.yahoo.com/hadoop/tutorial/module4.html” Le Duc Tung Weekly meeting
  • 5. Teragen Input: The number of rows and the output directory. Output format: Le Duc Tung Weekly meeting
  • 6. MapReduce for Teragen Preparation of data for Map task. InputSplit The number of rows → splits of (startid, num of rows). RecordReader (startid, num of rows) → key-value pairs of (rowid, null) Example: We need generate data of 100 rows We use 2 mappers We generate 2 splits for 2 mappers. Split 0: consists of two values: startid = 0, number of rows = 50. Split 1: consists of two values: startid = 50, number of rows = 50. Key-value pairs generated from split 0: (0, null), (1, null), ..., (49, null) Key-value pairs generated from split 1: (50, null), (51, null), ..., (99, null) Le Duc Tung Weekly meeting
  • 7. Map and Reduce task Map Input is a pair of (rowid, null) Output is a pair of (key, value), in which Step 1: Create a random generator based on ”rowid”. Step 2: Create a key that is a text of 10 random bytes. Step 3: The value is a text of: ”rowid + 7 blocks of 10 characters + 1 block of 8 characters” Characters are got from [’A’ .. ’Z’] Reduce task Does nothing. Le Duc Tung Weekly meeting
  • 8. Terasort Step 1: (Before submitting job) Generate the sample keys by sampling the input. It is a sorted list of sampled keys. Step 2: (Using Partitioner) Distribute keys to the correspondent reducers using sample keys. Example: If we have N reducers. Generate (N-1) sample keys. All keys such that sample[i − 1] ≤ key ≤ sample[i] are sent to reduce i The above guarantees that the output of reduce i are all less than the output of reduce (i + 1) Le Duc Tung Weekly meeting
  • 9. Generate sample keys From the input, create an array R containing all keys or the maximum of 100.000 keys (1MB). However, we can specify how many elements the array has. Using Quicksort to sort keys in the array. Create an array of (N-1) sample keys from the array R, in which N is the number of reducers, such that, (i − 1)-th element in the new array is (i ∗ stepSize)-th element in the original array. stepSize is of (No. of elements of R)/(No. of reducers) List of sampled keys is written into HDFS. Le Duc Tung Weekly meeting
  • 10. Partitioner public interface Partitioner<K2, V2> extends JobConfigurable { /** * Get the partition number for a given key (hence record) * given the total number of partitions * i.e. number of reduce-tasks for the job. * * <p>Typically a hash function on a all or a subset of * the key.</p> * * @param key the key to be partitioned. * @param value the entry value. * @param numPartitions the total number of partitions. * @return the partition number for the <code>key</code>. */ int getPartition(K2 key, V2 value, int numPartitions); } Le Duc Tung Weekly meeting
  • 11. Partitioner and Trie tree getPartition() function will return the position of k in the trie tree. ”source: http://en.wikipedia.org/wiki/Trie” Le Duc Tung Weekly meeting
  • 12. Trie for sample keys Assume that we have three sample keys: :x=P[ Q%D¡, in which, ASCII code of : and x is 58 and 120 respectively. LsS8)—.ZLD, in which, ASCII code of L and s is 76 and 115 respectively. le5awB.$sm, in which, ASCII code of l and e is 108 and 101 respectively. Using the 2 first bytes of keys for building a 2-level trie tree. Le Duc Tung Weekly meeting
  • 13. Default sort By default, data will be sorted before passing to reducers. Reduce task does nothing, it just outputs the (K, V) pairs to output files. Le Duc Tung Weekly meeting