Common MapReduce Patterns

Chris K Wensel
BuzzWords 2011
Engineer, Not Academic

•   Concurrent, Inc., Founder
    •   Cascading support and tools
    •   http://concurrentinc.com/

•   Cascading, Lead Developer (started Sept 2007)
    •   An alternative API to MapReduce
    •   http://cascading.org/

•   Formerly Hadoop mentoring and training
    •   Sun - Apple - HP - LexisNexis - startups - etc.

•   Formerly Systems Architect & Consultant
    •   Thomson/Reuters - TeleAtlas - startups - etc.
Overview

•   MapReduce
•   Heavy Lifting
•   Analytics
•   Optimizations
MapReduce

•   A “divide and conquer” strategy for parallelizing
    workloads against collections of data

•   Map & Reduce are two user-defined functions
    chained via key-value pairs

•   It’s really Map -> Group -> Reduce, where Group is
    built in
Keys and Values

•   Map translates input keys and values to new keys
    and values

•   The system Groups each unique key with all its
    values

•   Reduce translates the values of each unique key
    to new keys and values

    [K1,V1]           -> Map    -> [K2,V2]*
    [K2,V2]           -> Group  -> [K2,{V2,V2,...}]
    [K2,{V2,V2,...}]  -> Reduce -> [K3,V3]*

    * = zero or more
Word Count

Mapper
    [0, "when in the course of human events"]
      -> Map -> ["when",1] ["in",1] ["the",1] [...,1]

    ["when",1] ["when",1] ["when",1] ["when",1] ["when",1]
      -> Group -> ["when",{1,1,1,1,1}]

Reducer
    ["when",{1,1,1,1,1}] -> Reduce -> ["when",5]
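As a concrete sketch of the pattern (not from the original slides), here is word count against vanilla Hadoop’s Java mapreduce API; class and field names are illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: [offset, line] -> [word, 1] for every token in the line
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (token.isEmpty())
        continue;
      word.set(token);
      context.write(word, ONE); // [K2,V2] = ["when",1], ["in",1], ...
    }
  }
}

// Reduce: [word, {1,1,1,1,1}] -> [word, 5]
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values)
      sum += v.get();
    total.set(sum);
    context.write(key, total); // one output pair per unique key
  }
}
```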
Divide and Conquer Parallelism

•   The ‘records’ entering the Map and the ‘groups’
    entering the Reduce are independent

•   That is, there is no expectation of order or
    requirement to share state between records/groups

•   So arbitrary numbers of Map and Reduce function
    instances can be created against arbitrary portions
    of the input data
Cluster

[figure: a cluster of racks, each rack holding nodes; map and reduce
task instances run scattered across the nodes]

•   Multiple instances of each Map and Reduce
    function are distributed throughout the cluster
Another View

    [K1,V1] -> Map -> [K2,V2] -> Combine -> Group -> [K2,{V2,...}] -> Reduce -> [K3,V3]

[figure: several Mapper Tasks (all running the same code) read
split1, split2, ... of the input file; each shuffles its output to
the Reducer Tasks, which write part-00000, part-00001, ... part-000N
into the output directory. Mappers must complete before Reducers can
begin.]
Complex job assemblies

•   Real applications are many MapReduce jobs chained together

•   Linked by intermediate (usually temporary) files

•   Executed in order, by hand, from the ‘client’ application

    Count Job:  File -> Map -> [k,v] -> [k,[v]] -> Reduce -> [k,v] -> File
    Sort Job:   File -> Map -> [k,v] -> [k,[v]] -> Reduce -> [k,v] -> File

    [k,v]   = key and value pair
    [k,[v]] = key and associated values collection
Real World Apps
[figure: the dependency graph of a production application - 75
chained MapReduce jobs rendered as a DAG]

1 app, 75 jobs

green  = map + reduce
purple = map
blue   = join/merge
orange = map split
Heavy Lifting

•   Things we must do because data can be heavy

•   These patterns are natural to MapReduce and easy to implement

•   But they leave some room for composition/aggregation within a
    single Map/Reduce (e.g., Filter + Binning)

•   (leading us to think of Hadoop as an ETL framework)

•   Record Filtering            •   Merging
•   Parsing, Conversion         •   Binning
•   Counting, Summing           •   Distributed Tasks
•   Unique
Record Filtering

•   Think Unix ‘grep’

•   Filtering is discarding unwanted values (or
    preserving wanted ones)

•   Only uses a Map function, no Reducer
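A minimal map-only sketch, assuming a hypothetical grep.pattern configuration key; with zero reduce tasks, Map output is written straight to the output path:

```java
import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only filtering: keep lines matching a pattern, discard the rest
public class GrepMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
  private Pattern pattern;

  @Override
  protected void setup(Context context) {
    // "grep.pattern" is an illustrative config key, not a Hadoop built-in
    pattern = Pattern.compile(
        context.getConfiguration().get("grep.pattern", "ERROR"));
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (pattern.matcher(value.toString()).find())
      context.write(NullWritable.get(), value); // preserve wanted records
  }
}
// driver wiring: job.setNumReduceTasks(0); // no Reducer, no shuffle
```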
Parsing, Conversion

•   Think Unix ‘sed’

•   A Map function that takes an input key and/or value and
    translates it into a new format

•   Examples:

    •   raw logs to delimited text, or to a storage-efficient
        binary format for archival

    •   entity extraction
Counting, Summing

•   The same as SQL aggregation functions

•   Simply applying some function to the values
    collection seen in Reduce

•   Other examples:
    •   average, max, min, unique
Merging

•   Where many files of the same type are converted to one
    output path
•   Map-side merges
    •   One directory with as many part files as Mappers
•   Reduce-side merges
    •   Allows for removing duplicates or deleted items
    •   One directory with as many part files as Reducers
•   Examples
    •   Nutch
    •   Normalizing log files (Apache, log4j, etc.)
Binning

•   Where the values associated with unique keys are
    persisted together
•   Typically a directory path based on the key’s value
•   Must be conscious of the total open files; remember, there
    are no appends
•   Examples:
    •   web log files by year/month/day
    •   trade data by symbol
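One hedged way to bin on Hadoop is MultipleOutputs, deriving each output directory from the key; this sketch assumes the key already looks like a year/month/day path:

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Reduce-side binning: all values for a key land under one directory
public class BinningReducer extends Reducer<Text, Text, NullWritable, Text> {
  private MultipleOutputs<NullWritable, Text> out;

  @Override
  protected void setup(Context context) {
    out = new MultipleOutputs<>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> lines, Context context)
      throws IOException, InterruptedException {
    // key is assumed to look like "2011/06/06"; each distinct key holds
    // a file open, so keep the number of keys per reducer in check
    for (Text line : lines)
      out.write(NullWritable.get(), line, key.toString() + "/part");
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    out.close();
  }
}
```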
Distributed Tasks

•   Simply where a Map or Reduce function executes some
    ‘task’ based on the input key and value
•   Examples:
    •   web crawling
    •   load testing services
    •   rdbms/nosql updates
    •   file transfers (S3)
    •   image to PDF conversion (NYT on EC2)
Basic Analytic Patterns

•   Some of these patterns are unnatural to MapReduce

•   We think in terms of columns/fields, not key-value
    pairs

•   (leading us to think of Hadoop as an RDBMS)

    •   Group By                  •   Secondary Unique
    •   Unique                    •   CoGrouping and Joining
    •   Secondary Sort
Composite Keys/Values

    [K1,V1] = <A1,B1,C1,...>

•   It is easier to think in columns/fields
    •   e.g. “firstname” & “lastname”, not “line”
•   Whether a set of columns are Keys or Values is
    arbitrary
•   Keys become a means to piggyback the properties of MR,
    and become an implementation detail
Group By

    dept_id    name
    1001       Jim
               Mary
               Susan
    1002       Fred
               Wilma
               Ernie
               Barny

•   Group By is where Value fields are grouped by Grouping fields
•   Above, the Map output key is “dept_id” and the value is “name”
Group By

    Mapper:
      piggyback code:  [K1,V1] -> <A1,B1,C1,D1>
      user code:       Map
      piggyback code:  <A2,B2> -> K2, <C2,D2> -> V2

    Reducer:
      piggyback code:  [K2,{V2,V2,...}] -> <A2,B2,{<C2,D2>,...}>
      user code:       Reduce
      piggyback code:  <A3,B3> -> K3, <C3,D3> -> V3

•   So the K2 key becomes a composite Key of
    •   key: [grouping], value: [values]
Unique

Mapper
    [0, "when in the course of human events"]
      -> Map -> ["when",null] ["in",null] [...,null]

    ["when",null] ["when",null] ["when",null] ...
      -> Group -> ["when",{nulls}]

Reducer
    ["when",{nulls}] -> Reduce -> ["when",null]

•   Or Distinct (as in SQL)
•   Globally finding all the unique values in a dataset
    •   Usually finding the unique values in a column
•   Often used to filter a second dataset using a join
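A sketch of the Reduce side: Group has already collapsed the duplicates into one key, so the Reducer just emits each key once and ignores the null placeholders:

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Distinct: the key does all the work; values are null placeholders
public class DistinctReducer
    extends Reducer<Text, NullWritable, Text, NullWritable> {
  @Override
  protected void reduce(Text key, Iterable<NullWritable> values, Context context)
      throws IOException, InterruptedException {
    context.write(key, NullWritable.get()); // ["when",{nulls}] -> ["when",null]
  }
}
```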
Secondary Sort

    (group)        (sorted value)    (remaining value)
    Date           Time              Url

    08/08/2008     1:00:00           http://www.example.com/foo
    08/08/2008     1:01:00           http://www.example.com/bar
    08/08/2008     1:01:30           http://www.example.com/baz

•   Secondary Sorting is where
    •   some Fields are grouped on, and
    •   some of the remaining Fields are sorted within
        their grouping
Secondary Sort

    Mapper:
      piggyback code:  [K1,V1] -> <A1,B1,C1,D1>
      user code:       Map
      piggyback code:  <A2,B2><C2> -> K2, <D2> -> V2

    Reducer:
      piggyback code:  [K2,{V2,V2,...}] -> <A2,B2,{<C2,D2>,...}>
      user code:       Reduce
      piggyback code:  <A3,B3> -> K3, <C3,D3> -> V3

•   So the K2 key becomes a composite Key of
    •   key: [grouping, secondary], value: [remaining values]
•   The trick is to piggyback the secondary fields on the Reduce
    sort, yet keep them out of the unique key comparison when
    grouping
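In raw Hadoop, the trick takes three pieces, sketched here for the date/time example (names are illustrative): a composite key whose full comparison includes the sort field, a Partitioner that hashes only the grouping field, and a grouping Comparator that ignores the sort field:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key: grouping field (date) plus piggybacked sort field (time)
public class DateTimeKey implements WritableComparable<DateTimeKey> {
  final Text date = new Text();
  final LongWritable time = new LongWritable();

  public void write(DataOutput out) throws IOException {
    date.write(out);
    time.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    date.readFields(in);
    time.readFields(in);
  }

  // full sort: by date, then by time within the date
  public int compareTo(DateTimeKey o) {
    int c = date.compareTo(o.date);
    return c != 0 ? c : time.compareTo(o.time);
  }
}

// Partition on the grouping field only, so all times for a date
// reach the same Reducer
class DatePartitioner extends Partitioner<DateTimeKey, Text> {
  public int getPartition(DateTimeKey key, Text value, int numPartitions) {
    return (key.date.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

// Group on the date alone: time orders values but never splits the group
class DateGroupingComparator extends WritableComparator {
  DateGroupingComparator() { super(DateTimeKey.class, true); }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    return ((DateTimeKey) a).date.compareTo(((DateTimeKey) b).date);
  }
}

// driver wiring:
//   job.setPartitionerClass(DatePartitioner.class);
//   job.setGroupingComparatorClass(DateGroupingComparator.class);
```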
Secondary Unique

(assume the Secondary Sorting magic happens here)

Mapper
    [0, "when in the course of human events"]
      -> Map -> [0,"when"] [0,"in"] [0,"the"] [0,...]

    [0,"when"] [0,"in"] [0,"the"] ...
      -> Group -> [0,{"in","in","the","when","when",...}]

Reducer
    [0,{"in","in","the","when","when",...}]
      -> Reduce -> ["in",null] ["the",null] ["when",null]

•   Secondary Unique is where the grouped values are uniqued
    •   ... in a “scale free” way
•   Perform a Secondary Sort...
•   The Reducer removes duplicates by discarding every value that
    matches the previous value
    •   since the values are now ordered, there is no need to
        maintain a Set of values
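Given the secondary sort above, the dedup Reducer is just a comparison against the previous value; a sketch:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Values arrive sorted, so duplicates are adjacent; no Set required,
// memory stays constant however many values a key has ("scale free")
public class SecondaryUniqueReducer
    extends Reducer<IntWritable, Text, Text, NullWritable> {
  @Override
  protected void reduce(IntWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    String previous = null;
    for (Text value : values) {
      String current = value.toString();
      if (!current.equals(previous))             // first of each run wins
        context.write(new Text(current), NullWritable.get());
      previous = current;
    }
  }
}
```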
Joining

    lhs data                 rhs data
    dept_id    name          dept_name
    1001       Jim           Accounting
               Mary          Accounting
               Susan         Accounting
    1002       Fred          Shipping
               Wilma         Shipping
               Ernie         Shipping
               Barny         Shipping

•   Where two or more input data sets are ‘joined’ by a
    common key
    •   Like a SQL join
Join Definitions

•   Consider the input data [key, value]:
    •   LHS = [0,a] [1,b] [2,c]
    •   RHS = [0,A]       [2,C] [3,D]
•   Joins on the key:
    •   Inner
        •   [0,a,A] [2,c,C]
    •   Outer (Left Outer, Right Outer)
        •   [0,a,A] [1,b,null] [2,c,C] [3,null,D]
    •   Left (Left Inner, Right Outer)
        •   [0,a,A] [1,b,null] [2,c,C]
    •   Right (Left Outer, Right Inner)
        •   [0,a,A] [2,c,C] [3,null,D]
CoGrouping

•   Before Joining, CoGrouping must happen

•   Simply concurrent GroupBy operations on each
    input data set
GroupBy vs CoGroup

    GroupBy (lhs data):          CoGroup (lhs and rhs data):

    dept_id    name              dept_id    name     dept_name
    1001       Jim               1001       Jim      Accounting
               Mary                         Mary
               Susan                        Susan
    1002       Fred              1002       Fred     Shipping
               Wilma                        Wilma
               Ernie                        Ernie
               Barny                        Barny

    (each grouping holds independent collections of
    unordered values)
CoGroup Joined

    dept_id    name     dept_name
    1001       Jim      Accounting
               Mary     Accounting
               Susan    Accounting
    1002       Fred     Shipping
               Wilma    Shipping
               Ernie    Shipping
               Barny    Shipping

•   Considering the previous data, a typical Inner Join
CoGrouping

    Mapper [n] and Mapper [n+1] (one per input set, same Job):
      piggyback code:  [K1,V1]   -> <A1,B1,C1,D1>
                       [K1',V1'] -> <A1',B1',C1',D1'>
      user code:       Map
      piggyback code:  <A2,B2> -> K2, [n]<C2,D2> -> V2

    Reducer:
      piggyback code:  [K2,{V2,V2,...}] -> <A2,B2,{<C2,D2,C2',D2'>,...}>
      user code:       Reduce
      piggyback code:  <A3,B3> -> K3, <C3,D3> -> V3

•   A Map must run for each input set, in the same Job (n, n+1, etc.)
•   CoGrouping must happen against each common key
Joining

    Reducer:
      piggyback code:  [K2,{V2,V2,...}] -> <A2,B2,{[n]<C2,D2>,[n+1]...}>
      cogroup:         <A2,B2,{<C2,D2>,...},{<C2',D2'>,...}>
      join:            {<C2,D2>,...} x {<C2',D2'>,...} -> <C2,D2,C2',D2'>
                       giving <A2,B2,{<C2,D2,C2',D2'>,...}>
      user code:       Reduce
      piggyback code:  <A3,B3> -> K3, <C3,D3> -> V3

•   The CoGroups must be joined

•   Finally the Reduce can be applied
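A hedged sketch of the reduce-side inner join: each Map tags its output value with its source ("L" or "R" here, standing in for [n] and [n+1]); the Reducer separates the grouped values back into their source collections, then crosses them. Buffering both sides is the naive form; a secondary sort on the tag lets one side stream instead:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// CoGroup then Join: split the grouped values by source, then emit
// the cross product per key
public class InnerJoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> lhs = new ArrayList<>();
    List<String> rhs = new ArrayList<>();
    for (Text tagged : values) {       // CoGroup: independent collections
      String v = tagged.toString();
      if (v.startsWith("L\t"))
        lhs.add(v.substring(2));
      else
        rhs.add(v.substring(2));
    }
    for (String l : lhs)               // Join: inner, so an empty side emits nothing
      for (String r : rhs)
        context.write(key, new Text(l + "\t" + r));
  }
}
```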
Optimizations

•   Patterns for reducing IO

    •   Identity Mapper          •   Partial Aggregates
    •   Map Side Join            •   Similarity Joins
    •   Combiners
Identity Mapper

[figure: a Cascading flow plan for a log-processing application; the
Map-only stages between a previous Reduce (GroupBy/Every) and the
next GroupBy or CoGroup have been collapsed into Identity functions]

•   Move Map operations to the previous Reduce

•   Replace the now-empty Map with an Identity function

•   Assumes the Map operations reduce the data
Map Side Joins

•   Bypasses the (immediate) need for a Reducer
•   Symmetrical
    •   Where the LHS and RHS are of equivalent size
    •   Requires the data to be sorted on the key
•   Asymmetrical
    •   One side is small enough to fit in memory
    •   Typically a hashtable lookup
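A sketch of the asymmetrical case: the small RHS is loaded into a hashtable once per task, and every LHS record probes it, so no shuffle is needed. The file name and tab-delimited layout are assumptions; in practice the small side would be shipped to each task, e.g. via the distributed cache:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Asymmetrical map-side join: in-memory lookup table, Map-only job
public class HashJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> rhs = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // "depts.tsv" is a stand-in for the small side, made local to the task
    try (BufferedReader in = new BufferedReader(new FileReader("depts.tsv"))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t", 2);  // dept_id \t dept_name
        rhs.put(parts[0], parts[1]);
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t", 2); // dept_id \t name
    String match = rhs.get(parts[0]);                 // hashtable probe
    if (match != null)                                // inner join semantics
      context.write(new Text(parts[0]), new Text(parts[1] + "\t" + match));
  }
}
```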
Combiners

Mapper
    [0, "when in the course of human events"]
      -> Map -> ["when",1] ["in",1] ["the",1] [...,1]

    Combiner (same implementation as the Reducer)
      ["when",1] ["when",1]  -> Group  -> ["when",{1,1}]
      ["when",{1,1}]         -> Reduce -> ["when",2]

Reducer
    ["when",2] ["when",1] ["when",2]
      -> Group  -> ["when",{2,1,2}]
    ["when",{2,1,2}]
      -> Reduce -> ["when",5]

•   Where the Reduce runs Map side, and again Reduce side
•   Only works if the Reduce is commutative and associative
•   Reduces bandwidth by trading CPU for IO
    •   serialization/deserialization during local sorting before
        combining
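Because word count’s Reduce is commutative and associative, the very same class can be registered as the combiner; a driver sketch reusing the earlier word-count classes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wordcount");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenMapper.class);    // from the earlier sketch
    job.setCombinerClass(SumReducer.class);   // same code, run Map side
    job.setReducerClass(SumReducer.class);    // ... and again Reduce side
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```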
Partial Aggregates

Mapper
    [0, "when in the course of human events"]
      -> Map -> ["when",1] ["in",1] ["the",1] [...,1]

    Partial (in-memory, inside the Mapper)
      ["when",1] ["when",1] -> ["when",2]

Reducer
    ["when",2] ["when",1] ["when",2]
      -> Group  -> ["when",{2,1,2}]
    ["when",{2,1,2}]
      -> Reduce -> ["when",5]

    (also provides an opportunity to promote the functionality of
    the next Map into this Reduce)

•   Supports any aggregate type, while being composable with other
    aggregates
•   Reduces bandwidth by trading Memory for IO
    •   very important for a CPU-constrained cluster
    •   use a bounded LRU cache to keep memory constant (requires
        tuning)
Partial Aggregates

    incoming            partial unique        emitted
    [a,b,c,a,a,b]  ->   LRU* cache       ->   [a,b,c,a,b]

    *cache size of 2

    incoming value      cache state           discarded value
    a              ->   {a,_}            ->   _
    b              ->   {b,a}            ->   _
    c              ->   {c,b}            ->   a
    a              ->   {a,c}            ->   b
    a              ->   {a,c}            ->   (cache hit)
    b              ->   {b,a}            ->   c

•   It is OK that dupes emit from a Mapper, and across
    Mappers (or previous Reducers!)
•   Final aggregation happens in the Reducer
•   The larger the cache, the fewer the dupes
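A sketch of the bounded-LRU idea applied to the word-count partial sum (the cache size is the tuning knob the slide mentions); an access-ordered LinkedHashMap supplies the LRU behavior, and evicted entries simply emit as partial counts:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side partial aggregation with constant memory: dupes may still
// emit (that is OK), the Reducer performs the final sum
public class PartialSumMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int CACHE_SIZE = 100_000; // requires tuning

  // access order == LRU iteration order, eldest entry first
  private final LinkedHashMap<String, Integer> cache =
      new LinkedHashMap<>(CACHE_SIZE, 0.75f, true);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      cache.merge(token, 1, Integer::sum);  // aggregate in memory
      if (cache.size() > CACHE_SIZE)
        evictEldest(context);               // discard -> emit a partial count
    }
  }

  private void evictEldest(Context context)
      throws IOException, InterruptedException {
    Iterator<Map.Entry<String, Integer>> it = cache.entrySet().iterator();
    Map.Entry<String, Integer> eldest = it.next();
    context.write(new Text(eldest.getKey()), new IntWritable(eldest.getValue()));
    it.remove();
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    for (Map.Entry<String, Integer> e : cache.entrySet()) // flush remainder
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
  }
}
```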
Tradeoffs


• CPU for IO == fault tolerance
• Memory for IO == performance


Similarity Join

•   Compare all values on the LHS to all values on the RHS to find
    duplicates (or similar values)
•   Naive approaches
    •   Cross Join (all data through one reducer)
    •   In-common features (very common features will
        bottleneck)
Set-Similarity Joining


• “Efficient Parallel Set-Similarity Joins Using
  MapReduce” - R Vernica, M Carey, C Li

• Only compare candidate pairs
• Candidates share uncommon features
[figure: the set-similarity pipeline traced over four sample records]

    1: records
    2: count tokens
    3: order tokens by least frequent, discard the common ones
    4: find uncommon features in common
    5: candidate pairs
    6: final compare

•   Records 1 and 3 share uncommon features
•   Thus they are candidates for a full comparison
Match two sets using prefix filtering - one job per step, linked by
intermediate files:

    Tokenize Job             Map -> Reduce
    Count Job                Map -> Reduce
    Join Tokens/Counts Job   Map -> Reduce
    Sort/Prefix Filter Job   Map -> Reduce
    Self Join Job            Map -> Reduce
    Unique Pairs Job         Map -> Reduce
    Join LHS Job             Map -> Reduce
    Join RHS / Match Job     Map -> Reduce
Duality


• Note the use of the previous patterns to route
  data to implement a more efficient algorithm




Use a Higher Abstraction

•   Command Line
    •   Multitool - CLI for parallel sed, grep & joins

•   API
    •   Cascading - Java Query API and Planner
    •   Plume - “approximate clone of FlumeJava”

•   Interactive Shell
    •   Cascalog - Clojure+Cascading query language (API also)
    •   Pig - a text syntax
    •   Hive - syntax + infrastructure - SQL “like”
References

•   Set Similarity
    •  http://www.slideshare.net/ydn/4-similarity-joinshadoopsummit2010
    •  http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/

•   MapReduce Text Processing
    • http://www.umiacs.umd.edu/~jimmylin/book.html

•   Plume/FlumeJava
    •  http://portal.acm.org/citation.cfm?id=1806596.1806638
    •  http://github.com/tdunning/Plume/wiki



I’m Hiring

• Enterprise Java server and web client
• Language design, compilers, and interpreters
• No Hadoop experience required
• More info
 • http://www.concurrentinc.com/careers/
Resources

•   Chris K Wensel
    •   chris@wensel.net
    •   @cwensel

•   Cascading & Cascalog
    •   http://cascading.org
    •   @cascading

•   Concurrent, Inc.
    •   http://concurrentinc.com
    •   @concurrent
Appendix



Simple Total Sorting

•   Where the lines in a result file should be sorted

•   Must set the number of reducers to 1
    •   sorting in MR is local per Reduce, not global across
        Reducers
Why Sorting Isn’t “Total”

[figure: Mappers read [aaa,aab,aac] ... [zzx,zzy,zzz] and emit keys;
the shuffle partitions them randomly, so the Reducers produce
[aaa,zzx], [aac,zzz], and [aab,zzy] - each file sorted locally, but
with no order across files]

•   Keys emitted from Map are naturally sorted at a given Reducer
•   But they are Partitioned to the Reducers in a random way
•   Thus, only one Reducer can be used for a total sort
Distributed Total Sort

•   To work, the Shuffling phase must be modified with:
    •   a custom Partitioner that partitions on the
        distribution of the ordered Keys
    •   a custom Comparator for comparing the Key types
        •   Strings work by default
Distributed Total Sort - Details

[Figure: a Trie of sampled key prefixes: a ... z at the root, then ar ... ax and za ... zo,
then ara ... ari, axe ... axi, zag ... zap, zon ... zoo, down to full keys such as aran,
aria, axis, and zone; a key’s partition is found by walking the Trie]

•   Sample all K2 values and build balanced distribution for num reducers

    •   Sample all input keys and divide into partitions

    •   Write out boundaries of partitions

•   Supply Partitioner that looks up partition for current K2 value

    •   Read boundaries into a Trie (pronounced ‘try’) data structure

•   Use appropriate Comparator for Key type
                                                                 Copyright Concurrent, Inc. 2011. All rights reserved.
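On the lookup itself: Hadoop’s TotalOrderPartitioner walks a Trie for binary key types and falls back to a binary search over the boundaries for everything else. A minimal sketch of the boundary lookup in the binary-search form; BoundaryPartitioner is a hypothetical class, and reading the boundaries file is elided.

import java.util.Arrays;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical, for illustration: routes each key to the partition whose
// key range contains it, given num-reducers - 1 sorted boundary keys.
public class BoundaryPartitioner extends Partitioner<Text, Text> {
  // In practice, read from the boundaries file written during sampling.
  private final Text[] splitPoints = loadSplitPoints();

  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    int pos = Arrays.binarySearch(splitPoints, key) + 1;
    // If the key equals boundary i, binarySearch returns i, so the key lands
    // in partition i + 1; otherwise the insertion point is the count of
    // boundaries below the key, which is exactly the partition index.
    return pos < 0 ? -pos : pos;
  }

  private static Text[] loadSplitPoints() {
    // elided: read and sort the sampled boundary keys
    return new Text[0];
  }
}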


Editor's Notes

  37. (Combiners) Commutativity is the ability to change the order of operands without changing the result. Associativity means that, within an expression containing two or more of the same associative operator in a row, the order of operations does not matter as long as the sequence of the operands is unchanged; rearranging the parentheses in such an expression will not change its value.