SlideShare a Scribd company logo
1 of 17
Download to read offline
Implementing Batcher’s Bitonic Sort in C++
An Investigation into using Library-Based, Data-Parallelism
                     Support in C++.
                       October 2011


      Jason Mc Guiness, Colin Egan & Russel Winder

               ppd-bcs-dsg-11@hussar.demon.co.uk
                       c.egan@herts.ac.uk
                       russel@russel.org.uk
                   http://libjmmcg.sf.net/


                      October 5, 2011
Sequence of Presentation.



       An introduction: why I am here.
       How do we manage all these threads:
            libraries, languages and compilers.
       Very brief introduction of parallel sorting algorithms.
       An outline of Parallel Pixie Dust (PPD).
       Outline of implementation using PPD.
       Conclusion.
       Future directions.
Introduction.

   Why yet another thread-library presentation?
       Because we still find it hard to write multi-threaded programs
       correctly.
            According to programming folklore.
       We haven’t successfully replaced the von Neumann
       architecture:
            Stored program architectures are still prevalent.
       The memory wall still affects us:
            The CPU-instruction retirement rate, i.e. rate at which
            programs require and generate data, exceeds the the memory
            bandwidth - a by product of Moore’s Law.
                 Modern architectures add extra cores to CPUs, in this
                 instance, extra memory buses which feed into those cores.
A Quick Review of Some Threading Models:

      Restrict ourselves to library-based techniques.
      Raw library calls.
      The “thread as an active class”.
      Co-routines.
      The “thread pool” containing those objects.
           Producer-consumer models are a sub-set of this model.
      The issue: classes used to implement business logic also
      implement the operations to be run on the threads. This often
      means that the locking is intimately entangled with those
      objects, and possibly even the threading logic.
           This makes it much harder to debug these applications. The
           question of how to implement multi-threaded debuggers
           correctly is still an open question.
Parallel Sorting...
   Parallel sorting has a history:
        The three Hungarians (Ajtai, Komlos & Szemeredi, AKS)
        presented an algorithm that takes O (log (n)) time with n
        processors in 1983, but has impractically large complexity
        constants.
        Cole’s parallel merge sort was presented in 1986 and takes
        O (log (n)) time with n processors, but has much smaller
        complexity constants.
        Batcher’s bitonic sort was presented in 1968 and takes
        O log2 (n) time, but according to Knuth when comparing it
        to AKS: "Batcher’s method is much better, unless n exceeds
        the total memory capacity of all computers on earth!”
        Research has indicated that Batcher’s algorithm has better
        complexity constants than Cole’s, so we shall focus on
        Batcher’s algorithm.
An Examination of SPEC2006 for STL Usage.

   SPEC2006 was examined for usage of STL algorithms used within
   it, sorting is important to one of the benchmarks, so a potential
   target for parallelism.
                                                          Distribution of algorithms in 483.xalancbmk.                                                                                                                                                        Distribution of algorithms in 477.dealII.
                           100                                                                                                                                                     100

                                                                                                                                set::count                                                                                                                                                                                                                                                                    set::count
                                                                                                                                map::count                                                                                                                                                                                                                                                                    map::count
                                                                                                                                string::find                                                                                                                                                                                                                                                                  string::find
                                                                                                                                set::find                                                                                                                                                                                                                                                                     set::find
                           80                                                                                                                                                       80
                                                                                                                                map::find                                                                                                                                                                                                                                                                     map::find
                                                                                                                                Global algorithm                                                                                                                                                                                                                                                              Global algorithm



                           60                                                                                                                                                       60
   Number of occurences.




                                                                                                                                                           Number of occurences.
                           40                                                                                                                                                       40




                           20                                                                                                                                                       20




                             0                                                                                                                                                       0
                                                                                                                                                                                         copy


                                                                                                                                                                                                         fill


                                                                                                                                                                                                                       sort


                                                                                                                                                                                                                                      find


                                                                                                                                                                                                                                                           count_if


                                                                                                                                                                                                                                                                                    accumulate


                                                                                                                                                                                                                                                                                                               min_element


                                                                                                                                                                                                                                                                                                                                      inner_product


                                                                                                                                                                                                                                                                                                                                                                  transform


                                                                                                                                                                                                                                                                                                                                                                                            inplace_merge


                                                                                                                                                                                                                                                                                                                                                                                                                       find_if


                                                                                                                                                                                                                                                                                                                                                                                                                                         binary_search


                                                                                                                                                                                                                                                                                                                                                                                                                                                                         find
                                 for_each


                                            swap


                                                   find


                                                           find_if


                                                                     copy


                                                                            reverse


                                                                                      remove


                                                                                               transform


                                                                                                           stable_sort


                                                                                                                         sort


                                                                                                                                  replace


                                                                                                                                            fill


                                                                                                                                                   count




                                                                                                                                                                                                fill_n


                                                                                                                                                                                                                swap


                                                                                                                                                                                                                              count


                                                                                                                                                                                                                                             lower_bound


                                                                                                                                                                                                                                                                      max_element


                                                                                                                                                                                                                                                                                                 upper_bound


                                                                                                                                                                                                                                                                                                                             unique


                                                                                                                                                                                                                                                                                                                                                      remove_if


                                                                                                                                                                                                                                                                                                                                                                              nth_element


                                                                                                                                                                                                                                                                                                                                                                                                            for_each


                                                                                                                                                                                                                                                                                                                                                                                                                                 equal


                                                                                                                                                                                                                                                                                                                                                                                                                                                         adjacent_find
                                                                            Algorithm.
                                                                                                                                                                                                Algorithm.
Part 1: Design Features of PPD.

   The main design features of PPD are:
       It targets general purpose threading using a data-flow model of
       parallelism:
            This type of scheduling may be viewed as dynamic scheduling
            (run-time) as opposed to static scheduling (potentially
            compile-time), where the operations are statically assigned to
            execution pipelines, with relatively fixed timing constraints.
       DSEL implemented as futures and thread pools (of many
       different types using traits).
            Can be used to implement a tree-like thread schedule.
            “thread-as-an-active-class” exists.
            Gives rise to important properties: efficient, dead-lock &
            race-condition free schedules.
Part 2: Design Features of PPD.


   Extensions to the basic DSEL have been added:
       Certain binary functors: each operand is executed in parallel.
       Adaptors for the STL collections to assist with thread-safety.
            Combined with thread pools allows replacement of the STL
            algorithms.
       Implemented GSS(k), or baker’s scheduling:
            May reduce synchronisation costs on thread-limited systems or
            pools in which the synchronisation costs are high.
       Amongst other influences, PPD was born out of discussions
       with Richard Harris and motivated by Kevlin Henney’s
       presentation to ACCU’04 regarding threads.
The STL-style Parallel Algorithms.

   That analysis, amongst other reasons lead me to implement in
   PPD:
    1. for_each()
    2. find_if() & find()
    3. count_if() & count()
    4. transform() (both overloads)
    5. copy()
    6. accumulate() (both overloads)
    7. fill_n() & fill()
    8. reverse()
    9. max_element() & min_element() (both overloads)
   10. merge() & sort() (both overloads)
Basic Example of sort().
              Listing 1: Parallel sort() with a Thread Pool and Future.
   t y p e d e f ppd : : s a f e _ c o l l n <
                 v e c t o r <l o n g >, l o c k : : r e c u r s i v e _ c r i t i c a l _ s e c t i o n _ l o c k _ t y p e
   > vtr_colln_t ;

   vtr_colln_t v ;
   v . push_back ( 2 ) ; v . push_back ( 1 ) ;
   execution_context context (
             p o o l < j o i n a b l e ()<< p o o l . s o r t <a s c e n d i n g >(v )
                      <
   );
   ∗context ;


             To assist in making the code appear to be readable, a number
             of typedefs have been omitted.
             The safe_colln adaptor provides the lock for the collection,
                      in this case a simple EREW lock,
                      only locks the collection, not the elements in it.
             The sort() returns an execution_context:
                      The input collection has a read-lock taken on it.
                      Released when the asynchronous sort has completed, when the
                      read-lock has been dropped.
                      It makes no sense to return a random iterator.
The typedefs used.


                        Listing 2: typedefs in the sort() example.
  t y p e d e f ppd : : t h r e a d _ p o o l <
                p o o l _ t r a i t s : : worker_threads_get_work , p o o l _ t r a i t s : : f i x e d _ s i z e ,
                pool_adaptor<
                               g e n e r i c _ t r a i t s : : j o i n a b l e , platform_api , heavyweight_threading ,
                               pool_traits : : normal_fifo , std : : less ,1
                >
  > pool_type ;
  t y p e d e f pool_type : : void_exec_cxt execution_context ;
  typedef execution_context : : j o i n a b l e j o i n a b l e ;



          Expresses the flexibility & optimizations available:
                  pool_traits::prioritised_queue: specifies work should
                  be ordered by the trait std::less.
                          Could be used to order work by an estimate of time taken to
                          complete - very interesting, but not implemented.
                  GSS(k) scheduling implemented: number specifies batch size.
Implementing Batcher’s Bitonic Sort with PPD.

                        Listing 3: Schematic of the sort() method.
   const i n _ i t e r a t o r middle ( boost : : next ( begin , half_size_of_portion ) ) ;
   execution_context sort_lhs_ascending (
             p o o l < j o i n a b l e ()<< s o r t <a s c e n d i n g >( b e g i n , m i d d l e , compare )
                      <
   );
   execution_context sort_rhs_descending (
             p o o l < j o i n a b l e ()<< s o r t <d e s c e n d i n g >( m i d d l e , end , compare )
                      <
   );
   ∗sort_lhs_ascending ;
   ∗sort_rhs_descending ;
   execution_context bitonic_merge_all (
             p o o l < j o i n a b l e ()<<merge<d i r e c t i o n >( b e g i n , end , compare )
                      <
   );
   ∗bitonic_merge_all ;



           As the target processor may be more sophisticated than a
           comparator, we could use a serial sort for a minimal sub-range,
           based upon threading costs of the target architecture,
           terminating the recursion early.
                   We’d avoid using quicksort, as it is O n2 for nearly-sorted
                   inputs, which a bitonic sequence is.
Implementing Batcher’s Bitonic Merge with PPD.

                      Listing 4: Schematic of the merge() method.
   const o u t _ i t e r a t o r middle ( boost : : next ( begin , half_size_of_portion ) ) ;
   const out_iterator output ( middle ) ;
   execution_context s h u f f l e (
                    p o o l < j o i n a b l e ()<< p o o l . s w a p _ r a n g e s (
                             <
                                   b e g i n , m i d d l e , o u t p u t , swap_pred<d i r e c i o n >( compare )
                    )
   );
   ∗shuffle ;
   e x e c u t i o n _ c o n t e x t merge_lhs (
                    p o o l < j o i n a b l e ()<<merge<d i r e c t i o n >( b e g i n , m i d d l e , compare )
                             <
   );
   e x e c u t i o n _ c o n t e x t merg e_r hs (
                    p o o l < j o i n a b l e ()<<merge<d i r e c t i o n >( m i d d l e , end , compare )
                             <
   );
   ∗merge_lhs ;
   ∗ m e r g e _ r hs ;



           Again the target processor may be more sophisticated than a
           simple comparator, we could use a serial sort in a similar
           manner.
                   Although this might increase the complexity with an
                   O (n log (n)) factor.
Some Thoughts Arising from the Examples.


      Internally PPD creates sub-tasks that recursively divide the
      collections into p equal-length sub-sequences, where p is the
      number of threads in the pool, generating a tree of depth
      O (log (min (p, n))).
      The comparison operations on the pairs of elements should be
      thread-safe & must not affect the validity of iterators on those
      collections.
      The user may call many STL-style algorithms in succession,
      each one placing work into the thread pool to be operated
      upon, and later recall the results or note the completion of
      those operations via the returned execution_contexts.
Sequential Operation.
   Recall the thread_pool typedef that contained the API and
   threading model:

             Listing 5: Definitions of the typedef for the thread pool.
   t y p e d e f ppd : : t h r e a d _ p o o l <
                 p o o l _ t r a i t s : : worker_threads_get_work , p o o l _ t r a i t s : : f i x e d _ s i z e ,
                 g e n e r i c _ t r a i t s : : j o i n a b l e , platform_api , heavyweight_threading ,
                 pool_traits : : normal_fifo , std : less , 1
   > pool_type ;


   To make the thread_pool and therefore all operations on it
   sequential, and remove all threading operations, one merely needs
   to replace the trait heavyweight_threading with the trait
   sequential.
           It is that simple, no other changes need be made to the client
           code, so all threading-related bugs have been removed.
           This makes debugging significantly easier by separating the
           business logic from the threading algorithms.
Other Features.


   Able to provide estimates of:
       The optimal number of threads required to complete the
       current work-load.
       The minimum time to complete the current work-load given
       the current number of threads.
   This could provide:
       An indication of soft-real-time performance.
            Furthermore this could be used to implement dynamic tuning
            of the thread pools.
Conclusions.

        A study of the scalability of the sort & merge operations would
        be interesting - this is to be done.
             I believe the algorithmic complexity of the merge() and
             sort() operations is O log2 n log(n) + log (p) .
                                                p

        PPD assists with creating efficient, deadlock and
        race-condition free schedules.
        The business logic and parallel algorithms are sufficiently
        separated to allow relatively simple, independent testing and
        debugging of each.
        All this in a traditionally hard-to-thread imperative language,
        such as C++, indicating that it is possible to re-implement
        such an example library in many other programming languages.
   These are, in my opinion, rather useful features of the library!

More Related Content

Featured

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by HubspotMarius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Featured (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Implementing Batcher’s Bitonic Sort in C++: An Investigation into using Library-Based, Data-Parallelism Support in C++.

  • 1. Implementing Batcher’s Bitonic Sort in C++ An Investigation into using Library-Based, Data-Parallelism Support in C++. October 2011 Jason Mc Guiness, Colin Egan & Russel Winder ppd-bcs-dsg-11@hussar.demon.co.uk c.egan@herts.ac.uk russel@russel.org.uk http://libjmmcg.sf.net/ October 5, 2011
  • 2. Sequence of Presentation. An introduction: why I am here. How do we manage all these threads: libraries, languages and compilers. Very brief introduction of parallel sorting algorithms. An outline of Parallel Pixie Dust (PPD). Outline of implementation using PPD. Conclusion. Future directions.
  • 3. Introduction. Why yet another thread-library presentation? Because we still find it hard to write multi-threaded programs correctly. According to programming folklore. We haven’t successfully replaced the von Neumann architecture: Stored program architectures are still prevalent. The memory wall still affects us: The CPU-instruction retirement rate, i.e. rate at which programs require and generate data, exceeds the the memory bandwidth - a by product of Moore’s Law. Modern architectures add extra cores to CPUs, in this instance, extra memory buses which feed into those cores.
  • 4. A Quick Review of Some Threading Models: Restrict ourselves to library-based techniques. Raw library calls. The “thread as an active class”. Co-routines. The “thread pool” containing those objects. Producer-consumer models are a sub-set of this model. The issue: classes used to implement business logic also implement the operations to be run on the threads. This often means that the locking is intimately entangled with those objects, and possibly even the threading logic. This makes it much harder to debug these applications. The question of how to implement multi-threaded debuggers correctly is still an open question.
  • 5. Parallel Sorting... Parallel sorting has a history: The three Hungarians (Ajtai, Komlos & Szemeredi, AKS) presented an algorithm that takes O (log (n)) time with n processors in 1983, but has impractically large complexity constants. Cole’s parallel merge sort was presented in 1986 and takes O (log (n)) time with n processors, but has much smaller complexity constants. Batcher’s bitonic sort was presented in 1968 and takes O log2 (n) time, but according to Knuth when comparing it to AKS: "Batcher’s method is much better, unless n exceeds the total memory capacity of all computers on earth!” Research has indicated that Batcher’s algorithm has better complexity constants than Cole’s, so we shall focus on Batcher’s algorithm.
  • 6. An Examination of SPEC2006 for STL Usage. SPEC2006 was examined for usage of STL algorithms used within it, sorting is important to one of the benchmarks, so a potential target for parallelism. Distribution of algorithms in 483.xalancbmk. Distribution of algorithms in 477.dealII. 100 100 set::count set::count map::count map::count string::find string::find set::find set::find 80 80 map::find map::find Global algorithm Global algorithm 60 60 Number of occurences. Number of occurences. 40 40 20 20 0 0 copy fill sort find count_if accumulate min_element inner_product transform inplace_merge find_if binary_search find for_each swap find find_if copy reverse remove transform stable_sort sort replace fill count fill_n swap count lower_bound max_element upper_bound unique remove_if nth_element for_each equal adjacent_find Algorithm. Algorithm.
  • 7. Part 1: Design Features of PPD. The main design features of PPD are: It targets general purpose threading using a data-flow model of parallelism: This type of scheduling may be viewed as dynamic scheduling (run-time) as opposed to static scheduling (potentially compile-time), where the operations are statically assigned to execution pipelines, with relatively fixed timing constraints. DSEL implemented as futures and thread pools (of many different types using traits). Can be used to implement a tree-like thread schedule. “thread-as-an-active-class” exists. Gives rise to important properties: efficient, dead-lock & race-condition free schedules.
  • 8. Part 2: Design Features of PPD. Extensions to the basic DSEL have been added: Certain binary functors: each operand is executed in parallel. Adaptors for the STL collections to assist with thread-safety. Combined with thread pools allows replacement of the STL algorithms. Implemented GSS(k), or baker’s scheduling: May reduce synchronisation costs on thread-limited systems or pools in which the synchronisation costs are high. Amongst other influences, PPD was born out of discussions with Richard Harris and motivated by Kevlin Henney’s presentation to ACCU’04 regarding threads.
  • 9. The STL-style Parallel Algorithms. That analysis, amongst other reasons lead me to implement in PPD: 1. for_each() 2. find_if() & find() 3. count_if() & count() 4. transform() (both overloads) 5. copy() 6. accumulate() (both overloads) 7. fill_n() & fill() 8. reverse() 9. max_element() & min_element() (both overloads) 10. merge() & sort() (both overloads)
  • 10. Basic Example of sort(). Listing 1: Parallel sort() with a Thread Pool and Future. t y p e d e f ppd : : s a f e _ c o l l n < v e c t o r <l o n g >, l o c k : : r e c u r s i v e _ c r i t i c a l _ s e c t i o n _ l o c k _ t y p e > vtr_colln_t ; vtr_colln_t v ; v . push_back ( 2 ) ; v . push_back ( 1 ) ; execution_context context ( p o o l < j o i n a b l e ()<< p o o l . s o r t <a s c e n d i n g >(v ) < ); ∗context ; To assist in making the code appear to be readable, a number of typedefs have been omitted. The safe_colln adaptor provides the lock for the collection, in this case a simple EREW lock, only locks the collection, not the elements in it. The sort() returns an execution_context: The input collection has a read-lock taken on it. Released when the asynchronous sort has completed, when the read-lock has been dropped. It makes no sense to return a random iterator.
  • 11. The typedefs used. Listing 2: typedefs in the sort() example. t y p e d e f ppd : : t h r e a d _ p o o l < p o o l _ t r a i t s : : worker_threads_get_work , p o o l _ t r a i t s : : f i x e d _ s i z e , pool_adaptor< g e n e r i c _ t r a i t s : : j o i n a b l e , platform_api , heavyweight_threading , pool_traits : : normal_fifo , std : : less ,1 > > pool_type ; t y p e d e f pool_type : : void_exec_cxt execution_context ; typedef execution_context : : j o i n a b l e j o i n a b l e ; Expresses the flexibility & optimizations available: pool_traits::prioritised_queue: specifies work should be ordered by the trait std::less. Could be used to order work by an estimate of time taken to complete - very interesting, but not implemented. GSS(k) scheduling implemented: number specifies batch size.
  • 12. Implementing Batcher’s Bitonic Sort with PPD. Listing 3: Schematic of the sort() method. const i n _ i t e r a t o r middle ( boost : : next ( begin , half_size_of_portion ) ) ; execution_context sort_lhs_ascending ( p o o l < j o i n a b l e ()<< s o r t <a s c e n d i n g >( b e g i n , m i d d l e , compare ) < ); execution_context sort_rhs_descending ( p o o l < j o i n a b l e ()<< s o r t <d e s c e n d i n g >( m i d d l e , end , compare ) < ); ∗sort_lhs_ascending ; ∗sort_rhs_descending ; execution_context bitonic_merge_all ( p o o l < j o i n a b l e ()<<merge<d i r e c t i o n >( b e g i n , end , compare ) < ); ∗bitonic_merge_all ; As the target processor may be more sophisticated than a comparator, we could use a serial sort for a minimal sub-range, based upon threading costs of the target architecture, terminating the recursion early. We’d avoid using quicksort, as it is O n2 for nearly-sorted inputs, which a bitonic sequence is.
  • 13. Implementing Batcher’s Bitonic Merge with PPD. Listing 4: Schematic of the merge() method. const o u t _ i t e r a t o r middle ( boost : : next ( begin , half_size_of_portion ) ) ; const out_iterator output ( middle ) ; execution_context s h u f f l e ( p o o l < j o i n a b l e ()<< p o o l . s w a p _ r a n g e s ( < b e g i n , m i d d l e , o u t p u t , swap_pred<d i r e c i o n >( compare ) ) ); ∗shuffle ; e x e c u t i o n _ c o n t e x t merge_lhs ( p o o l < j o i n a b l e ()<<merge<d i r e c t i o n >( b e g i n , m i d d l e , compare ) < ); e x e c u t i o n _ c o n t e x t merg e_r hs ( p o o l < j o i n a b l e ()<<merge<d i r e c t i o n >( m i d d l e , end , compare ) < ); ∗merge_lhs ; ∗ m e r g e _ r hs ; Again the target processor may be more sophisticated than a simple comparator, we could use a serial sort in a similar manner. Although this might increase the complexity with an O (n log (n)) factor.
  • 14. Some Thoughts Arising from the Examples. Internally PPD creates sub-tasks that recursively divide the collections into p equal-length sub-sequences, where p is the number of threads in the pool, generating a tree of depth O (log (min (p, n))). The comparison operations on the pairs of elements should be thread-safe & must not affect the validity of iterators on those collections. The user may call many STL-style algorithms in succession, each one placing work into the thread pool to be operated upon, and later recall the results or note the completion of those operations via the returned execution_contexts.
  • 15. Sequential Operation. Recall the thread_pool typedef that contained the API and threading model: Listing 5: Definitions of the typedef for the thread pool. t y p e d e f ppd : : t h r e a d _ p o o l < p o o l _ t r a i t s : : worker_threads_get_work , p o o l _ t r a i t s : : f i x e d _ s i z e , g e n e r i c _ t r a i t s : : j o i n a b l e , platform_api , heavyweight_threading , pool_traits : : normal_fifo , std : less , 1 > pool_type ; To make the thread_pool and therefore all operations on it sequential, and remove all threading operations, one merely needs to replace the trait heavyweight_threading with the trait sequential. It is that simple, no other changes need be made to the client code, so all threading-related bugs have been removed. This makes debugging significantly easier by separating the business logic from the threading algorithms.
  • 16. Other Features. Able to provide estimates of: The optimal number of threads required to complete the current work-load. The minimum time to complete the current work-load given the current number of threads. This could provide: An indication of soft-real-time performance. Furthermore this could be used to implement dynamic tuning of the thread pools.
  • 17. Conclusions. A study of the scalability of the sort & merge operations would be interesting - this is to be done. I believe the algorithmic complexity of the merge() and sort() operations is O log2 n log(n) + log (p) . p PPD assists with creating efficient, deadlock and race-condition free schedules. The business logic and parallel algorithms are sufficiently separated to allow relatively simple, independent testing and debugging of each. All this in a traditionally hard-to-thread imperative language, such as C++, indicating that it is possible to re-implement such an example library in many other programming languages. These are, in my opinion, rather useful features of the library!