A Domain-SpeciïŹc Embedded Language for
Programming Parallel Architectures.
Distributed Computing and Applications to Business,
Engineering and Science
September 2013.
Jason McGuiness
& Colin Egan
University of Hertfordshire
Copyright © Jason McGuiness, 2013.
dcabes2013@count-zero.ltd.uk
http://libjmmcg.sf.net/
Sequence of Presentation.
A very pragmatic, practical basis to the talk.
An introduction: why I am here.
Why do we need & how do we manage multiple threads?
Propose a DSEL to enable parallelism.
Describe the grammar, give resultant theorems.
Examples, and their motivation.
Discussion.
Introduction.
Why yet another thread-library presentation?
Because we still ïŹnd it hard to write multi-threaded programs
correctly.
According to programming folklore.
We haven’t successfully replaced the von Neumann
architecture:
Stored program architectures are still prevalent.
Companies don’t like to change their compilers.
People don’t like to recompile their programs to run on the
latest architectures.
The memory wall still aïŹ€ects us:
The CPU-instruction retirement rate, i.e. the rate at which
programs require and generate data, exceeds the memory
bandwidth: a by-product of Moore's Law.
Modern architectures add extra cores to CPUs and, with them,
extra memory buses to feed those cores.
A Quick Review of Related Threading Models:
Compiler-based such as Erlang, UPC or HPF.
Corollary: companies/people don’t like to change their
programming language.
A profusion of library-based solutions such as POSIX Threads,
OpenMP and Boost.Threads:
Don't have to change the language, nor the compiler!
Suffer from inheritance anomalies and the related issue of entangling
thread-safety, thread scheduling and business logic.
Each program becomes bespoke, requiring re-testing for
threading and business logic issues.
Debugging: very hard, an open area of research.
Intel's TBB or Cilk:
Have limited grammars: Cilk offers a simple data-flow model;
TBB a complex but invasive API.
How to implement multi-threaded debuggers correctly remains an
open question.
Race conditions commonly “disappear” in the debugger...
The DSEL to Assist Parallelism.
Should have the following properties:
Target general purpose threading, deïŹned as scheduling where
conditionals or loop-bounds may not be computed at
compile-time, nor memoized.
Support both data-ïŹ‚ow and data-parallel constructs succinctly
and naturally within the host language.
Provide guarantees regarding:
deadlocks and race-conditions,
the algorithmic complexity of any parallel schedule
implemented with it.
Assist in debugging any use of it.
Example implementation uses C++ as the host language, so
more likely to be used in business.
Grammar Overview: Part 1: thread-pool-type.
thread-pool-type → thread_pool work-policy size-policy pool-adaptor
A thread_pool would contain a collection of threads, the size of which may differ from the
number of physical cores.
work-policy → worker_threads_get_work | one_thread_distributes
The library should implement the classic work-stealing or master-slave work
sharing algorithms.
size-policy → fixed_size | tracks_to_max | infinite
The size-policy combined with the threading-model could be used to
optimize the implementation of the thread-pool-type.
pool-adaptor → joinability api-type threading-model priority-mode_opt comparator_opt
joinability → joinable | nonjoinable
The joinability type has been provided to allow for optimizations of the
thread-pool-type.
api-type → posix_pthreads | IBM_cyclops | ... omitted for brevity
threading-model → sequential_mode | heavyweight_threading | lightweight_threading
This speciïŹer provides a coarse representation of the various
implementations of threading in the many architectures.
priority-mode → normal_fifo (default) | prioritized_queue
The prioritized_queue would allow speciïŹcation of whether certain
instances of work-to-be-mutated could be mutated before other instances
according to the speciïŹed comparator.
comparator → std::less (default)
A binary function-type that would be used to specify a strict weak-ordering
on the elements within the prioritized_queue.
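To make the grammar concrete, a declaration following it might read as below. This is a sketch assembled from the productions above and the trait names used in Listing 1 later in the talk; the two-thread constructor argument is merely illustrative.

typedef ppd::thread_pool<
   pool_traits::worker_threads_get_work,   // work-policy: work-stealing
   pool_traits::fixed_size,                // size-policy
   pool_adaptor<
      generic_traits::joinable,            // joinability
      posix_pthreads,                      // api-type
      heavyweight_threading                // threading-model
   >
> pool_type;
pool_type pool(2);                         // a pool of two worker threads

The optional priority-mode and comparator have been omitted, so the defaults, normal_fifo and std::less, would apply.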
Grammar Overview: Part 2: other types.
The thread-pool-type should deïŹne further terminals for programming convenience:
execution_context: An opaque type of future that a transfer returns and a proxy to the result_type
that the mutation creates.
Access to the instance of the result_type implicitly causes the calling
thread to wait until the mutation has been completed: a data-ïŹ‚ow
operation.
Implementations of execution_context must specifically prohibit
aliasing, copying and assigning instances of these types.
joinable: A modiïŹer for transferring work-to-be-mutated into an instance of
thread-pool-type, a data-ïŹ‚ow operation.
nonjoinable: Another modiïŹer for transferring work-to-be-mutated into an instance of
thread-pool-type, a data-ïŹ‚ow operation.
safe-colln → safe_colln collection-type lock-type
This adaptor wraps the collection-type and lock-type in one object; also providing some
thread-safe operations upon and access to the underlying collection.
lock-type → critical_section_lock_type | read_write | read_decaying_write
A critical_section_lock_type would be a single-reader, single-writer lock,
a simulation of EREW semantics.
A read_write lock would be a multi-reader, single-writer lock, a simulation
of CREW semantics.
A read_decaying_write lock would be a specialization of a read_write lock
that also implements atomic transformation of a write-lock into a read-lock.
collection-type: A standard collection such as an STL-style list or vector, etc.
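As an illustration (a fragment in the style of Listing 2 later in the talk), a thread-safe vector with EREW semantics might be declared as:

typedef ppd::safe_colln<
   vector<int>,                               // collection-type
   lock_traits::critical_section_lock_type    // lock-type: EREW semantics
> vtr_colln_t;
vtr_colln_t v;
v.push_back(1);                               // thread-safe operations on the wrapped vector
v.push_back(2);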
Grammar Overview: Part 3: Rewrite Rules.
Transfer of work-to-be-mutated into an instance of thread-pool-type has been deïŹned as follows:
transfer-future → execution-context-result_opt thread-pool-type transfer-operation
execution-context-result → execution_context <<
An execution_context should be created only via a transfer of
work-to-be-mutated with the joinable modiïŹer into a thread_pool deïŹned
with the joinable joinability type.
It must be an error to transfer work into a thread_pool that has been
deïŹned using the nonjoinable type.
An execution_context should not be creatable without transferring work, so
guaranteed to contain an instance of result_type of a mutation, implying
data-ïŹ‚ow like operation.
transfer-operation → transfer-modifier-operation_opt transfer-data-operation
transfer-modifier-operation → << transfer-modifier
transfer-modifier → joinable | nonjoinable
transfer-data-operation → << transfer-data
transfer-data → work-to-be-mutated | data-parallel-algorithm
The data-parallel-algorithms have been deïŹned as follows:
data-parallel-algorithm → accumulate | ... omitted for brevity
The style and arguments of the data-parallel-algorithms should be similar to
those of the STL. SpeciïŹcally they should all take a safe-colln as an
argument to specify the range and functors as speciïŹed within the STL.
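Putting the rewrite rules together, the transfers below sketch the available forms. The joinable and data-parallel transfers follow Listings 1 and 2 later in the talk; the nonjoinable, fire-and-forget form is inferred from the grammar.

// joinable transfer: captures an execution_context, a data-flow operation.
auto const &context = pool << joinable() << work_type();

// nonjoinable transfer: no execution_context may be captured.
pool << nonjoinable() << work_type();

// data-parallel transfer: an algorithm, with its safe-colln, instead of a single work item.
auto const &sum = pool << joinable()
   << pool.accumulate(v, 1, std::plus<vtr_colln_t::value_type>());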
Properties of the DSEL.
Due to the restricted properties of the execution contexts and the
thread pools a few important results arise:
1. The thread schedule created is only an acyclic, directed graph:
a tree.
2. From this property we have proved that the schedule
generated is deadlock and race-condition free.
3. Moreover, the implementations of the STL-style algorithms are
efficient, i.e. there are provable bounds on both the execution
time and the minimum number of processors required to achieve
that time.
Initial Theorems (Proofs in the Paper).
1. CFG is a tree:
Theorem
The CFG of any program must be an acyclic directed graph
comprising at least one singly-rooted tree, but may contain
multiple singly-rooted, independent trees.
2. Race-condition Free:
Theorem
The schedule of a CFG satisfying Theorem 1 should be guaranteed
to be free of race-conditions.
3. Deadlock Free:
Theorem
The schedule of a CFG satisfying Theorem 1 should be guaranteed
to be free of deadlocks.
Final Theorems (Proofs in the Paper).
1. Race-condition and Deadlock Free:
Corollary
The schedule of a CFG satisfying Theorem 1 should be guaranteed
to be free of race-conditions and deadlocks.
2. Implements Optimal Schedule:
Theorem
The schedule of a CFG satisfying Theorem 1 should be executed
with an algorithmic complexity of at least O(log(p)) and at most
O(n), in units of time to mutate the work, where n would be the
number of work items to be mutated on p processors. The
algorithmic order of the minimal time would be poly-logarithmic, so
within NC, therefore at least optimal.
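As an informal illustration of these bounds, not part of the paper's proofs: for a balanced distribution tree over n work items on p processors, the time to mutate the work behaves roughly as

   T(n, p) = O(n/p + log(p))

which approaches the O(log(p)) lower bound when n is close to p, and the O(n) upper bound when p = 1.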
Basic Data-Flow Example.
Listing 1: General-Purpose use of a Thread Pool and Future.
struct res_t { int i; };
struct work_type {
   void process(res_t &) {}                 // the mutation: writes its result into the res_t
};
typedef ppd::thread_pool<
   pool_traits::worker_threads_get_work, pool_traits::fixed_size,
   pool_adaptor<generic_traits::joinable, posix_pthreads, heavyweight_threading>
> pool_type;
typedef pool_type::joinable joinable;
pool_type pool(2);                          // a pool of two worker threads
auto const &context = pool << joinable() << work_type();   // transfer the work
context->i;                                 // dereference the proxy: waits for the mutation
The work has been transferred to the thread_pool and the
resultant opaque execution_context has been captured.
process(res_t &) is the only invasive artefact of the library
for this use-case.
The dereference of the proxy conceals the implicit
synchronisation:
obviously a data-ïŹ‚ow operation,
an implementation of the split-phase constraint.
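A usage note on that split-phase behaviour (a sketch reusing the names from Listing 1): only the dereference blocks, so independent work may be interleaved between the transfer and the access to the result.

auto const &context = pool << joinable() << work_type();
// ... unrelated work may proceed here while the pool mutates the work_type ...
int const result = context->i;   // only this access waits for the mutation to complete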
Data-Parallel Example: map-reduce as accumulate.
Listing 2: Accumulate with a Thread Pool and Future.
typedef ppd::thread_pool<
   pool_traits::worker_threads_get_work, pool_traits::fixed_size,
   pool_adaptor<generic_traits::joinable, posix_pthreads, heavyweight_threading>
> pool_type;
typedef ppd::safe_colln<
   vector<int>, lock_traits::critical_section_lock_type    // EREW-locked vector
> vtr_colln_t;
typedef pool_type::joinable joinable;
vtr_colln_t v; v.push_back(1); v.push_back(2);
auto const &context = pool << joinable()
   << pool.accumulate(v, 1, std::plus<vtr_colln_t::value_type>());
assert(*context == 4);                      // 1 + 2 plus the initial value of 1
An implementation might:
distribute sub-ranges of the safe-colln, within the
thread_pool, performing the mutations sequentially within
the sub-ranges, without any locking,
compute the ïŹnal result by combining the intermediate results,
the implementation providing suitable locking.
The lock-type of the safe_colln:
indicates that EREW semantics are obeyed for access to the collection,
and is released when all of the mutations have completed.
Operation of accumulate.
[Tree diagram: main() invokes accumulate; its root node distribute_root recursively fans out into distribute nodes labelled s, v or h, and the leaf nodes perform the mutation upon their sub-ranges.]
main(): the C++ entry-point for the program.
accumulate & distribute_root: the root node of the transferred algorithm.
distribute: internally distributes the input collection recursively within the graph; the leaf nodes perform the mutation upon their sub-range.
s: sequential, shown for exposition purposes only.
v: vertical, mutation performed by a thread within the thread_pool.
h: horizontal, mutation performed by a thread spawned within an execution_context; this ensures that sufficient free threads are available for fixed_size thread_pools.
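The earlier claim that the DSEL should assist in debugging can be sketched as follows; this is an inferred example, not taken from the slides. Because the threading-model is part of the pool's type, substituting sequential_mode for heavyweight_threading would recompile the same program into a single-threaded form that can be debugged conventionally.

typedef ppd::thread_pool<
   pool_traits::worker_threads_get_work, pool_traits::fixed_size,
   pool_adaptor<generic_traits::joinable, posix_pthreads, sequential_mode>
> debug_pool_type;   // same client code; all transfers would now mutate sequentially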
Discussion.
A DSEL has been formulated:
that targets general purpose threading using both data-ïŹ‚ow
and data-parallel constructs,
ensures there should be no deadlocks and race-conditions with
guarantees regarding the algorithmic complexity,
and assists with debugging any use of it.
The choice of C++ as a host language was not special.
Result should be no surprise: consider the work done relating
to auto-parallelizing compilers.
No need to learn a new programming language, nor change to
a novel compiler.
Not a panacea: program must be written in a data-ïŹ‚ow style.
Expose an estimate of the threading costs.
Testing the performance with SPEC2006 could be investigated, perhaps
on alternative architectures: GPUs, APUs, etc.
  • 15. Discussion. A DSEL has been formulated: that targets general purpose threading using both data-ïŹ‚ow and data-parallel constructs, ensures there should be no deadlocks and race-conditions with guarantees regarding the algorithmic complexity, and assists with debugging any use of it. The choice of C++ as a host language was not special. Result should be no surprise: consider the work done relating to auto-parallelizing compilers. No need to learn a new programming language, nor change to a novel compiler. Not a panacea: program must be written in a data-ïŹ‚ow style. Expose estimate of threading costs. Testing the performance with SPEC2006 could be investigated. Perhaps on alternative architectures, GPUs, APUs, etc.