
ThreadModel rev 1.4


Threading and Concurrency in Proton Core™
A reusable C++ base model for concurrent and asynchronous computing
Revision 1.4
Copyright © 2015 Thread Concepts, LLC. All rights reserved.
Revision 1.4, September 2015
ISBN: 978-0-9908699-1-7
Author: Christopher Cochran, creator of Proton Core™
Contact: chris.cochran.protoncore@gmail.com

Product names appearing in this manual are for identification purposes only; any trademarks, product names or brand names appearing in this document are the property of their respective owners. Any function names and other programmatic identifiers that happen to match trademarks or brand names are purely coincidental, and relate only to the local concepts described herein. Proton Core™ is a trademark of Thread Concepts, LLC. Any instances of the name “Proton” found in this and related documentation and source code refer to “Proton Core™”. Any specifications contained within this document are subject to change without notice, and do not represent any commitment by the manufacturer to be complete, correct, finalized, or necessarily suitable for any particular purpose.

Digitally signed by Thread Concepts, LLC (serialNumber=2zlmm5pwjvz5ghg0, c=US, st=California, l=Fairfax, o=Thread Concepts, LLC, cn=Thread Concepts, LLC), 2015.09.26 12:48:09 -07'00'.
Threading and Concurrency in Proton Core™

Background

From even before the introduction of Visual Basic, Java and Visual Studio, software development has typically been based on top of some unified supporting API, or Universal Library, with broad coverage across computation, storage, networking and i/o services. Microsoft has done an exceptionally good job at this, producing APIs like Win32, MFC, ActiveX, Visual Basic, CLR, C#, .Net and others. The trouble is, relying on such vast software systems comes with some unexpected costs, risks and ramifications, including:

• Today’s .Net and Java systems lock you into a garbage-collection programming model, an approach with known uneven performance (denied by proponents), for the most critical, essential computational resource of all: operational memory. Combine that with a pseudo-code virtual machine, and you have a system guaranteed to produce mediocre performance now and on into the future (despite additional denials). Such systems give up higher performance in order to achieve internet portability, a useful and common tradeoff.

• Without a reusable higher-level threading and shared-memory strategy, multithreaded solutions are often newly reinvented and built up from lower-level constructs for each application. Although this can work well, it can also lead to unexpected complexities, disappointing concurrent processor utilization and creeping development schedules.

• Many foundation APIs can be replaced with better methods, resulting in higher performance from software depending on them, including memory management, string processing, sorting, data conversion and transformation, and others.

• Reusing software you have already built in future developments is necessary to control development costs and schedules.
But this process is prematurely and regularly disrupted by the push to move onto the “new wave”, rendering years of good work artificially useless when it is not immediately portable to the “new system”. You are often left to your own devices to build any bridges back to your valued past work, when that is feasible at all.

• The compatibility between C and C++ led over the years to a “race to the bottom”, with most libraries and frameworks developing in the lowest common denominator, C. This has slowed the adoption of the class-based object-oriented methods available in C++, foregoing its superior reusability, unlimited logical stratification and simpler management of large applications, compared with C. In C, your thinking is dominated by the implementation; in C++, your thinking is dominated by the problem domain at hand.

• Due to historical sequencing, treating strings as character arrays is firmly entrenched in many string-processing systems, including the standard strings of C and C++. The trouble is, that approach only works when characters are all the same width, while Unicode characters are variable width. This is a serious flaw, a collision with the past that causes a variety of bugs in many installed software bases world-wide to this day. Unicode strings in Proton avoid this problem, with true character-oriented strings and processing services.

• High-profile, commonly used application models, from Microsoft and others, are studied by hackers for vulnerabilities to attack or exploit. Products based upon independently developed logical systems are not as commonly or quickly assimilated.
Because of these and other characteristics, some vendors have developed smaller, faster, more focused and domain-specific application-supporting infrastructures. Some systems concentrate more on the fundamentals, use non-proprietary languages, support composition and specialization, and play well with others. Proton over C++ provides this kind of application support.

Proton is a C++ framework that arose at the end of the “free lunch” era, when clock rates stopped doubling every two years. Proton was developed to advance performance still further by combining the most critical elements together for tighter interaction: memory management, multithreading and event processing. Over that substructure, Proton provides mid-level data processing services, including comprehensive treatment of strings, arrays, vectors and file access. Expressions of aggregate data perform iterative computation without loops or subscripting. Strings and arrays dynamically manage themselves to automatically provide memory space based on present needs. These services are designed as an everyday programming model with more concise logic and shorter code. Move-constructor technology eliminates the temporary-value problem, making expressions of aggregate values practical and highly efficient.

Preface

This document provides a summary of the asynchronous programming services provided by the Proton framework. While the product manual goes into more depth and detail across all topics of using Proton, this discussion focuses on the architecture and motivation for the Proton concurrency solutions now running. Proton presents essentially the same programming models and APIs under Windows and Linux. Proton applications compile and run under both operating systems, and provide all the same services. Proton uses and relies on the Windows API over MS Windows, and on Winelib over Linux.
Proton makes space for both the logical flexibility of multithreading and the opportunities for concurrency it brings. One goal is to be able to write application logic in C++ that easily scales in performance with the number of processors available to run it. This goal might be easy and off the shelf if turning a pile of threads loose in your application were all there was to it. But it’s not that simple: as soon as different threads begin interacting with the same data, and with each other, synchronizing data transactions becomes necessary, opening the door to delays, deadlocks, performance degradation, and interlocking complexities. The locking methods that work so well to control transaction access can deadlock, don’t scale well to large numbers of concurrent transactions, and can severely degrade transaction performance when used for finer-grained synchronization.

The multithreaded event processing model in Proton is built from a thread reactor pool combined with active objects, virtual blocking, dynamic resource management, lock-free transactions, RAII orientation, and other well-founded supporting models and design patterns. The resulting Actor model effectively unifies multithreading with event processing in C++, supports it with custom per-thread allocation services designed for this work, and scales up to the available processors and across active threads for as far as virtual memory will stretch. For applications large and small, Proton runs best in 64-bit, where virtual memory is practically limitless and 64-bit logic can be much faster.

Proton applications consist of some set of independent threads, each with its own independent role to play, that often collaborate on work initiated by one another. An application begins with a main thread and a Windows message pump thread, with additional threads started as needed for the application. Threads have names and can start or find each other, and use native non-blocking event-based communication among themselves. Passing tasks to others is an effective way to spread work among processors, while maintaining asynchronous operation throughout.
C++ object model

Threads, tasks and events are objects with virtual behavior that you define. Objects give you something to create and lifetimes to manage, with member functions local to the object and out of the global space. Threads start up when you post events to them, stay alive with AddRef ( ) calls, and retire upon final Release ( ). All of the handles, identifiers, operating system calls and other details required are encapsulated within these objects, making them simple and low-noise to use, portable into other environments, and bullet-proof. Proton base classes encapsulate the layered complexities of multithreading, event processing and dynamic allocation services, so that client application logic can focus on its own specific activities. Virtual functions are defined on objects everywhere to support custom operations and choices. A variety of Proton debugging hooks and procedures are provided to make all of these pieces practical to use in the face of rough-and-tumble development and imperfect programming. Defaults exist everywhere to do something reasonable until you choose to do something more.

Free-running thread roles

The model for managing multithreaded complexities starts with a set of threads acting out their own independent roles, roles of your choice that you program. Threads in Proton are defined by the data and behavior ( ) that you put into them, as specified in the thread classes you define over Proton thread base classes. Proton provides a thread-local free-running environment for unimpeded thread execution, free from interference from other threads, each thread computing generally in its own logical data space, atomic-free, nearly all of the time. When thread roles communicate by posting non-blocking events and tasks among themselves, you have the Actor Model, a strategy that first appeared in Erlang, a language from the 1980s that was far ahead of its time.
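The free-running role described above can be sketched in portable C++. This is not Proton’s API: the class name MiniActor, its post ( ) method and the mutex-guarded mailbox are all invented for illustration (Proton uses lock-free queues and reference counting instead), but the shape is the same: a dedicated thread draining posted work in order, with non-blocking posts from any other thread.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Minimal actor sketch: one thread services a mailbox of tasks in posting
// order. Names and internals are invented; illustrative only.
class MiniActor {
public:
    MiniActor() : worker_([this] { run(); }) {}
    ~MiniActor() {                              // drain remaining tasks, then exit
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    // Non-blocking post: the caller never waits for execution.
    void post(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(m_); mailbox_.push(std::move(task)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !mailbox_.empty(); });
                if (mailbox_.empty()) return;   // done_ set and mailbox drained
                task = std::move(mailbox_.front());
                mailbox_.pop();
            }
            task();                             // runs in the actor's own thread
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> mailbox_;
    bool done_ = false;
    std::thread worker_;                        // declared last: starts after the mailbox exists
};
```

Because only the actor’s own thread ever runs the tasks, the data those tasks touch needs no further locking, which is the containment idea developed later in this document.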
Posting tasks and events

When one role needs something to happen in another role, it posts events or tasks to that role, or to an object managed by that role. These tasks are serviced by the receiving thread and run in the order posted. Tasks and events can do anything in the receiving thread context, including reposting themselves or posting additional tasks to other recipients, and finally auto-releasing after use. Tasks in Proton are defined by the data and action ( ) that you put into them, as specified in the task classes you build over Proton task base classes. Custody of task data changes hands from one thread to the next, and is normally assumed to be unaccessed by others, to provide unfettered non-atomic access to thread, task and event data, wherever it goes. Tasks can finish after one use, live on to be posted to another thread, or go back to the thread that posted them. Many degrees of freedom exist here.

Task and event processing

In order to be responsive to events from other thread roles, threads must process their events from time to time. This is performed implicitly while blocking for completions or time-outs, or explicitly whenever needed, by calling a DoEvents ( ) function to immediately get up to date. Each call services all accumulated work available for the calling thread. Each thread controls its own event latency by how often it is able to process the events and tasks it receives. It also chooses where in its logic to service things, to tightly control the order and placement of local completions.
Busy threads choose their moment to service their own events and tasks, and service them in the order received. Blocking becomes just another excuse to service thread-local events and tasks. Threads only block when they have nothing better to do, but otherwise are free to go about their own independent business. In this way, all threads become event processors, an arrangement that is fast and flexible, that supports a wide domain of multithreaded application models, and that can scale to hundreds of threads in the 32-bit model (more in 64-bit), across any number of processors.

Concurrent central workheap

The concurrency available from this multithreaded event model can sometimes be more coincidental than the kind of focused high-utilization concurrency we would like to see. So for that, threads may also post tasks to the application-wide workheap, where they are immediately processed by all available processors, as directed by processing load. This capability is similar to the ExecutorService in Java, and comes preassembled and ready to accept new tasks, with immediate and aggressive performance and low overhead, fully integrated into the Proton programming model. The Proton workheap is always on, and its automatic processing doesn’t wait: it starts the moment the workheap goes non-empty.
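The workheap idea, a shared pool of unordered tasks drained by several service threads, can be sketched with standard C++. The class name WorkHeap and its postevent ( ) method are borrowed from this document for flavor, but the internals below are an invented stand-in: Proton’s real workheap is lock-free, phase-aware, and recruits blocking threads rather than owning a fixed pool.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Always-on workheap sketch: a shared queue of unordered tasks, drained by
// a fixed pool of service threads the moment it goes non-empty.
class WorkHeap {
public:
    explicit WorkHeap(unsigned nthreads) {
        for (unsigned i = 0; i < nthreads; ++i)
            pool_.emplace_back([this] { serve(); });
    }
    ~WorkHeap() {                               // drain the backlog, then exit
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : pool_) t.join();
    }
    void postevent(std::function<void()> task) { // processing starts at once
        { std::lock_guard<std::mutex> lk(m_); heap_.push(std::move(task)); }
        cv_.notify_one();
    }
private:
    void serve() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !heap_.empty(); });
                if (heap_.empty()) return;      // shut down only when drained
                task = std::move(heap_.front());
                heap_.pop();
            }
            task();                             // any order, any service thread
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> heap_;
    bool done_ = false;
    std::vector<std::thread> pool_;
};
```

Note that nothing here orders the tasks relative to one another, which is exactly why, as the next pages explain, only independent tasks belong in a workheap.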
Proton manages this and makes it practical and robust, within an overall discrete workflow approach that may best be described by considering more of the elements involved:

Task and event workflow

• threads package up events and tasks with the data necessary for others to process
• client threads post unordered events and tasks; service threads process them
• tasks go into light-weight lock-free queues with low internal latency
• tasks usually don’t need critical sections unless required by application data
• task- and event-based workflow passes between and among threads
• a task is an action ( ) with the data it needs, that you hand off to run in another thread
• event response is defined by its recipient; task action ( ) is defined by its sender
• you can post tasks anytime, and they can start executing immediately upon arrival
• a completion task may run after completing a group of tasks posted to different threads
• threads normally carry on independently: block, compute, perform i/o and other duties
• work tasks are buffered, submitted and processed in single-cacheline units called worksets
• posting work tasks is atomic-free; worksets are internally posted wait-free and unbounded
• event and task objects are faster, more manageable and more generic than message passing

Concurrent processing opportunities

• multiple threads acting in specific roles while servicing and posting tasks between them
• multi-stream tasks to a global workheap for automatic multi-processor servicing
• large tasks can recursively split into smaller pieces and repost to other processors
• event-driven scheduling and completion of concurrent external processes
• use completion tasks to post and orchestrate the next set of concurrent tasks
• multiple completion groups may proceed concurrently
• event producers provide hints (size, ordering, priority, etc.) for better service routing
• work may be posted in phases, such that phase (i) is finished before phase (i+1) is started
• phase barriers may limit the extent to which concurrency can be achieved
• larger work phases allow for more concurrency than smaller phases
• the number of processors servicing the workheap follows the workload

Virtual blocking

• the virtual blocking state puts waiting threads to work servicing events and tasks
• blocking threads maintain a pool of themselves, for use as concurrent service threads
• hard-blocked threads are wakeable for multiprocessing service
• threads (hard) block only when they have nothing else to do, when work really has run out
• files opened in ASYNC mode continue servicing events while blocking during file i/o

Notifiers

• notifiers allow posting threads to define actions later triggered by service completions
• a notifier is for client-specific actions that necessarily originate in the service thread
• notifiers can be defined by any thread, for other threads to activate like signals
• servicing threads can notify posting threads when their local work completes
• invoking undefined notifiers is harmless and does nothing; redefinition replaces them
• unlike event responses, which run in the target thread, notifiers run in the notifying thread
• notifiers are logically similar to events but more direct and immediate
• notifiers can post events or directly take any action as required
• multi-threaded precautions are often necessary in notifier responses

Containment and Tasks

Much of the Proton approach is designed to promote containment for managing most things, to give your code normal non-atomic access to objects and data nearly all the time. Containment just means that objects have current-thread affinity, and only one thread may access them at any given time.
If I have such an object, I can modify it all I want, then give it to you, and then you can access and modify it all you want: correct, fast and simple. This is how tasks work, and posting a task to another thread moves the task object from one thread’s containment to another’s. Task code needs no further synchronization unless it is introduced by special needs of its application logic. An effective containment policy generally reduces complexity and improves performance. Well-written Proton applications take advantage of the containment model as a base approach, helping to keep multithreaded hard transactions relatively rare. Proton memory management supports this, and allows a memory acquisition and its later release to occur in different threads as needed. Proton supports multi-threaded transactions most often by encapsulating their complexity within objects that are cleaner to use in everyday programming. All the Proton atomics and related structures and APIs are also available for use in pursuit of your goals.
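The containment hand-off can be expressed in standard C++ with move-only ownership: only one thread holds the object at a time, so the object itself needs no locking, and only the hand-off channel is synchronized. The Channel and Report names below are invented stand-ins for Proton’s task queues.

```cpp
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <string>

// Containment as ownership transfer: a unique_ptr moves through a small
// synchronized channel, and whichever thread holds it gets plain,
// non-atomic access to the object. Names are invented; illustrative only.
struct Report {
    std::string text;
    int lines = 0;
};

class Channel {
public:
    void send(std::unique_ptr<Report> r) {       // give the object away
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(r)); }
        cv_.notify_one();
    }
    std::unique_ptr<Report> receive() {          // block until one arrives
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        auto r = std::move(q_.front());
        q_.pop();
        return r;                                // caller now owns it outright
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::unique_ptr<Report>> q_;
};
```

Only the queue operations take the lock; reading and writing the Report before send ( ) and after receive ( ) is ordinary single-threaded code, which is the point of the containment policy.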
More on posting events and tasks

Threads post events and tasks for other threads to perform. An event is a message block where the recipient defines the response; a task is one where the sender defines the response. These transactions are serviced synchronously, at a moment chosen by the receiving thread, by calling its thread-local DoEvents ( ) function. This call is made implicitly when blocking for timeouts or other wait conditions. Threads may explicitly call DoEvents ( ) to process all events and tasks that have queued up since the last such call. Threads may choose to put off calling DoEvents ( ) until they need to bring thread-local data state up to date. When thread-local work dries up, DoEvents ( ) processes available work in the concurrent workheap, along with other threads so inclined, up to the available processors. Inattentive threads can call DoEvents ( ) interspersed within any logical process to improve event processing latency, as needed. Each call processes all pending activity and returns quickly when there is nothing to do (under 100 machine cycles).

An event is a spontaneous occurrence that requires someone’s attention and often a response, e.g. a key is struck on the keyboard, or a request arrives from a remote procedure call. Events are handled differently depending on the type of event, who sends it and who receives it. A task is an event that contains a specific response, to be played upon opening by its servicing thread. You create tasks in order to post them to other threads to perform, and as such, they may contain any data necessary to their completion. Tasks have less overhead than events and remain efficient with small payloads down to a few hundred machine cycles. Events are normally processed in the order they were posted; tasks may or may not have ordering requirements, depending on the tasks and their context.
Some additional related points include:

• Events are data objects you send to recipients, who expect them and provide their own response
• Tasks are like events, but define their own response for the recipient to run upon opening
• Concurrent data transactions in events/tasks must provide their own synchronization as needed
• Transactions performed completely within specific threads don’t need synchronization
• Tasks require at least 250 machine cycles to create, initialize and post for servicing

Events and tasks are posted to threads and stored in queues for subsequent servicing in the order they arrived. Windows, threads and other active objects each have their own event queues, so there is no global queue being maintained anywhere. Events are handled by responses defined by the active objects to which those events were posted. Responses are typically defined for a complete set of events, which is created and selected into the active object involved. DoEvents ( ) applies the correct response for each event it services, and executes each task directly in the order they arrive. On completion, calling Release ( ) on the task/event finishes the transaction. Tasks can be defined easily and directly:

    class SimpleTask : public TaskEvent {       // tasks are simple and straightforward to define
        Integer value;
        String string;                          // define whatever data your task needs to run
    public:
        SimpleTask (Integer i, String s) : value(i), string(s) { your_init ( ); }
        virtual void action ( ) { do_something ( ); }   // this is what you want the task to perform
    };
Later, you can post a new task in this manner:

    thread->postevent (new SimpleTask (666, "test"));

Any number of these may be created and posted like this. Each task is run by the receiving thread, followed by task->Release ( ) on completion. Tasks generally find their operational data in, or referred to in, the task object, as set up by its own constructor. Task objects can live on for repeated posting, by calling task->AddRef ( ) for each additional cycle before reposting them (normally with further set-up). Ordered tasks and events may be posted to active objects and threads, by any thread at any time. Each thread only attends to servicing its own queue(s), allowing others to do the same. There are no particular limits on the number of threads that may participate in this manner, nor limits on event queue capacity, other than the available memory resources and enough processors to move things along.

Event coordinators

Posting tasks to specific threads is simple, fast and direct. But sometimes this can be too direct, as it expects posting threads to know what specific threads to use. This is fine for simple applications, but can introduce unwanted task-management dependencies in more complex situations. The solution to this is the EventCoordinator abstraction, for indirectly posting tasks which are then (or later) routed to specific destination threads. Participating end points register themselves with known coordinator objects, which are programmed to distribute incoming tasks to appropriate threads or other end-point objects, without having to know the specific threads ahead of time. There are many interesting ways to use this abstraction. A coordinator can choose an under-utilized end point over those that are saturated, to balance loads and increase concurrency. A coordinator can hide logical distances between threads, routing some over networks and others to local threads or processes.
Internal task routing policies can be modified within coordinators without changing the participating client thread logic. Client threads can be isolated from details having security or proprietary sensitivities, while allowing them to run freely, without complications. Like many objects in Proton, coordinator lifetimes are reference-count managed, so they operate for as long as they are being used and are automatically freed when not.

Task interthreading

Normally, tasks live to execute their action ( ) just once, followed by Release ( ), and they are gone. But it can be useful in a task’s action ( ) to be able to call a switch_to (thread) function, to allow specific threads to contribute their own pieces to a larger task assembly. Task interthreading is simpler and faster than generating a new task for every thread switch along the way, because there is only one task object passed around for the duration. This is possible because tasks provide functional closure, designed to be posted to, and run in, other threads. Until the task action ( ) itself finally returns, your code can switch from thread to thread, running different code in each thread it visits, along an arbitrary finite state machine. Such a sequence can itself be considered a virtual thread, and arbitrarily many such threads can be active simultaneously, over no more actual threads than the available processors permit.
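Interthreading can be approximated in standard C++ as one task object that reposts its continuation to another thread’s queue at each stage, rather than creating a new task per hop. The Mailbox and HopTask names below are invented; switch_to (thread) here becomes an explicit post of the next stage to the other mailbox.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Interthreading sketch: one task object hops between two worker threads,
// running a different stage on each, until its little state machine ends.
// Names and queue internals are invented stand-ins for Proton's.
struct Mailbox {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::function<void()>> q;
    bool done = false;
    void post(std::function<void()> j) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(j)); }
        cv.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    }
    void serve() {                              // run by the worker thread
        for (;;) {
            std::function<void()> j;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return done || !q.empty(); });
                if (q.empty()) return;
                j = std::move(q.front());
                q.pop();
            }
            j();
        }
    }
};

struct HopTask {
    std::vector<char> trail;                    // which worker ran each stage
    std::promise<std::vector<char>> result;
    void run(char who, Mailbox& self, Mailbox& other) {
        trail.push_back(who);                   // this stage's piece of the work
        if (trail.size() == 4) {                // state machine finished
            result.set_value(trail);
        } else {                                // "switch_to(other)": same task
            Mailbox* next = &other;             //  object, continued over there
            Mailbox* back = &self;
            char nextWho = (who == 'A') ? 'B' : 'A';
            next->post([this, nextWho, next, back] { run(nextWho, *next, *back); });
        }
    }
};
```

Only one task object exists for the whole sequence, and because each stage runs while no other thread touches the object, its trail member needs no synchronization.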
TaskGroups and completion tasks

A TaskGroup is a temporary object that associates a set of tasks, of arbitrary number, with a common completion task, posted to the thread of your choice after all dependent tasks have been completed. The included tasks are those posted from the calling thread, to other threads for servicing, while the TaskGroup object is active. The posting logic and the tasks themselves require no change to be used in this manner, and no knowledge of the group is required by the included tasks to function properly. Hence you can group the tasks of arbitrary existing task logic behind a common completion task, from the outside. Multiple groups of completing tasks can run like this independently and concurrently. A likely use for task groups is to group multiple tasks sent to the Proton workheap with other tasks, behind a single completion task that safely acts once those tasks have completed. For example, computing a screen graphic using multiple task threads of the workheap normally requires a completion task to invalidate the window rectangle, but only after the results all stabilize. The completion task is ultimately executed by a chosen target thread. Completion tasks can start up other sequences with their own completion tasks, and march on indefinitely over the prevailing thread activity.

Virtual blocking

Each thread carries on its own process, independent of other threads. When a thread blocks for input, time delays, or other application-level wait conditions, it uses a virtualized blocking state. To software, this state appears like normal blocking, except that during this period the thread may continue to run, staying productive performing other activities while waiting for the logical block to release.
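The core of a completion task is a shared countdown: whichever task finishes last fires the completion action, regardless of which threads the tasks ran on. The CompletionGroup name below is invented; a sketch of that mechanism, not Proton’s TaskGroup implementation, looks like this:

```cpp
#include <atomic>
#include <functional>

// Completion-task sketch: the last of N tasks to finish runs a common
// completion action, however the tasks were spread across threads.
// A generic pattern with an invented name; Proton's TaskGroup adds
// posting the completion to a chosen target thread.
class CompletionGroup {
public:
    CompletionGroup(int count, std::function<void()> onDone)
        : remaining_(count), onDone_(std::move(onDone)) {}
    // Each grouped task calls this exactly once when its work is done.
    void taskFinished() {
        if (remaining_.fetch_sub(1) == 1)   // last task out fires completion
            onDone_();
    }
private:
    std::atomic<int> remaining_;
    std::function<void()> onDone_;
};
```

In the screen-graphic example above, onDone_ would be the “invalidate the window rectangle” step, safe to run because every contributing task has decremented the counter first.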
Such activities include:

• servicing thread-local events and tasks arriving in its queues
• thread-centric housekeeping, resource management, maintaining runtime statistics
• servicing the application concurrent workheap, alongside other processing threads
• performing potential-deadlock detection and other run-time integrity testing
• really blocking when there is nothing else to do, but remaining wakeable for workheap service

In this environment, threads stay productive longer, more fully utilizing their time-slice, and support concurrent tasks, up to the number of available processors. Porting single-threaded software into multi-threaded designs can benefit from this approach. The request-block-continue style prevalent in single-threaded designs can often remain a sensible logical model, as long as its blocking state can be made productive and stay out of (deadlock) trouble.

Proton makes no attempt to “hook into” the blocking mechanisms provided by the operating system to provide virtual blocking capabilities. Rather, you explicitly substitute Proton blocking methods for the standard waiting functions, wherever you desire virtual blocking services. Proton provides WaitForObject ( ) and WaitForAnyObject ( ) functions to use for blocking with time-outs on waitable objects, like threads, events, signals, processes, etc. Virtual blocking works with any object on which you would normally call WaitForSingleObject ( ) and other related functions, with similar argument structure and return values, so using them in existing C++ code is pretty painless.

Combining SignalReactors with the DoEvents ( ) architecture adds significant flexibility to Proton event handling. These are tasks that you define and post to install a prepared signal and response into any thread’s DoEvents ( ) handler. This is designed to enable a custom DoEvents ( ) response to some set of signals that become active while a thread processes its regular task and event traffic. It can be used for any
thread-specific application signals whose responses fit this model. It services its signal until either you remove ( ) it or its thread exits.

Activity monitor

Pressing ctl-alt- into a running Proton application window brings up a side-window into the immediate activity and statistics across its running threads. This includes thread names, event latency, atomic retry contention, critical section collisions, exceptions thrown, memory allocated, timers active, number of windows and controls, etc. It helps you identify threads that appear to be non-responsive, modal, stuck in a loop or creeping recursion, or leaking memory, as well as threads with latency to spare and those barely awake. It shows unprocessed work levels, events posted and processed, event latency and relative thread loads. Putting the mouse over any of its data columns elicits a short description of it, making it largely self-explanatory.

The monitor lets you view your multiprocessing operation in action, to see that key indicators are occurring as they should, in real time. Its display runs continuously until you terminate it, and may be viewed in any Proton application that allows it to be viewed, enabled by default. The monitor state at application exit is restored when the application starts up the next time. It becomes invaluable during the debug, shakeout and testing phases of multi-threaded software development, showing significant internal run-time characteristics not available from debuggers or external process viewers. Its event-driven operation incurs a negligible 0.1% machine load when running, and acts as a continuously running Proton self-test, to demonstrate core integrity during any running Proton application session.

Multiprocessing unordered tasks

The application workheap holds batches of tasks to be multiprocessed at the highest achievable processing rates. Proton puts idle application threads to work processing the earliest work phases, up to the number of available processors.
When there are not enough available and willing threads to service the workload, additional threads are started automatically, which run until the workload is consumed, then exit after a few moments of quiescence. But you can’t just stuff anything you like into the multiprocessing workheap. Events and tasks that logically must follow the completion of others must all be posted together, to a specific thread, to guarantee the servicing order they require. The multiprocessing workheap is intended only for tasks that can be processed independently, in any order within a work phase, and that make little or no use of critical sections. See the workheap->reserve(n) function for a method to preserve fine-grained task order in the workheap.

Normally, events and tasks are posted for single-threaded servicing in the order posted, by the receiving thread. But when you mark an event “unordered” and post it, postevent ( ) routes it to the multiprocessing workheap, where it goes into the current work phase. Calling the function workheap->postevent(task) also puts a task into the calling thread’s active work phase. The act of posting unordered events and tasks advertises a thread’s work for discovery by active service threads, looking to find and process any available workload. Other threads will often be similarly engaged. The service threads are merely those that happen to call DoEvents ( ), or those that were blocking and were reawakened to be put into service. Neither the posting threads nor the service threads need any particular awareness of this process; it just happens. Tasks are initialized as “unordered” or not, so any thread posting them can implicitly participate in the process.
Processor considerations

Proton applications can operate on just one or two processors, but things get more interesting with more. With four processors well utilized, you can expect throughput resembling 3.8 effective processors, and higher when memory contention stays low. Similar ratios should hold for 6 and 12 processors. Proton will use all the physical CPUs made available to it, excluding hyperthreads.

Hyperthreading is supported, doubling the number of schedulable processors when HT is active. But since each pair of HT processors shares one physical core, they do not typically add further performance, and can run a bit slower with twice the threads doing half the work. Because of this, the workheap limits itself to available physical processors, even when HT is active. The workheap‑>exclude (n) function lets you reduce the effective processor count even further, to keep other things running the way you want. Specific threads that cannot tolerate any extra delays can mark themselves with thread->is (RealTime) to keep from being conscripted into service by the workheap. When Proton detects that newly added processors are under-achieving, those processors can automatically back off and try again later. This reduces the likelihood of system overload and degradation of real-time response.

Splitting up work

In Proton, individual tasks are not processed by multiple threads; rather, tasks are usually single-threaded, and many tasks are distributed among multiple threads. To use the multiprocessing engine more effectively, the work you run through it can be self-splitting, or already split into some number of smaller chunks. This characterization is intentionally vague, because there is a bit of latitude here, but there is also a range. Generally, dividing the work into hundreds of subtasks works well, particularly when their sizes are roughly the same.
Larger tasks that are marked Immediate also divide well among concurrent threads, but enough of them must still be posted before all processors will ramp up. Statistical load balancing becomes effective when the total task count is much higher than the thread count. If the chunks are few, or of vastly different sizes, concurrency tuning may show gains by “gear shifting”. A workset with one large task can represent the same load as a workset with many small tasks. Adjusting for that (when viable) helps to balance the work load among processors. There are several control points:

•• Mark large tasks as Immediate in their constructors (which are best positioned to know). Such tasks go out to the workheap immediately upon posting, along with any others present in the current workset.

•• Call workheap‑>postevent (0) to flush the task buffer immediately, increasing the number of worksets to divide among hungry service threads.

•• Set the cutoff point in the current thread for workset posting by calling workheap‑>dwell (n), so that when the workset task count reaches n, it is moved to the multiprocessing workheap. Set n low for large tasks, high for small tasks; the default is 1.

•• Tasks can divide their work by reposting portions of themselves as new tasks to the workheap for others to similarly act on. This cascades across the available workheap processors and quickly ramps up processing to full capacity until completion.

It takes worksets lining up before a second processor is added; the load must keep building to add a third, and so on. If you generate too many tasks, that is OK: the work just queues up and service threads crank through it until the backlog subsides.
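The dwell and flush control points above can be sketched as a small buffer model. This is illustrative standard C++ under assumed semantics (Proton's internals are not published here): tasks accumulate in a per-thread workset until the dwell threshold is reached, and posting a zero task flushes the buffer to the workheap early.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the per-thread workset buffer (hypothetical layout):
// worksets move to the workheap when the dwell cutoff is reached,
// or when postevent(0) flushes them explicitly.
struct WorksetBuffer {
    std::vector<int> buffered;                  // pending task ids
    std::vector<std::vector<int>> workheap;     // delivered worksets
    std::size_t dwell_limit = 1;                // default of 1, per the text

    void dwell(std::size_t n) { dwell_limit = n; }
    void flush() {
        if (!buffered.empty()) { workheap.push_back(buffered); buffered.clear(); }
    }
    void postevent(int task) {
        if (task == 0) { flush(); return; }          // postevent(0) flushes
        buffered.push_back(task);
        if (buffered.size() >= dwell_limit) flush(); // dwell cutoff reached
    }
};
```

Setting dwell high amortizes posting costs for small tasks; setting it low (or flushing) gets large tasks in front of service threads sooner, which is exactly the trade-off the bullets describe.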
When tasks are hefty enough by themselves, setting task‑>is (Immediate) before the posting sequence will immediately put them into the active work phase, by themselves if necessary, without waiting for more tasks to stack up. This exposes more opportunities for concurrent processing with large task sequences. Avoid using Immediate on small tasks, allowing them to buffer normally for higher bandwidth.

Not splitting up work

Sometimes you want to ensure that a short sequence of tasks will be scheduled together, as a unit later processed in order by one thread, but still by the workheap (e.g. you might have many such sequences to work off). Calling workheap‑>reserve (n) will do just that, by checking that there is room for n tasks in the calling thread’s workset, and flushing it out for a fresh one if necessary. This does not obligate you to actually post anything, and not doing so makes calling reserve (n) another way to flush the task buffer. Setting the threshold too high will be limited by a lower dwell (n) setting, or by Immediate tasks that arrive. Grouping tasks in this manner is a practical way of serializing several tasks, at any time, in the midst of unordered multiprocessing chaos. The posting side is strictly single-threaded, so these choices can be made synchronously, without atomic operations.

Multi-phase workheap strategy

A thread’s workheap can be partitioned into two or more work phases that are sequentially serviced, providing ordering at larger scales without giving up any concurrency within each work phase. Work phases permit unordered computation in one phase to depend on, and wait for, results from completed earlier phases. Individual phases are unbounded and dynamically managed. Each thread always posts into its own current work phase, but may advance at any time to a new phase. Prior phases are multiprocessed in order, with phase i completed before phase i+1 begins.
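The reserve(n) behavior can be sketched the same way. This is an interpretation of the description above, not Proton's code: if the current workset lacks room for n more tasks, it is flushed so the next n posts land together in one fresh workset, which a single service thread later processes in order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of reserve(n) semantics (assumed, per the text): make room so
// the next n posts stay together in one workset; posting nothing after
// reserve() simply leaves the buffer flushed.
struct Workset {
    std::vector<int> buffered;
    std::vector<std::vector<int>> workheap;
    std::size_t dwell = 8;

    void flush() {
        if (!buffered.empty()) { workheap.push_back(buffered); buffered.clear(); }
    }
    void reserve(std::size_t n) {
        if (buffered.size() + n > dwell) flush();   // start a fresh workset
    }
    void post(int task) {
        buffered.push_back(task);
        if (buffered.size() >= dwell) flush();
    }
};
```

As the text notes, a reserve(n) larger than the dwell setting cannot be honored; in this sketch such a call degenerates to a flush.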
Servicing proceeds through each phase and finishes with the latest phase. A thread does not have to divide its work into phases, but may choose instead to post and process the current phase continuously, without ever advancing it. Work phases are independent by thread and exist for the convenience and benefit of their respective threads. Work phases in separate threads are temporally unrelated, but can expect continuous attention from multiprocessing services. Only the thread posting the work generally knows whether any such dependencies might exist.

Threads post unordered tasks into an internal thread-local workset, consisting of a light-weight single-cacheline buffer of task pointers. When the workset fills or is submitted, it is appended to its current work phase, which may itself be in a state of being serviced. Worksets spread the transaction costs across multiple tasks, which is particularly useful for fine-grain tasks. Worksets hold tasks until they are submitted to the workheap for servicing, but Proton performs that submission automatically at various opportunities.

The first post into an empty and idle per-thread workheap attempts to awaken a blocked thread for duty. If there are none, a new thread may be started if there are more processors to bring into play. Each thread looks at its own situation and asks for more help (i.e. from other threads) when that seems necessary. Threads that do not opt out of workheap service enter workheap service during DoEvents ( ) calls and during their blocking state. Only the oldest unfinished work phases are ever processed at any given moment, followed by each phase after that, if any. A servicing thread begins by choosing an active subscriber at random, then working off its earliest work phase, choosing again, and so on.
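The phase-ordering rule above can be shown with a deliberately single-threaded sketch (illustrative, not Proton's implementation): posts always go to the newest phase, newphase() advances the posting phase, and servicing drains phase i completely before any task in phase i+1 runs.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Single-threaded sketch of multi-phase ordering: tasks within a phase
// are unordered, but phases complete strictly in sequence.
struct PhasedHeap {
    std::vector<std::deque<int>> phases;

    PhasedHeap() { phases.emplace_back(); }        // start with one open phase

    void post(int task) { phases.back().push_back(task); }
    void newphase()     { phases.emplace_back(); } // subsequent posts wait

    // Drain strictly in phase order; returns tasks in service order.
    std::vector<int> service() {
        std::vector<int> done;
        for (auto& p : phases)
            while (!p.empty()) { done.push_back(p.front()); p.pop_front(); }
        return done;
    }
};
```

In the real workheap the draining is concurrent within each phase; only the phase boundary acts as an ordering barrier, which is what makes phases cheaper than full task ordering.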
Work distribution across threads

The workheap consists of a set of thread-local queues assigned across participating threads. Each thread posts its work only to its own private queue within the workheap, without interference to or from other threads. Posting to just one of these queues is still serviced by all participating threads, but that can also create a hotspot, with too many threads going after one queue. A better work distribution puts more work in front of more processors more quickly, for a faster completion.

A cascading work distribution approach posts large chunks of work, which are picked up by other threads and reposted in smaller pieces, quickly spreading the work across all participating threads. This results in many servers with many clients, rather than one server with all the clients, scaling easily to any number of processors and converging rapidly to completion. Such a work task simply processes a small chunk, or divides a larger chunk in half, reposting one of the halves back to the workheap and repeating with the remainder. This rapidly breaks down work while minimizing added transactions into the local workheap queue. Different scenarios and types of work define their own tasks for custom task breakdown and servicing.

This method has been tested and demonstrated in the fractal application, and shown to be highly effective in rapidly and evenly distributing large requests across all participating threads. Such work distribution activity can be viewed and evaluated in real-time, from the Proton thread activity monitor. See the fractal.cpp source code for further details about fractal.exe implementation.

Concurrent service threads

Global workheap structures maintain a list of subscribing threads, their active state, and how many processors are servicing each and in total. Additional servicing threads use this information to choose which subscriber to service next and when to request additional processor assistance.
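The divide-in-half cascade can be demonstrated with plain standard C++ (Proton's real workheap and task classes are not reproduced here): workers take a chunk, split anything large in half, repost one half for other workers to pick up, and keep working the remainder. The work spreads across all workers without any central divider.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Illustrative cascade: sum the integers in [0, n) by splitting ranges.
std::deque<std::pair<long long, long long>> chunks;  // [lo, hi) ranges
std::mutex chunks_mutex;
std::atomic<long long> total{0};
std::atomic<int> outstanding{0};   // chunks posted but not yet finished

bool try_pop(std::pair<long long, long long>& c) {
    std::lock_guard<std::mutex> g(chunks_mutex);
    if (chunks.empty()) return false;
    c = chunks.front(); chunks.pop_front();
    return true;
}

void worker() {
    std::pair<long long, long long> c;
    while (outstanding.load() > 0) {
        if (!try_pop(c)) { std::this_thread::yield(); continue; }
        while (c.second - c.first > 1000) {          // too big: split in half
            long long mid = c.first + (c.second - c.first) / 2;
            outstanding.fetch_add(1);
            { std::lock_guard<std::mutex> g(chunks_mutex);
              chunks.push_back({mid, c.second}); }   // repost one half
            c.second = mid;                          // keep the remainder
        }
        long long s = 0;
        for (long long i = c.first; i < c.second; ++i) s += i;
        total.fetch_add(s);
        outstanding.fetch_sub(1);                    // this chunk is done
    }
}

long long cascade_sum(long long n, int nthreads) {
    chunks.push_back({0, n});
    outstanding.store(1);
    total.store(0);
    std::vector<std::thread> pool;
    for (int i = 0; i < nthreads; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    return total.load();
}
```

The design point matches the text: each split adds only one queue transaction, yet the number of available chunks doubles every round, so all workers ramp up within a few splits.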
With insufficient threads available for service, the workheap makes more threads available by starting EventHandler threads (shown as “zombies”), which do nothing but block, waiting for events to process. After a few moments without any new material to process, these supporting threads time out and exit automatically. On phase completion, service threads randomly choose another active subscriber or exit the workheap. Similarly, prospective service threads that find the earliest phase empty cannot make any progress, and randomly choose another active subscriber or exit the workheap. The final processor to find the service phase empty advances the service phase to the next phase (but not past the posting phase). This guarantees all earlier phases are complete before processing the next one. Calling workheap‑>service ( ) performs a service cycle in the calling thread, but the function returns harmlessly, doing nothing, if it is already entered, disabled, or there is nothing to do.

If the workheap draws too much processor attention away from other application activities, you can exclude one or more processors from workheap service with the workheap‑>exclude (n) function, which acts globally to limit processor participation (0 or 1 are the best values). This can also help overall system responsiveness under high loads. Any thread may opt out of being called into service by calling thread->is (RealTime), which indicates the thread cannot tolerate much delay or latency. For example, the thread monitor and the winpump threads do not service the workheap. This keeps the message flow moving and the monitoring display current, even when all processor effort is servicing the workheap. Similar accommodation may be necessary in a
variety of multithreaded application scenarios, specifically for threads involved in activities considered time-critical. Within each thread, this setting can be changed back and forth as often as needed. You can view the task latency for any system of Proton threads through the activity monitor.

Service workchannel architecture

To manage the multi-threaded operation of workheap activities, the workheap maintains a fixed set of internal “workchannels” that are assigned to threads for posting worksets into the workheap. Among other things, the workchannel ensures that puts ( ) and gets ( ) always access the proper work phase queues, that queues are allocated, initialized and freed at the right times, that new threads are started when needed, and that available work is presented to prospective threads for servicing. Workchannels are internal to the workheap, so there is no API for them, but their existence can be useful to know about.

Workchannels remain connected to their thread until released by that thread. They are never taken away by another thread in need of a workchannel. Instead, empty and inactive channels are returned to the workheap for reassignment to another posting sequence. This occurs at the end of a work sequence, after its servicing has completed. The channel array is pre-allocated in the workheap, to avoid problems with stale pointers and other multithreaded complexities. Its defined channel capacity should be set sufficiently high to meet the largest possible instantaneous thread demand. The channel internally allocates and frees the queues it uses to manage its own multi-phase work load operation.

Workchannel assignment comes out of a wait-free allocation bit table. This map is quickly searched and modified by any number of threads. It is read often, but written comparatively rarely. Each channel knows its thread owner, and each thread knows its channel.
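An allocation bit table of the kind described can be sketched with a single atomic 64-bit mask (Proton's actual layout is not published; this is an illustration). A set bit marks a channel in use; acquisition retries a compare-and-swap on the whole mask, so no locks are involved.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Sketch of a 64-channel allocation bit table. Strictly speaking this
// CAS retry loop is lock-free rather than wait-free, but it captures the
// read-often, write-rarely usage pattern the text describes.
std::atomic<std::uint64_t> channel_map{0};

int acquire_channel() {
    std::uint64_t m = channel_map.load();
    for (;;) {
        if (m == ~std::uint64_t(0)) return -1;          // all 64 busy
        int bit = 0;
        while (m & (std::uint64_t(1) << bit)) ++bit;    // first clear bit
        if (channel_map.compare_exchange_weak(m, m | (std::uint64_t(1) << bit)))
            return bit;                                  // claimed atomically
        // on failure, m was reloaded with the fresh mask; retry
    }
}

void release_channel(int bit) {
    channel_map.fetch_and(~(std::uint64_t(1) << bit));   // single atomic clear
}
```

Release needs no retry loop at all, which is consistent with channels being surrendered only by their owning thread.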
Thread ownership over a workchannel is obtained and surrendered by the posting thread, the latter happening after work is complete with nothing further posted. This promptly recycles unused workchannels for immediate availability to threads when needed, cycling once per continuous sequence completion.

Thread notifiers

Notifiers are objects a thread holds that implement notification from other threads. Completing the local workheap can activate a thread notifier, if one was defined in the client thread ahead of time. Notifiers run in the servicing thread, not the client thread, so the client’s virtual activate ( ) call should be written with that in mind. Applications may use a notifier to invalidate and update the display on completion of any work in their local workheap. It could do anything else that is sensible there as well, within multi-threaded sensibilities. Notifiers don’t have the open-ended flexibility of completion tasks, but they can handle situations that cannot tolerate the task latency.

Scalability

Workheap capacity ramps up automatically at multiple scales. First, workset capacity expands to accommodate the current dwell ( ) setting. The filled worksets are posted to bounded queues, but when a queue fills up, a new queue is created and linked in to allow unobstructed posting. Multiple queues provide more opportunities for keeping processors busy with low contention, helping to isolate posting from processing and processing from processing, increasing bandwidth at the very moment the load is getting larger.
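The warning that activate ( ) runs in the servicing thread can be made concrete with a small sketch (the class names here are hypothetical, not Proton's): the client installs a notifier ahead of time, and whichever service thread completes the client's workheap calls activate(), so the override should do only thread-safe things, such as setting an atomic flag the client thread later inspects.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical notifier shape: the virtual activate() is invoked from
// the servicing thread, never the client thread.
struct Notifier {
    virtual void activate() = 0;
    virtual ~Notifier() = default;
};

struct RepaintNotifier : Notifier {
    std::atomic<bool>& repaint_requested;
    explicit RepaintNotifier(std::atomic<bool>& flag) : repaint_requested(flag) {}
    void activate() override { repaint_requested = true; }  // safe cross-thread
};

// Simulate a service thread firing the notifier on workheap completion.
bool run_demo() {
    std::atomic<bool> repaint{false};
    RepaintNotifier n(repaint);
    std::thread service([&n] { n.activate(); });   // runs in the service thread
    service.join();
    return repaint.load();
}
```

Deferring the actual repaint to the client thread, as in this flag pattern, is one way to stay within the "multi-threaded sensibilities" the text asks for.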
This arrangement is spread across all the threads that post work to the workheap, with all available processors consuming it everywhere as quickly as possible. The ramifications of all this can be viewed in the Proton activity monitor in real-time, to assist application debugging and balancing. Putting unlimited processors on one service channel is feasible because multiple servers are spread across multiple queues to avoid contention. Each client (posting) thread maintains its own part of the workheap, so multiple clients posting work increase the available servicing bandwidth, when there are more processors to run them.

One way to spread the work across all participating threads is to post larger tasks that are picked up by other service threads and reposted as many smaller tasks. This seeds the distributed workheaps to multiply concurrent servicing possibilities. You cannot post tasks directly into the work phases of other threads, but those threads do post into their own channels, as part of their concurrent servicing of the tasks you post. This method is used by the fractal sample application, whose tasks repeatedly repost half the task down to the tile level. See the code in fractal.cpp for implementation details.

Workchannels are limited to 64, meaning “only” that many threads can be actively posting tasks into the workheap simultaneously. But since just one posting thread can often easily overwhelm all available processors with task overload, it is questionable whether that many posting threads are even serviceable, short of putting sufficient processors on the job. However, many threads streaming tasks to the workheap at sustainable rates will all see the gain in performance and reduced latency they expect. Which workchannel to process next is selected at random, by incoming service threads, from those eligible to run. Overload is not usually a problem, however, because tasks simply queue up in dynamic queues until servicing arrives.
Once serviced to empty and quiescent, channels are returned to the workheap for reassignment to another thread, which may start posting to the workheap at any moment. This keeps all unused channels available. If a thread cannot obtain a channel, it services the workheap until a channel becomes available, making progress in either case. Future Proton releases may process work phases differently for more effective scheduling (as internal ordering is undefined), but the present architecture can efficiently utilize at least 64 processors when many threads are both posting and servicing.

The workheap actively avoids employing any more service threads than the actual number of processors made available to the running process. You can hold back some processors from the workheap by calling workheap‑>exclude (n), to keep other threads responsive in specific scenarios. Excluding at least one processor from situations involving continuous high-load processing is beneficial, allowing the rest of the system to breathe and process normally when your application is taking over everything. By monitoring their own workloads, service threads can drop out of service if they see utilization below 25% for some time interval. Low utilization indicates the thread is not doing much and can leave, as long as other threads are present. Such threads often hold resources useful to the other threads that remain, and releasing them will recycle those resources.

Hyperthreading the workheap

Because Proton workheap scheduling tends to even out the load among multiple service threads, hyperthreaded workheap servicing provides no real advantage for computation-bound material. Hyperthreading works better with uneven loads among threads, where under-utilization in one thread increases availability for another thread. However, such cooperative match-ups are highly dependent on the application loads present.
Therefore Proton effectively hides HT from consideration, with the workheap depending only upon the physical processor count, even when HT is enabled.
I/O-bound material, like that involving file processing and network access, may expose more opportunities for modest gains using HT. Hyperthreading is a system-wide setting, usually chosen to benefit system-wide performance. Individual applications should be able to accommodate this situation as it arises, either way. Hyperthreading is also helpful for testing multi-threaded software behavior across a wider variety of multi-processor configurations. Processor limits in Proton have nothing to say about how many threads you choose to run, where HT can provide its full bounty. Without HT, it is sometimes useful to reserve at least one processor, leaving it under-utilized, to avoid starving the rest of the system of processor attention. Hyperthreading makes this less of an issue, because employing all physical processors still leaves hyperthreads available to keep the rest of the system advancing.

Multiprocessing resources in Proton

class ActiveRole — generic thread class in Proton from which you derive your thread classes; you define their virtual behavior ( ), add your data, and start them with postevent (0); virtual functions define initialize/teardown, blocking/unblocking, etc.
thread‑>behavior ( ); virtual function where you define and implement your thread logic; it is called by thread startup services in Proton, not by the application
thread‑>postevent (0); brief thread wakeup, or start thread if suspended or never started; starts only after construction, never from inside a base class constructor
thread‑>is (RealTime); set to indicate a thread with no tolerance for extra delays; clear this state on threads willing to service the workheap (default); each thread manages its own real-time setting
thread‑>notify (obj); set thread notifier to activate when local workheap tasks are complete
thread‑>notify (enum); attempt to trigger a specific notifier defined (or not) in the thread
class TaskEvent — base class for tasks; you define its virtual action ( ) and data to suit
task‑>action ( ); virtual function where you define and implement your task logic; it is called by event processing services in Proton, not by the application
task‑>is (Unordered); marks task as unordered (often set in the task constructor)
task‑>is (Immediate); marks tasks that immediately post, potentially with others, to the workheap
task‑>is (LowPriority); marks tasks that are serviced after (regular) high priority tasks
thread‑>postevent (task); posts ordered tasks to a thread, or unordered tasks to the workheap
task‑>postback ( ); posts a task back to the thread it came from after being received
TaskGroup z (task, target); associates a completion task to the tasks posted while z remains in scope
DoEvents ( ); processes all thread-local events, services the workheap as needed, no waiting
DoEvents (n); processes all thread-local events/workheap, waiting up to n msecs for work
WaitSecs (period); calls DoEvents ( ) for a time period, with accurate timing, blocking as needed; both WaitSecs ( ) and DoEvents ( ) return 0=exiting, 1=idle, 2=active
WorkHeap *workheap; application-global workheap, always defined throughout the app session
workheap‑>postevent (tsk); post a task to the calling thread’s work channel into the workheap
workheap‑>postevent (0); post any buffered tasks from the calling thread out to the workheap
workheap‑>dwell (lim); set how many tasks accumulate before being put into the active work phase
workheap‑>service ( ); services the workheap in the calling thread, as needed and enabled; fast out when there is no work, it has enough processors, is disabled or is unneeded; willing threads call in automatically to ensure fast workheap response
workheap‑>newphase ( ); begin a new local work phase, so that subsequent posted work is not started until earlier phases finish
workheap‑>reserve (nt); keeps the next nt posts in the same workset buffer (up to dwell)
workheap‑>exclude (np); excludes np processors from service that would normally be available
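The ordered-task half of the reference above follows a simple pattern that can be mimicked in standard C++ (these are stand-ins, not Proton's real classes): tasks derive from a base with a virtual action(), are posted to a thread-local queue, and a DoEvents()-style call drains them in posted order, returning 1 (idle) when nothing remains.

```cpp
#include <cassert>
#include <deque>
#include <memory>
#include <string>
#include <vector>

// Minimal stand-in for the TaskEvent / postevent / DoEvents pattern.
struct TaskEvent {
    virtual void action() = 0;
    virtual ~TaskEvent() = default;
};

struct LogTask : TaskEvent {
    std::vector<std::string>& log;
    std::string msg;
    LogTask(std::vector<std::string>& l, std::string m) : log(l), msg(std::move(m)) {}
    void action() override { log.push_back(msg); }
};

std::deque<std::unique_ptr<TaskEvent>> thread_queue;  // one per thread in Proton

void postevent(std::unique_ptr<TaskEvent> t) { thread_queue.push_back(std::move(t)); }

int DoEvents() {
    while (!thread_queue.empty()) {
        std::unique_ptr<TaskEvent> t = std::move(thread_queue.front());
        thread_queue.pop_front();
        t->action();            // service each event in posted order
    }
    return 1;                   // 1 = idle, per the return codes above
}
```

The real DoEvents ( ) additionally services the workheap and can return 0 (exiting) or 2 (active); this sketch covers only the ordered, thread-local drain.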
A workheap rather than work-stealing?

Work-stealing is a proactive approach to concurrent task scheduling that has similarities with Proton workheap services. It is, however, only one concept of many that go into building a practical multiprocessing system. Here are some other considerations:

•• Work-stealing is a pull technology; Proton’s workheap is more of a push-pull work-distribution technology, where threads know about and cooperate with one another.

•• Rather than chasing volatile thread objects and stealing work from their queues, the workheap centralizes multiprocessing choices, making available work more directly accessible to interested threads, with less contention. Still volatile, but much more contained, with fewer cache-line load implications.

•• Once you let work-stealing threads pull tasks out of your thread queues, processing your task queue in posted order can no longer be guaranteed. Since support for posted order is mandatory for many things, work-stealing by itself is inadequate.

•• Proton threads support both ordered tasks, processed by specific threads, and unordered tasks, multiprocessed by many threads. Workheap multiprocessing activities are independent of the ordered tasks normally posted to specific threads, so ordered and unordered tasks really represent separate workflows.

•• Each thread posts unordered work in one or more phases, to be multiprocessed in that order, one by one. Multiple processors concentrate on each phase and complete it before starting the next phase. The current phase may be posted and processed concurrently; prior phases are processed until empty and released.

•• Proton’s posting side is fast, lock-free with few atomics, and distributed across participating threads. The servicing side is wait-free within and across all work phases.

•• The workheap knows how to wake up blocked threads and put them to work as unordered work piles up awaiting service.
This wakeup occurs when threads try to post things. Tasks posted and on their way to a thread queue are diverted to the workheap when marked as “unordered tasks”. This indicator can be set on task creation, or marked somewhere along the way. Those responsible for constructing and posting tasks can be expected to know which of their tasks have thread affinity, which tasks do not, and to mark them accordingly before posting them. By default, tasks and events are ordered, and processed by their target thread.
Fractal explorer application

This sample application illustrates some of Proton’s standard multiprocessing services brought to bear. Fractal.exe was an older single-threaded application with crushing computational needs, requiring the highest CPU performance and a tiny display to make it at all interesting and responsive. It was never designed for multi-processing, and earlier attempts to make it so found the effort overly complicated, with mixed and disappointing results. This made fractal.exe a perfect test candidate for applying Proton multi-processing technology. All source code for fractal.exe is included as a sample Proton project, with code you can reshape and derive your own work from as needed.

Originating from user-initiated changes in view, the required graphics are split into hundreds of graphic tile regions (under 1200 pixels each) that are posted to the workheap for computation. The finished tiles are then posted to the winpump for direct screen rendering, which side-steps heavy critical-section use during updates (a big cost) by instead using single-threaded direct rendering (a tiny cost, in this case). Window updates occur when all events are (fleetingly) complete and the display has been marked modified. Since individual tiles require a relatively large computational effort, just one task per workset can be used for this application.

Realistically, running fractal.exe requires a 16 GHz uniprocessor, but any quad-core CPU clocked over 3 GHz will do. Even dual-core CPUs are not quite enough to keep things interesting. Fractal runs twice as fast on a 4 GHz 6-processor (Gulftown) as on a 3 GHz 4-processor (Bloomfield). The thread monitor shows the multi-threaded blow-by-blow action (brought up with ctl-alt-). Larger window sizes have more pixels to compute, so you get faster display response with smaller windows.
With a large enough image size to render and animate, Fractal.exe can still bring any single CPU to its knees, no matter how many processors it has, no matter how high its clock rate (i.e. for now). You can change the display size, drag the fractal image around with the mouse, and zoom in/out with the mouse wheel, or with a 1st/2nd mouse button click. The mouse wheel is more fun when you keep the window size smaller, but you can expand back when you arrive some place interesting. Its windows apply different sizing responses to form-resize, depending on the shift key state. The application remembers where you left it on the screen at the last exit, and comes up with the same content the next time. Full-screen and multi-screen image sizes are supported.

Fractal.exe is particularly interesting and instructive when you have the thread activity monitor up (ctl‑alt-) while you explore the fractal. The effects from variable loads quickly lead to resource changes and action in more or fewer service threads, as they compete for the work available, showing up in real-time samples. Fractal.exe has been designed and used as a test program to exercise and hammer the Proton multiprocessing architecture, to watch how it survives variable loads, and to learn things that go back into making Proton services better. As such, a number of performance opportunities have been foregone in fractal.exe, as its sledgehammer characteristics remain important for Proton quality-assurance testing. Future versions of fractal.exe may go beyond the present one, when time permits, with some of the cool ideas already documented in its source code.