# HG changeset patch
# User Sean Halle
# Date 1391090394 28800
# Node ID d6450ce874a90cf20c3b39e63f68f600e04cff48
# Parent 8cb89eed66b40defa21297122e444d11e9fc4950
worktable, PR toolkit, figures, HiPEAC poster, CV, Entr workshop

diff -r 8cb89eed66b4 -r d6450ce874a9 0__Papers/Languages/Work_Table/WorkTable_tutorial/latex/Sept_2011__WorkTable_tutorial.tex
--- a/0__Papers/Languages/Work_Table/WorkTable_tutorial/latex/Sept_2011__WorkTable_tutorial.tex	Tue Dec 31 10:49:41 2013 -0800
+++ b/0__Papers/Languages/Work_Table/WorkTable_tutorial/latex/Sept_2011__WorkTable_tutorial.tex	Thu Jan 30 05:59:54 2014 -0800
@@ -262,7 +262,7 @@
 \author{Sean Halle}
 %\email{seanhalle@yahoo.com}
-\institute{OpenSource Stewardship Foundation}
+\institute{Open Source Research Institute}
 \maketitle
@@ -277,44 +277,51 @@

To do a unit of work, take a copy of the instructions, plus a box-paper, go get the box, then follow the instructions on the contents. We'll call the list of instructions plus the box named by a box-paper a WorkUnit, and we'll call the box containing the materials a WorkUnit Box.

So, what's inside a box? A WorkUnit Box (WUB) holds all the materials, plus notes from previous work done on those materials, like post-it notes that let later work-units know about things. Let's say we're writing a system for a team of workers to build a house. During construction, the parts of the house can't actually be put into the box, so instead a ticket representing ownership of the material is put into the box, like a ticket that owns ``the south wall''. Only boxes that have a ticket for part of the house can do work on that part of the house.

Let's say a box-paper lands in the ``put in window'' pile; the box will contain a ticket that owns a wall in the house, plus a ticket that owns one of the windows sitting in the storage lot. The reason for the tickets is to give us a way to make sure two workers don't try to use the same material for two different bits of work at the same time! We'll see more about keeping the workers out of each other's hair soon.

Great, now we have a box that holds all of the tickets for the materials used in the work, we have piles that have the instructions for what to do on the materials, and we have a table to organize it all, so how does the work actually happen?

BLIS provides the table, and a person who sits at the table and moves the box-papers among the piles. We'll call them the Scheduler Person, or just the Scheduler. A worker walks up to the table, and the Scheduler Person hands them a list of instructions plus a box-paper. The worker walks off, finds the box, and follows the instructions on the materials owned by the box. When done, they hand the box-paper back to the Scheduler Person, who follows the arrows, puts the paper in a new pile, then picks another set of instructions plus box-paper and hands it to the worker, and so on. Because the Scheduler Person is part of the BLIS Work Table system, you, the programmer, don't know what choices they will make, but you can tell them constraints on their choices; more on that later.

That's the basic idea of how a system is used.
As a programmer, your task is to state the piles needed, draw the arrows between the piles, and write the instructions for each pile. When done, you'll have a system for making the thing you want made, and any crew of workers will be able to show up and make one, no matter how many people they might have!

\subsection{Walk through}

Let's work through an example of making such a system for building a house. As a programmer, the first thing you write is the instructions that tell the scheduler person how to set up the table, including what piles to make, the arrows between the piles, and the instructions to put at each pile.

When the program starts, these setup instructions run first. After the table is set up, the scheduler person starts picking box-papers from the piles and handing them to workers, with a copy of the pile-instructions, then getting box-papers back from the workers, following the arrows to put those into the appropriate piles, and repeating the picking and handing out, until the job is done.

The instructions for how to set up the table go into the ``setup'' function. In setup, you'll use built-in functions that start with ``BLIS\_WT\_\_'' to tell the scheduler person the name and type of each pile (BLIS\_WT\_\_create\_work\_pile...), the pile-instructions (a standard C function), and the arrows between the piles (BLIS\_WT\_\_connect\_pile\_to\_pile...). Like this:

\begin{verbatim}
BLIS_WT__start_work_table_setup( WT_FOR_HOUSE_CONSTRUCTION );
BLIS_WT__create_work_pile( FRAME_A_WALL,  &make_the_frame_for_a_wall );
BLIS_WT__create_work_pile( PUT_IN_WINDOW, &put_window_into_wall );
BLIS_WT__create_work_pile( PUT_IN_DOOR,   &put_door_into_wall );
BLIS_WT__connect_pile_to_pile( FRAME_A_WALL,  PUT_IN_WINDOW );
BLIS_WT__connect_pile_to_pile( PUT_IN_WINDOW, PUT_IN_DOOR );
BLIS_WT__end_work_table_setup( WT_FOR_HOUSE_CONSTRUCTION );
\end{verbatim}

When creating a pile, the first argument is the name of the pile, as an enum constant; the second is the pile-instructions, a pointer to a function (a sketch of such a pile-instruction function appears below). When connecting piles, the first argument is the name of the from-pile, the second the name of the to-pile. Notice that this work-table arrangement only has parallelism for doing multiple walls at the same time. On a single wall, the work goes in order: frame, then put in windows, then put in door. This shortcoming will be fixed with the new kinds of piles introduced next.

Not all piles are the same; some are special ones used only by the scheduler person, in effect to make sure no two workers interfere with each other, while still keeping all workers busy. For example, going back to installing a window, imagine that one of the workers took the window and tried to install it before the wall was finished! Or that a worker painted the wall before the window was in, which would leave the window frame unpainted!
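Before looking at the special scheduler piles that solve such problems, here is the promised sketch of a pile-instruction function, for the PUT\_IN\_WINDOW pile created above. The \texttt{Ticket} and \texttt{WorkUnitBox} types and their fields are made-up stand-ins for whatever the real BLIS Work Table headers define; the point is only the shape: an ordinary C function that receives one WorkUnit Box and works on the materials whose tickets are in that box.

\begin{verbatim}
#include <stdio.h>

/* Stand-in types: the real ones come from the BLIS_WT headers. */
typedef struct { int materialID; } Ticket;
typedef struct { Ticket wallTicket; Ticket windowTicket; } WorkUnitBox;

/* Pile-instructions for PUT_IN_WINDOW: a plain C function handed one
   WorkUnit Box; only materials whose tickets are in the box may be
   touched, which is how two workers are kept off the same material. */
void put_window_into_wall( WorkUnitBox *box )
 { printf( "installing window %d into wall %d\n",
           box->windowTicket.materialID, box->wallTicket.materialID );
 }
\end{verbatim}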
To prevent these problems, there are piles called ``hold until'': a box-paper is held in one of these piles until something happens that releases it. The instructions that go with a HoldUntil pile get performed by the scheduler person, and tell them, for a given box-paper that comes in, which releasing box-paper to watch for and where to watch for it (or them, because sometimes a held box-paper has multiple propendent box-papers).

In this example, let's say that when a wall is done, two box-papers are made for the WUB that owns it: one box-paper goes to a pile where all the windows get put into the wall, the other box-paper goes to the pile where the wall gets painted. If the second box-paper goes right in, it might get handed to a worker too early! It's only when a box-paper leaves the ``Install windows'' pile that the box-paper for the same box can leave the ``hold until windows installed'' pile and flow into the ``paint'' pile.

One complication is that the scheduler person has to do some bookkeeping when watching for propendent box-papers. So for each spot that a HoldUntil might tell the scheduler person to watch, a Watcher is created, which causes a place to be set up for the scheduler person to keep track of which box-papers have passed that spot.

Here's what the setup code for such a HoldUntil would look like:

\begin{verbatim}
BLIS_WT__create_Watcher_at_pile_exit( INSTALL_WINDOWS, INSTALL_WINDOWS_EXIT );
BLIS_WT__create_HoldUntil_pile( HOLD_FOR_WINDOWS, &calc_ID_of_releasing_box );
BLIS_WT__connect_pile_to_pile( HOLD_FOR_WINDOWS, PAINT_WALL );
\end{verbatim}

The first line creates a watcher: the first arg is the pile-exit to watch, the second is the watcher's name. The second line creates the HoldUntil: the first argument is the name of the pile, the second is a pointer to a C function that calculates which WorkUnit Box is being waited upon, and where to watch for it. This function will use the name of the watcher that was created on the first line.

Another kind of scheduler pile is called a ReJoiner. Sometimes, two different materials will have things done to them separately, but they have to join up again and be used together. For example, a door fits a particular frame; the two get given to two different work-unit boxes, and separately get sanded down, primed, and stained. When both are done, the door has to be joined back together with its frame, so a ReJoiner pile is made that matches the work-unit box that owns the door to the work-unit box that owns the frame.

You, the programmer, write a function that takes a box and calculates a ``match number'' for it. The Scheduler Person sitting at the table will take the \texttt{match\_number} function you write and use it to find pairs of work-unit-boxes. They put the pairs, together, onto the next pile.
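Before continuing, here is a minimal sketch of such a match-number function, assuming a hypothetical \texttt{WorkUnitBox} struct with an \texttt{ID} field (the real type comes with the BLIS Work Table headers). Returning the box's own ID pairs up the two copies of one box-paper, which is exactly what the door-and-frame example below relies on:

\begin{verbatim}
typedef int MatchID;
typedef struct { int ID; /* ...tickets, post-it notes... */ } WorkUnitBox;

/* Called by the scheduler person on each box arriving at a ReJoiner
   input; boxes whose match numbers are equal get paired up.        */
MatchID calc_match_ID( WorkUnitBox *box )
 { return box->ID;
 }
\end{verbatim}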
Continuing the example, to let doors and their frames be worked on separately, one of the piles has to separate them! The pile that does this will receive the door + frame as a unit, and has two choices on how to proceed. The first is to make two new work-unit-boxes and give one ownership of the door, the other ownership of the frame, then send those two along different arrows. The second is to make a copy of the piece of paper with the ID of the work-unit box, then send one copy down the path for doors and the other copy down the path for frames.

The second option is less work, so try that. At the spot on the table where the two paths come back together, place a ReJoiner, which will have two inputs, one from each path. To each input attach a function that just takes the ID of the work-unit-box and returns that as the match-ID. This will cause one copy of the paper with the work-unit-box ID to be matched with the other copy. This way, it's certain that both the door and the frame have completed before doing any more work with them.

Here's the code, sketched with representative pile and function names, using the \texttt{calc\_match\_ID} function shown earlier:

\begin{verbatim}
/* sketch: the dispatch pile separates doors from frames; the ReJoiner
   is made with the make-match-ID function for each of its two inputs */
BLIS_WT__create_dispatch_pile( SEPARATE_DOOR_AND_FRAME, &copy_box_paper );
BLIS_WT__connect_pile_to_pile( SEPARATE_DOOR_AND_FRAME, SAND_DOOR );
BLIS_WT__connect_pile_to_pile( SEPARATE_DOOR_AND_FRAME, SAND_FRAME );
BLIS_WT__create_ReJoiner_pile( REJOIN_DOOR_AND_FRAME, &calc_match_ID, &calc_match_ID );
BLIS_WT__connect_pile_to_pile( SAND_DOOR,  REJOIN_DOOR_AND_FRAME );
BLIS_WT__connect_pile_to_pile( SAND_FRAME, REJOIN_DOOR_AND_FRAME );
\end{verbatim}

The next scheduler pile is called a Keeper pile, which keeps around work-unit boxes that might be needed at some point.

As an example, say one of the work-piles makes a special jig for cutting the holes in a door to install a door-knob. Some of the doors arrive with the door-knob already installed, so it's not certain which doors will need the jig. A separate jig is made for each kind of door, and when complete, the jig flows into a Keeper. More precisely, a ticket for the jig is put into a work-unit box and the box-paper for that box flows into the Keeper-pile.

We will need some way to steer box-papers, in this case either sending to the Keeper so a door-knob can be installed, or by-passing the door-knob-install process. Another kind of scheduler pile gets used for this, called a Steer-pile. The instructions for one of these tell the scheduler person which of many outputs
to send a box-paper on. All work-piles have only a single output spot, so a Steer-pile is needed to steer a box-paper among many possible routes.

Continuing the example, let's say the box-paper for a door flows into a work-pile. The instructions say to check if the door has a knob installed. The worker looks, and this door is missing the knob, so the instructions tell the worker to write a post-it note saying the box needs a knob installed, and another post-it saying the type of door. The instructions say where to put these notes in the box.

From there, the box-paper flows into a Steer-pile, where the instructions tell the scheduler person to look inside the work-unit box and find the post-it left there by the worker. If the note says the door needs a knob installed, it flows out the ``needs knob'' output; otherwise it flows out the ``has a knob'' output.

The ``needs knob'' output was connected to the request-pile of the Keeper that has the jigs, where the instructions say to get the post-it that has the kind of door written on it. The instructions then have a numbered list and tell the scheduler person to match the name on the post-it to a name in the list, then look at the list number. This number is the ID of the work-unit-box that owns the jig needed for that kind of door!

At this point, the scheduler person goes and checks whether the work-unit-box with that ID on it is sitting in the Keeper room. If not, they leave a note for themselves, which they'll check when putting work-unit-boxes into the room. Either way, whenever the work-unit box is in the room, the scheduler person sees it and puts its box-paper together with the box-paper from the request-pile, and flows them out of the Keeper along the arrow leaving the Keeper's ``response'' output. In this example, the pile pointed to will be the work-pile for installing a door-knob!

When the door-knob installation is complete, the work-unit box that owns the jig gets sent back to the Keeper-pile. When it arrives, the scheduler person checks their notes to see whether any requests are waiting for that work-unit box.

Notice that the scheduler person is the only one with access to the Keeper room, so the scheduler alone controls which request-response pairs have a box-paper for a box that's in the Keeper room. The way the scheduler person keeps track of this requires more depth to explain, but the end effect is that the scheduler can be told to control Kept boxes in a few different ways: ensure that the Kept work-unit box is the only one with a ticket to the thing it has a ticket to, or let there be any number of work-unit boxes out there that have a ticket to that thing; and, in either case, the scheduler person can be told to control how many box-papers to a Kept box it gives out before requiring one of them to return. For this simple tutorial, just assume that the Kept box is the only one with a ticket to the thing inside it, and that only a single box-paper for the Kept box can be outstanding at a time.

For parallelism, notice that a request waiting in the Keeper does not occupy any workers, nor the scheduler. It's only when the requested Kept box gets sent along the response arrow that a new work-unit gets created, to occupy a worker. Also, the scheduler person only spends time checking for a box when the request comes in and when a box-paper for a Kept box enters the Keeper. So, the workers and scheduler person are free to do other things while the request for the Kept box sits in the Keeper, waiting for the box-paper for the Kept box to come back into the Keeper.

When designing your own code, the more places you can make copies of a box-paper and have it go down different paths on the work table, the better! The trick will be putting in HoldUntils and ReJoiners to be sure workers don't get the chance to do things in the wrong order, or to fight over the same thing by trying to use it at the same time! The debugging process will help with finding all the places you need to add HoldUntils and ReJoiners, so feel safe that if one's needed, you'll hear about it!

That's it for this simple introduction to the WorkTable patterns. Now get one of the example programs, and keep this tutorial by your side to remind you of what each of the BLIS\_WT\_\_... calls in the example program does. Happy coding!

%=======================================
\section{Motivation}

Thread parallelism constructs have been well documented to be difficult to program with. They directly expose low-level concurrency to the programmer. Arbitrary non-deterministic behavior and deadlocks can arise from improperly synchronized code.
Efficient execution requires non-blocking algorithms whose correctness requires deep understanding of weakly consistent memory models. In addition, the operating system abstraction for threads comes with a very high context-switching and synchronization overhead.

%% AC. The following is wrong.
%\paragraph{\bf Problems with threads: } Thread parallelism constructs have been well documented to be difficult to program with. In addition, threads are problematic when one has multiple physical cores because the thread abstraction hides the hardware, disabling control over placement of tasks. High performance requires such control. Hence, not only are threads difficult to use directly in applications, but they prevent easier parallelism constructs from being built on top of them that have high performance.

%\paragraph{\bf Partial solution: } To deal with this, in practice, a language's runtime turns off

\paragraph*{A partial solution.}
To deal with the last problem, a parallel language's runtime turns off operating system threads by pinning one to each physical core. This way, the custom runtime is assured that the software thread is one-to-one with a physical core. It then implements a user-level thread package that lets it control which OS thread a computational task is assigned to. Finally, the runtime implements the language's parallel semantics in terms of those user threads.

The user-level threading approach addresses the system overhead issue, but it still hides important events such as input-output or node-to-node communications in a cluster.
%% AC. Irrelevant for a parallelism-centric paper
%software faults, and other hardware events
Scalable runtimes need to coordinate task assignment to cores with application access to input and output, memory allocation over non-uniform cache and memory hierarchies, offloading to hardware accelerators, power management, and inter-node communication in a cluster. The user-level threading approach is hampered in addressing these needs, and further makes the parallel runtime implementation cumbersome, error-prone, and complex, because it is still written in terms of threads.
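To make the pinning step concrete, here is a minimal sketch using the Linux-specific \texttt{pthread\_setaffinity\_np} extension. The worker body and the core count are made up for illustration; a real runtime would run its user-level scheduler inside each pinned thread:

\begin{verbatim}
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

enum { NUM_CORES = 4 };            /* illustrative core count */

static void *core_worker( void *arg )
 { /* a real runtime would run its user-level scheduler here */
   return NULL;
 }

int main( void )
 { pthread_t thread[NUM_CORES];
   for( long core = 0; core < NUM_CORES; core++ )
    { cpu_set_t set;
      CPU_ZERO( &set );
      CPU_SET( core, &set );       /* one OS thread per physical core */
      pthread_create( &thread[core], NULL, &core_worker, NULL );
      pthread_setaffinity_np( thread[core], sizeof set, &set );
    }
   for( long core = 0; core < NUM_CORES; core++ )
      pthread_join( thread[core], NULL );
   return 0;
 }
\end{verbatim}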
Ideally, the OS would be written in terms of a mutable hardware abstraction, and would export mutations as new behavior. We define a mutable hardware abstraction to be an interface to hardware-level behaviors that are normally inside the OS or below it. Examples include communication between cores, allotting time-slots to applications, and establishing ordering of events among cores (which is what atomic memory operations and equivalent patterns of instructions do). The kernel itself would be implemented in terms of such an abstraction, and would accept mutations the same way it accepts device-drivers. It would then export the mutated behaviors for the language to trigger.

A language runtime would take the form of a mutation. Being inside the OS, it has secure access to kernel-only hardware mechanisms. It could directly negotiate with the kernel to manage physical resources, in a low-overhead way. The arrangement enables the runtime to control which task is assigned to which processing element at what time. Both high performance and low energy depend on this for implementing data-affinity techniques. For example, the runtime could track data within the memory hierarchy and assign tasks to locations close to their consumed data.

% \paragraph{\bf Support for portability: } This separation between
% application executable and language runtime also supports
% portability. It packages the language-specific scheduler into a
% separate machine-specific module that is separately installed. The
% scheduler is thus optimized for the combination of language with
% hardware. This isn't a full solution to portability but it is a needed
% precursor.

% An interesting portability technique enabled by a separate
% language-plus-hardware specific runtime is tracking which cores are
% likely to already have data in the cache which a new task requires as
% input. It could place tasks where the input data is likely to reside,
% reducing communication to increase performance and power efficiency.

\paragraph*{Contribution.}

We show in this paper the definition and implementation of such a mutable hardware abstraction, albeit at user level rather than in the kernel. The abstraction lets a language's runtime be implemented as a mutation, which we call a plugin. The plugin implements parallelism constructs and assignment of tasks to cores.

We focus in this introductory paper on the definition of the abstraction and its support for parallelism constructs, postponing exploration of assignment of tasks onto cores and other performance optimizations to following papers. This paper defines multiple time-lines in a program, and a virtual timeline that globally orders events among them. It demonstrates three sets of parallelism constructs: synchronous \texttt{send}-\texttt{receive}, motivated by process calculi; \texttt{spawn} and \texttt{sync} from Cilk {\cite{Fri98,CILKHome}}; and \texttt{mutex} and \texttt{condition variable} from pthreads. The assignment policy we implemented with them is simply first-come first-served.

We call the abstraction Virtualized Master-Slave, or VMS. It exports facilities to create virtual processors and control how their timelines relate to each other, and relate to physical time. It also exports facilities to suspend a virtual processor and for an executable to interact with the plugin. The plugin embodies most of a language's runtime. A wrapper-library or keyword is what appears in application code, and is what triggers the runtime.
\paragraph*{Organization of paper.}

Section~\ref{secAbsModel} provides the concepts and definitions of VMS. Section~\ref{secInternal} focuses on the implementation, describing the elements and how they interact, then relating them back to the theoretical definition. Section~\ref{secApp} takes the point of view of the application code, studying the usage and implementation of parallel language constructs as a VMS plugin. To wrap up, measurements of effectiveness appear in Section~\ref{secResults} and conclusions in Section~\ref{secConclusion}.

\section{Background and Related Work}

User-level thread packages and most parallel language runtimes have to side-step OS threads, by pinning one to each core, which effectively gives the user-level package control over the core. Our VMS implementation also does this. We are not claiming in this paper to have the OS-level implementation of VMS that is possible, but just the user-space version.

\paragraph*{Related work.}

The most primitive methods for establishing ordering among cores or distributed processors are atomic instructions and clock-synchronization techniques \cite{Lamport78,Christian89}.

Meanwhile, the most closely related work is Scheduler Activations \cite{SchedActivations}, which also allows modifying concurrency constructs and controlling assignment of virtual processors onto cores. However, it has no virtual time to guarantee globally consistent sequentialization, and no interface for plugins.

BOM \cite{BOMinManticore}, which is used in Manticore to express scheduling policies and synchronization, also bears a resemblance to VMS, but at a higher level of abstraction. BOM is a functional language, rather than a primitive abstraction meant to sit at the hardware-software boundary as VMS is.

Coroutines are a high-performance means of switching between tasks. Coroutine scheduling and stack-handling techniques were well suited to the user-space implementation of VMS.

Other related work either provides an abstraction of the thread model, or is a full language with specific parallelism constructs. As a prototypical example of user-level threads, Cilk {\cite{Fri98,CILKHome}} provides a simplified abstraction with an efficient scheduling and load-balancing algorithm, but one limited to fork-join concurrency. OpenMP {\cite{OpenMPHome}} is a typical example of a parallel extension of sequential languages; it allows creating tasks and controlling their execution order. We claim that both Cilk and OpenMP, as well as most thread abstractions or parallel languages, may be implemented via plugins to VMS, with similar performance.

VMS is unique in that it doesn't impose its own concurrency semantics as a programming model, but rather takes preferred ones as plugins. This makes it only a \emph{support} mechanism to implement language runtimes; VMS is hidden from the application, underneath the language. Parallelism constructs may be implemented as VMS plugins easily, quickly, and with high performance, as indicated in Section~\ref{secResults}.

This work presents a first incarnation of VMS. We plan to explore the embedding into VMS of a variety of parallel languages, with special interest in coordination languages \cite{Gelernter85Linda,CnCInHotPar,CnCHome}. We will also explore VMS's compatibility with different concurrent semantics \cite{Kah74,hoare78,milner99,Hewitt10,Actors97}.
One particularly important application would be to use VMS to facilitate the design and implementation of the emerging hybrid programming models, such as MPI+OpenMP or OpenMP+OpenCL \cite{Car10,OpenCLHome}.


\section{Abstract Definition of VMS}
\label{secAbsModel}

We start with an intuitive overview, then make the definitions and properties precise in the following sub-sections.

\paragraph*{Definitions:}

1) We want to avoid the confusion associated with the various interpretations of the terms ``thread'' and ``task'', so we will use the term \emph{Virtual Processor} (VP), which we define as state in combination with the ability to animate code or an additional level of virtual processors. The state consists of a program counter, a stack with its contents, a pointer to the top of the stack, and a pointer to the current stack frame. 2) A \emph{physical processor} executes a sequential stream of instructions. 3) A program-timeline is the sequence of instructions animated by a Slave VP, which is in turn animated by a physical processor.

\paragraph*{Intuitive Overview.}

VMS can be understood via an analogy with atomic instructions, such as Compare and Swap (CAS). These are used to establish an ordering among the timelines of cores. They consist of two parts: 1) the semantics of what is done to the memory location, and 2) a mechanism that establishes an ordering among the cores. For CAS, the semantics are: ``compare the value in this register to the value at the address, and if they are the same, then put the value in the second register into the address.'' Multiple kinds of atomic instructions share the same order-establishing mechanism; they simply provide different semantics as a front-end.

VMS can be viewed as virtualizing the order-establishing mechanism. It allows the semantics to be plugged in to it. This breaks concurrency constructs into two parts: the VMS mechanism, which establishes an ordering between events in different timelines; and the plugin, which supplies the semantics.

Below the interface, hardware mechanisms are employed to order specific points in one physical processor's timeline relative to specific points in another's timeline. Above the interface, a plugin provides the semantics that an application uses to invoke creation of the ordering.

Together, VMS plus the plugin form a parallelism construct, by which an application controls how the time-lines of its virtual processors relate. Such constructs also guarantee relations of VP time-lines to hardware events.

As an example, consider a program where one VP writes into a data structure then calls a \texttt{send} construct. Meanwhile, a different VP calls the \texttt{receive} construct then reads the data structure. The semantics of the \texttt{send} and \texttt{receive} constructs are that all data written before the \texttt{send} is readable in the other time-line after the \texttt{receive}. To implement these constructs, VMS provides the mechanism to enforce the ordering, and to include the writes and reads in that ordering. The plugin directs that mechanism to order the \texttt{send} event before the \texttt{receive} event.
%% AC. Too early.
% VMS provides a primitive guarantee that
% plugin code builds upon to provide such higher-level guarantees. The
% primitive guarantee is defined in Section~\ref{secTime}, after the
% necessary vocabulary is established.
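To preview how this looks in application code, here is a sketch written against the SSR calls whose signatures appear later, in Figure~\ref{figAnimVP} of Section~\ref{secApp}. The \texttt{results} buffer and the parameter structs used to pass the VP pointers around are simplifications made up for this illustration:

\begin{verbatim}
/* Writes made before the send are visible after the receive. */
typedef struct { VirtProcr *consumerVP; } ProducerParams; /* illustrative */
typedef struct { VirtProcr *producerVP; } ConsumerParams; /* illustrative */

int results[64];   /* shared data structure */

void producer_fn( void *params, VirtProcr *animatingVP )
 { VirtProcr *consumerVP = ((ProducerParams *)params)->consumerVP;
   for( int i = 0; i < 64; i++ ) results[i] = i * i;   /* before send */
   SSR__send_from_to( NULL, animatingVP, consumerVP ); /* tie point   */
 }

void consumer_fn( void *params, VirtProcr *animatingVP )
 { VirtProcr *producerVP = ((ConsumerParams *)params)->producerVP;
   SSR__receive_from_to( producerVP, animatingVP );    /* tie point   */
   int sum = 0;
   for( int i = 0; i < 64; i++ )
      sum += results[i];   /* all 64 writes are guaranteed visible */
 }
\end{verbatim}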
\paragraph{What the VMS interface provides:} The interface provides primitive operations to create and suspend VPs; a way for plugins to control when and where each VP is (re)started; a way for application code to send requests to the plugin; and a way to order a specific point in one VP time-line relative to a specific point in another VP time-line. All implementations of the VMS interface provide these, whether on shared memory or distributed, with strong memory consistency or weak.

\paragraph*{Specification in three parts.}

We specify the observable behavior of a VMS system \emph{with plugins present}. Hence, the specified behaviors remain valid with any parallelism construct implementable with VMS. First we give the specification of a computation system that VMS is compatible with; then we specify a notion of time and the key VMS guarantee; and lastly we specify virtual processor scheduling states and transitions between them.

%% AC. Too early, Master has not been defined. I renamed VMS-core into
%% VMS afterwards, finding no ambiguity.
% The Master mentioned in the definition is an abstract entity, with a
% plugin present. In practice, this Master entity is implemented as part
% of a core VMS, and plugins later added. This VMS-core is the hardware
% abstraction. It hides the physical hardware behind an interface that
% creates virtual processors and enforces well-defined time-behavior.

\subsection{The Specifications for a VMS-compatible Computation System}

\begin{itemize}
\item An application creates multiple VPs, which are Slaves, each with an independent time-line.
\item A schedule of Slaves is generated by a Master entity, from within one or more hidden time-lines.
\item A schedule is the set of physical locations and time-points at which Slaves are (re)animated.
\item All semantic parallelism behavior is invoked by Slaves communicating with the Master.
\item A Slave communicates with the Master by using a VMS primitive, which suspends the Slave.
\end{itemize}

\paragraph{Where we define:}
Semantic Parallelism Behavior is the set of actions taken by a parallelism construct, which establish an ordering among events in different Slave timelines.

\paragraph{Discussion:} The key point is that \emph{scheduling is separated from the application code}. This is enforced by the schedule being generated in a time-line hidden from the application. The rest of the requirements are consequences of that separation.

The Master appears as a single entity to the slaves. However, it may be implemented with multiple (hidden) timelines. This is the approach taken in our initial implementation, which has several Master VPs hidden inside the VMS implementation.

%% AC. Calling memory a processing element only adds confusion.
% {\bf Definitions:\ }VMS is intended only for hardware systems that
% consist of processing elements connected by communication. We define
% a memory-space to be a processing element, albeit without the ability
% to transform data.

\subsection{The Time-Related Specifications of VMS}
\label{secTime}

To prepare for the time-related specifications, we give an advance peek at the following section, 3.3. There, Slave VPs are specified to have three scheduling states: Animated, Blocked, and Ready. When a parallelism construct starts execution, the Slave transitions from Animated to Blocked. When it ends execution, the Slave transitions from Blocked to Ready.
VMS provides a way to control the order of these state-transitions, which is equivalent to controlling the order of the parallelism-constructs. Controlling the state transitions is how the ordering among constructs in different timelines is established.

With that background, here are the time-related specifications for VMS:
\begin{itemize}
\item VMS provides a Virtual timeline that globally orders changes of scheduling state of Slave VPs.
\item Ordering is created among construct-invocations by controlling the order of Blocked to Ready transitions in the Virtual timeline.
\item Causally related construct-invocations are tied together inside the Master.
\item VMS enforces ordering of \textit{observations} of physical events in Slave timelines to be consistent with the Virtual time ordering.
\item Virtual time defines only ordering, not spans nor widths.
\end{itemize}

\paragraph{Discussion:}
Most importantly here, Virtual time defines a global ordering among Slave state-transitions. To make this useful for parallelism, VMS must be implemented so that observations of physical events, like reads and writes to shared memory, are consistent with that ordering.

The Virtual timeline plays the same role as the mechanism added to memory systems to support atomic instructions. All atomic instructions require hardware that establishes an ordering among the timelines of physical cores. That hardware sequentializes execution of atomic memory accesses to the same address. VMS virtualizes this mechanism. It provides the same ordering function.

An important point is that the Virtual timeline is generated inside the Master. When a Slave uses the VMS primitive to send a parallelism-construct request, it suspends. However, that Slave doesn't actually transition state from Animated to Blocked until the Master \textit{acknowledges} the suspension. It is the acknowledgement that adds the Slave transition into the Virtual timeline.

The essential value of VMS is using it to control the order of observing events. It has to be able to causally link the execution of a parallelism construct in one timeline to the execution of a construct in a different timeline. Establishing such a causal link is called \textit{tying together} two construct executions. It is specific executions from different timelines that are causally linked with such a tie.

\paragraph*{The key VMS guarantee:}

The order of observing physical events is consistent with the order of tied-together parallelism constructs.

To explain this, take as given that two Slaves each execute a parallelism construct, and that these are tied together by the Master, establishing a causal ordering from one to the other. So, one construct is the \textit{before}-construct, the other is the \textit{after}-construct. The guarantee means that any events triggered before the before-construct, in its timeline, are guaranteed to be detected in the other timeline as also preceding the after-construct. In addition, events triggered after the after-construct are guaranteed not visible before the before-construct in its timeline. This two-part guarantee is the result of the above specifications of VMS's time-related behavior.

\paragraph*{Definitions:}

Some more definitions, to prepare for the next explanation. 1) An \textit{ordering-point} exists in a Slave VP's timeline as a zero-width event that can be tied to ordering points in other timelines.
It is initiated by a Slave VP executing the suspend primitive, and ended by the Master transitioning the Slave back to Animated. 2) A trace-segment is a portion of a Slave VP's stream of instructions bounded by ordering-points.

Hence, the timeline of a Slave is a sequence of trace-segments. Each trace-segment is animated by a single physical processor, but not necessarily the same one that animated the Slave's other trace-segments.

\begin{figure}[ht]
 \includegraphics[width=3.5in]{../figures/Time_in_VMS_2.eps}
 \caption
 {Time Behaviors: \ Shows Ordering Point 1.1 being tied to Ordering Point 2.1. As a result, VMS guarantees that events triggered in Trace Segment 1.1 are seen as having taken place in the past in Trace Segment 2.2. Also shows that there is no common tied ordering point between segments 1.2 and 2.2, so VMS provides no guarantees about the order in which segment 2.2 sees events triggered in segment 1.2.
 }
\label{figTimeMapping}
\end{figure}

\paragraph*{Relating time-lines to each other.}

Figure \ref{figTimeMapping} shows two ordering points being tied together. A trace-segment starts, and the preceding ordering-point ends, when its Slave transitions to Animated. Because the transition to Animated exists as a point in Virtual time, the start of a trace-segment has a well-defined position within Virtual time. Likewise, a trace-segment is ended by its Slave executing the suspend primitive of VMS. Although this does not have a well-defined point in Virtual time, every execution of suspend is acknowledged by the Master, which transitions the Slave to Blocked. That transition does have a well-defined position in Virtual time. Hence, the end of every trace-segment is associated with a well-defined position in Virtual time.

As a result, trace-segments can be ordered relative to each other, by checking their start and end points in Virtual time. If they have no overlap in Virtual time, then they have a total ordering. However, if any portion of them overlaps in Virtual time, then they are considered concurrent trace-segments, and their Slaves are considered to be executing in parallel between those points of Virtual time.

Note that this is conservative, because it doesn't take into account the physical wait time between a Slave suspending and the Master acknowledging. The Slave may stop executing at a physical time-point that would map onto an earlier point in Virtual time. In some cases, ending the Slave's trace-segment at the earlier point would eliminate the overlap with a particular other trace-segment. But VMS's set of specifications doesn't allow such mapping of suspend-execution onto Virtual time, for implementation-related reasons that are best understood by downloading the code and gaining experience with it.

A subtlety is that events triggered before one tied ordering-point {\em{might}} be visible in the other timeline before the other tied ordering-point. In the figure, segment 2.1 might be able to see events from segment 1.1 if it looked. The VMS guarantee doesn't cover overlapped trace-segments. Physical events triggered before are only guaranteed visible {\em{after}} the tie point, and events after are only guaranteed {\em{not}} visible {\em{before}} the tie point.

We call this bounded non-determinism, because events within overlapped trace-segments have non-deterministic ordering, but the region of non-determinism can be bounded by tied ordering-points.
This allows a program to specify non-determinism, but control the region of non-deterministic behavior. For example, a reduction construct could be created that non-deterministically assigns portions of the reduction work to overlapped Slave segments. It would tie together ordering points from all the Slaves that mark the end of the reduction. Hence, the outcome is deterministic, but the path to get there is not.

\paragraph*{Sequential algorithms for parallel constructs.}
The globally-consistent sequential order in Virtual time enables one of VMS's main benefits: sequential algorithms for parallel constructs. An implementation that ties ordering points together amounts to an implementation of parallel constructs. A plugin has an ordering of state transitions available, and chooses from those. Sequential algorithms rely on an ordering existing, while concurrent algorithms must include operations that establish an ordering. Plugins have the Virtual time ordering available, so they can use sequential algorithms.

\subsection{Specification of Scheduling State}

Scheduling state is used in VMS to organize internal activity, for enforcing the guarantees.

\begin{itemize}
\item VPs have three scheduling states: {\em{Animated}}, {\em{Blocked}}, {\em{Ready}}; see Figure~\ref{figStates}.
\item VPs in Animated are {\em{allowed}} to advance Program time with (core-local) physical time.
\item VPs in Blocked and Ready do not advance their Program time.
\item Animated has two physical states: {\em{Progressing}} and {\em{Stalled}}.
\item VPs in Progressing advance Program time with (core-local) physical time; those in Stalled do not (allowing non-semantic suspension for hardware interrupts).
\item Scheduling states are defined in Virtual time only.
\item Progressing and Stalled are defined in (core-local) physical time only; the distinction is invisible in Virtual time.
\end{itemize}

\begin{figure}[h!tb]
\begin{minipage}{.55\textwidth}
 \includegraphics[width=\textwidth]{../figures/Scheduling_states_2.eps}
 \caption{Scheduling states of a slave VP in the VMS model. Animated, Blocked, and Ready are only defined in Virtual time and only visible in the Master. Progressing and Stalled are only visible in physical-processor local time, not visible in Virtual time.}
 \label{figStates}
\end{minipage}
\hfill
\begin{minipage}{.43\textwidth}
 \includegraphics[width=\textwidth]{../figures/VMS-core__plugins.eps}
 \caption
 {
 The Master, split into a generic core and a language-specific plugin.
 The core encapsulates the hardware and remains the same across applications.
 The plugin implements the semantics of the parallelism-constructs.
 }
 \label{figMasterSplit}
\end{minipage}
\end{figure}

Some important points: (1) only VPs in Animated can trigger physical events that are seen in other program time-lines; (2) the distinction between Blocked and Stalled is that a Slave has to explicitly execute a VMS primitive operation to enter Blocked. In contrast, Stalled happens invisibly, with no effect on semantic behavior. It is due to hardware events hidden inside VMS, such as interrupts.

The Ready state is used to separate the parallelism-construct behavior from the scheduling behavior. It acts as a ``staging area'' for scheduling. VPs placed into this state are {\em{ready}} to be animated, but the scheduler decides when and where.

An interesting point is that in VMS, the causal tie between timelines is created by actions {\em{outside}} program timelines.
In contrast, memory-based lock algorithms place the concurrency-related behavior {\em{inside}} program timelines.

\paragraph*{Transition Between Slave Scheduling States.}

\begin{itemize}
\item VPs transition states as shown in Figure \ref{figStates}.
\item Animated$\rightarrow$Blocked is requested by a Slave executing suspend, but takes place in Virtual time at the point the Master acknowledges that request.
\item Blocked$\rightarrow$Ready is determined by the semantics implemented in the plugin.
\item Ready$\rightarrow$Animated is determined by the scheduler in the plugin.
\item Transitions in scheduling state have a globally consistent order in Virtual time.
\end{itemize}

The parallelism primitives executed by a program do not directly control change in scheduling states. Rather, they communicate messages to the Master, via a VMS-supplied primitive. If the Slave suspended when sending the request, then the act of the Master acknowledging the request places the Animated$\rightarrow$Blocked transition into Virtual time. Inside the Master, the plugin then processes the message. Based on the contents, it performs changes in state from Blocked$\rightarrow$Ready, creates new VPs, and dissipates existing VPs. Most communication from Slave to Master requires the Slave to suspend when it sends the message. A few messages, like creating a new Slave, may be sent without suspending.

The suspend primitive decouples local physical time from Virtual time. Execution causes an immediate transition to Stalled in physical time; later, the Master performs Animated$\rightarrow$Blocked, fixing that transition in Virtual time. The only relationship is causality. This weak relation is what allows suspension-points to be serialized in Virtual time, which in turn is what allows using sequential algorithms to implement parallelism constructs.

\subsection{Plugins}

The Master entity has two parts, a generic core part and a plugin (Figure \ref{figMasterSplit}). The core part of the Master is implemented as part of VMS itself. The plugin supplies two functions: the communication-handler and the scheduler, both having a standard prototype. The communication-handler implements the parallelism constructs, while the scheduler assigns VPs to cores.

An \emph{instance} of a plugin is created as part of initializing an application, and the instance holds the semantic and scheduling state for that run of the application. This state, combined with the virtual processor states of the slaves created during that application run, represents the progress of the work of the application. For example, multi-tasking is performed simply by the Master switching among plugin instances when it has a resource to offer to a scheduler. The parallelism-semantic state holds all information needed to resume (hardware state, such as TLB and cache-tags, is inside VMS).

\section{Internal Workings of Our Implementation}
\label{secInternal}

For our example implementation, we name the elements and describe their logical function, then relate those to the abstract model. We then step through the operation of the elements.

\paragraph*{Elements and their logical function.}

As illustrated in Figure~\ref{figInternals}, our VMS implementation is organized around physical cores. Each core has its own {\em{master virtual-processor}}, \texttt{masterVP}, and a {\em{physical-core controller}}, which communicate via a set of scheduling slots, \texttt{schedSlot}.
The Master in the abstract definition is implemented by the multiple \texttt{masterVP}s plus a particular plugin instance with its shared parallelism-semantic state (seen at the top).

On a given core, only one of the core-controller, the \texttt{masterVP}, or a slave VP is animated at any point in local physical time. Each \texttt{masterVP} animates the same function, called \texttt{master\_loop}, and each slave VP animates a function from the application, starting with the top-level function the slave is created with, and following its call sequence. The core controller is implemented here as a Linux pthread that runs the \texttt{core\_loop} function.

Switching between VPs is done by executing a VMS primitive that suspends the VP. This switches the physical core over to the controller, by jumping to the start of the \texttt{core\_loop} function, which chooses the next VP and switches to that (switching is detailed in Section \ref{secApp}, Figure \ref{figAssembly}).

\paragraph*{Relation to abstract model.}

We chose to implement the Master entity of the model by a set of \texttt{masterVP}s, plus plugin functions and shared parallelism-semantic state. VMS consists of this implementation of the Master, plus the core-controllers, plus the VMS primitive libraries, for creating new VPs and dissipating existing VPs, suspending VPs, and communicating from slave VP to Master. In Figure~\ref{figInternals}, everything in green is part of VMS, while the plugin is in red, and application code appears in blue, inside the slave VP.

Virtual time in the model is implemented via a combination of four things: a \texttt{masterLock} (not shown) that guarantees non-overlap of \texttt{masterVP} trace-segments; the \texttt{master\_loop}, which performs the transition Animated$\rightarrow$Blocked; the \texttt{comm\_handler\_fn}, which performs Blocked$\rightarrow$Ready; and the \texttt{scheduler\_fn}, which performs Ready$\rightarrow$Animated. Each state transition is one step of Virtual time; is guaranteed sequential by the non-overlap of \texttt{masterVP} trace-segments; and is global due to being in parallelism-semantic state that is shared (top of Figure~\ref{figInternals}).

Transitions Progressing$\rightleftarrows$Stalled within the Animated state are invisible to the parallelism semantics, the Master, and Virtual time, and so have no effect on the elements seen.

\begin{figure*}[h!tb]
 \centerline{\includegraphics[width=5in]{../figures/VMS-core__internal_workings.eps}}
 \caption
 { Internal elements of our example VMS implementation.
 }
 \label{figInternals}
\end{figure*}

\paragraph*{Steps of operation.}

The steps of operation are numbered in Figure \ref{figInternals}. Taking them in order:

\begin{enumerate}
\item \texttt{master\_loop} scans the scheduling slots to see which ones' slaves have suspended since the previous scan.
\item It hands these to the \texttt{comm\_handler\_fn} plugged in (this is the transition Animated$\rightarrow$Blocked).
\item The VP has a request attached, and data in it causes the \texttt{comm\_handler\_fn} to manipulate data structures in the shared parallelism-semantic state. These structures hold all the slaves in the blocked state (code-level detail and an example will come in Figure~\ref{figReqHdlr}, Section~\ref{secApp}).
\item Some requests cause slaves to be moved to a \texttt{readyQ} on one of the cores (Blocked$\rightarrow$Ready).
Which core's \texttt{readyQ} receives the slave is under plugin control, determined by a combination of request contents, semantic state, and physical machine state.
\item During the scan, the \texttt{master\_loop} also looks for empty slots, and for each calls the \texttt{scheduler\_fn} plugged in. It chooses a slave from the \texttt{readyQ} on the core animating \texttt{master\_loop}.
\item The \texttt{master\_loop} then places the slave VP's pointer into the scheduling slot (Ready$\rightarrow$Animated), making it available to the \texttt{core\_loop}.
\item When done with the scan, the \texttt{masterVP} suspends, switching animation back to the \texttt{core\_loop}.
\item \texttt{core\_loop} takes slave VPs out of the slots.
\item Then \texttt{core\_loop} switches animation to these slave VPs.
\item When a slave self-suspends, animation returns to the \texttt{core\_loop} (detail in the code in Figure \ref{figAssembly}), which picks another.
\item This repeats until all slots are empty, at which point the \texttt{core\_loop} switches animation to the \texttt{masterVP}.
\end{enumerate}

\paragraph*{Enabling sequential implementation of parallelism semantics.}

All these steps happen on each core separately, but
%% AC. Cannot be both an essential property of VMS and "in this particular..."
% in this particular implementation
we use a central \texttt{masterLock} to ensure that only one core's \texttt{masterVP} can be active at any time. This guarantees non-overlap of trace-segments from different \texttt{masterVP}s, allowing the plugins to use sequential algorithms, without a performance penalty, as verified in Section~\ref{secResults}.

Relating this to the abstract model: the parallelism-semantic behavior of the Master is implemented by the communication handler, in the plugin. It thus runs in the Master time referred to in the model, in Section \ref{secAbsModel}. Requests are sent to the Master by self-suspension of the slaves, but sit idle until the other slaves in the scheduling slots have also run. This is the passive behavior of requests that was noted in Section~\ref{secAbsModel}, which allows the \texttt{masterVP}s to remain suspended until needed. This in turn enables the \texttt{masterVP}s from different cores to be non-overlapped. It is the non-overlap that enables the algorithms for the parallelism semantics to be sequential.

\section{Code Example}
\label{secApp}

To relate the abstract model and the internal elements to application code and parallelism-library code, we give code snippets that illustrate key features. We start with the application, then work down through the sequence of calls to the plugin, using our SSR {\cite{VMSHome}} parallelism-library as an example.

In general, applications are either written in terms of a parallel language that has its own syntax, or a base language with a parallelism library, which is often called an {\em{embedded language}}. Our demonstrators, VCilk {\cite{VMSHome}}, Vthread, and SSR, are all parallelism libraries. A parallel language would follow the standard practice of performing a source-to-source transform, from custom syntax into C plus parallelism-library calls.

\paragraph*{SSR.}

SSR stands for Synchronous Send-Receive, and details of its calls and internal implementation will be given throughout this section. It has two types of construct.
The first, called {\em{from-to}} has two -calls: \texttt{SSR\_send\_from\_to} and -\texttt{SSR\_receive\_from\_to}, both of which specify the sending VP -as well as the receiving VP. \ The other, called {\em{of-type}} also -has two calls: \texttt{SSR\_\_send\_of\_type\_to} and -\texttt{SSR\_\_receive\_of\_type}, which allow a receiver to accept -from anonymous senders, but select according to type of message. - -% An example of a double column floating figure using two subfigures. -% (The subfig.sty package must be loaded for this to work.) -% The subfigure \label commands are set within each subfloat command, the -% \label for the overall figure must come after \caption. -% \hfil must be used as a separator to get equal spacing. -% The subfigure.sty package works much the same way, except \subfigure is -% used instead of \subfloat. -% -%\begin{figure*}[!t] -%\centerline{\subfloat[Case I]\includegraphics[width=2.5in]{subfigcase1}% -%\label{fig_first_case}} -%\hfil -%\subfloat[Case II]{\includegraphics[width=2.5in]{subfigcase2}% -%\label{fig_second_case}}} -%\caption{Simulation results} -%\label{fig_sim} -%\end{figure*} -% -% Note that often IEEE papers with subfigures do not employ subfigure -% captions (using the optional argument to \subfloat), but instead will -% reference/describe all of them (a), (b), etc., within the main caption. - -\paragraph*{Application view.} - -Figure~\ref{figAnimVP} shows snippets of application code, which use -the SSR parallelism library. The most important feature is that all -calls take a pointer to the VP that is animating the call. This is -seen at the top of the figure where slave VP creation takes a pointer -to the VP asking for creation. Below that is the standard prototype -for top level functions, showing that the function receives a pointer -to the VP it is the top level function for. - -The pointer is placed on the stack by VMS when it creates the VP, and -is the means by which the application comes into possession of the -pointer. This animating VP is passed to all library calls made from -there. For example, the bottom shows a pointer to the animating VP -placed in the position of sender in the \texttt{send} construct call. -Correspondingly, for the \texttt{receive} construct, the position of -receiving VP is filled by the VP animating the call. - -\begin{figure}[h!tb] -{\noindent -{\footnotesize -{\small Creating a new processor:} -\begin{verbatim} -newProcessor = SSR__create_procr( &top_VP_fn, paramsPtr, animatingVP ); -\end{verbatim} - -{\small prototype for the top level function:} -\begin{verbatim} -top_VP_fn( void *parameterStrucPtr, VirtProcr *animatingVP ); -\end{verbatim} - -{\small handing animating VP to parallelism constructs:} -\begin{verbatim} -SSR__send_from_to( messagePtr, animatingVP, receivingVP ); -messagePtr = SSR__receive_from_to( sendingVP, animatingVP ); -\end{verbatim} -} -} -\caption -{Application code snippets showing that all calls to the parallelism library -take the VP animating that call as a parameter. -} -\label{figAnimVP} -\end{figure} - -Relating these to the internal elements of our implementation, the -\texttt{animatingVP} suspends inside each of these calls, passing a -request (generated in the library) to one of the \texttt{masterVP}s. -The \texttt{masterVP} then calls the \texttt{comm-handler} plugin, and -so on, as described in Section \ref{secInternal}. - -For the \texttt{SSR\_\_create\_processor} call, the comm-handler in -turn calls a VMS primitive to perform the creation. 
The primitive -places a pointer to the newly created VP onto its stack, so that when -\texttt{top\_VP\_fn} is later animated, it sees the VP-pointer as a -parameter passed to it. \ All application code is either such a -top-level function, or has one at the root of the call-stack. - -The send and receive calls both suspend their animating VP. When both -have been called, the communication handler pairs them up and resumes -both. This ties time-lines together, invoking the VMS guarantee. Both -application-functions know, because of the VMS guarantee -(Section~\ref{secAbsModel}), that writes to shared variables made -before the send call by the sender are visible to the receiver after -the receive call. This is the programmer's view of tying together the -local time-lines of two different VPs, as defined in -Section~\ref{secAbsModel}. - -\paragraph*{Concurrency-library view.} - -A parallelism library is a wrapper. Each call, in general, only creates a request, -sends it, and returns, as seen below. To send a request, it uses the -combined request-and-suspend VMS primitive that attaches the request -then suspends the VP. The primitive requires the pointer to the VP, -to attach the request and to suspend it. - -In Figure~\ref{figImplLib}, notice that the request's data is on the -stack of the virtual processor that's animating the call, which is the -\texttt{receiveVP}. The \texttt{VMS\_\_send\_sem\_request} suspends -this VP, which changes the physical core's stack pointer to a -different stack. So the request data is guaranteed to remain -undisturbed while the VP is suspended. - -Figure~\ref{figAssembly} shows the implementation of the VMS suspend -primitive. As seen in Figure \ref{figInternals}, suspending the -\texttt{receiveVP} involves switching to the \texttt{core\_loop}. In -our implementation, this is done by switching to the stack of the -pthread pinned to the physical core and then jumping to the -start-point of \texttt{core\_loop}. - -This code uses standard techniques commonly employed in co-routine -implementations. Tuning effort spent in \texttt{core\_loop} is -inherited by all applications. - -\begin{figure}[h!tb] -{\noindent -{\footnotesize -\begin{verbatim} -void * SSR__receive_from_to( VirtProcr *sendVP, VirtProcr *receiveVP ) - { SSRSemReq reqData; - reqData.receiveVP = receiveVP; - reqData.sendVP = sendVP; - reqData.reqType = receive_from_to; - VMS__send_sem_request( &reqData, receiveVP ); - return receiveVP->dataReturnedFromRequest; - } -\end{verbatim} -} -} -\caption{Implementation of SSR's receive\_from\_to library function.} -\label{figImplLib} - -{\noindent -{\footnotesize -\begin{verbatim} -VMS__suspend_procr( VirtProcr *animatingVP ) - { animatingVP->resumeInstrAddr = &&ResumePt; //GCC takes addr of label - animatingVP->schedSlotAssignedTo->isNewlySuspended = TRUE; //for master_loop to see - - - - ResumePt: - return; - } -\end{verbatim} -} -} - \caption -{Implementation of VMS suspend processor. -Re-animating the virtual processor reverses this sequence. \ It saves the -\texttt{core\_loop}'s resume instr-addr and stack ptr into the VP structure, -then loads the VP's stack ptr and jmps to its \texttt{resumeInstrAddr}. 
-} -\label{figAssembly} - -{\noindent -{\footnotesize -\begin{verbatim} -handle_receive_from_to( VirtProcr *requestingVP, SSRSemReq *reqData, SSRSemEnv *semEnv ) - { commHashTbl = semEnv->communicatingVPHashTable; - key[0] = reqData->receiveVP; key[1] = reqData->sendVP; //send uses same key - waitingReqData = lookup_and_remove( key, commHashTbl ); //get waiting request - if( waitingReqData != NULL ) - { resume_virt_procr( waitingReqData->sendVP ); - resume_virt_procr( waitingReqData->receiveVP ); - } - else - insert( key, reqData, commHashTbl ); //receive is first to arrive, make it wait - } -\end{verbatim} -} -} - \caption -{Pseudo-code of communication-handler for -\texttt{receive\_from\_to} request type. The \texttt{semEnv} is a pointer -to the shared parallel semantic state seen at the top of -Figure~\ref{figInternals}. -} -\label{figReqHdlr} -\end{figure} - -\paragraph*{Plugin view.} - -SSR's communication handler dispatches on the \texttt{reqType} field -of the request data, as set by the \texttt{SSR\_\_receive\_from\_to} -code. It calls the handler code in Figure \ref{figReqHdlr}. This -constructs a hash-key, by concatenating the from-VP's pointer with the -to-VP's pointer. Then it looks-up that key in the hash-table that SSR -uses to match sends with receives, which is in the shared semantic -state seen at the top of Figure \ref{figInternals} in -Section~\ref{secInternal}. - -The most important feature in Figure~\ref{figReqHdlr} is that both -send and receive will construct the same key, so will find the same -hash entry. Whichever request is handled first in Virtual time will -see the hash entry empty, and save itself in that entry. The second -to arrive sees the waiting request and then resumes both VPs, by -putting them into their \texttt{readyQ}s. - -Access to the shared hash table can be considered private, as in a -sequential algorithm. This is because our VMS-core implementation -ensures that only one handler on one core is executing at a time. - -\section{Results} -\label{secResults} - -We implemented blocked dense matrix multiplication with right -sub-matrices copied to transposed form, and ran it on a 4-core -Core2Quad 2.4Ghz processor. - -\paragraph*{Implementation-time.} - -As shown in Table~\ref{tabPersonDaysLang}, time to implement the three -parallel libraries averages 2 days each. As an example of -productivity, adding nested transactions, parallel singleton, and -atomic function-execution to SSR required a single afternoon, totaling -less than 100 lines of C code. - -\paragraph*{Execution Performance.} - -Performance of VMS is seen in Table~\ref{tabOverheadCycles}. The code -is not optimized, but rather written to be easy to understand and -modify. The majority of the plugin time is lost to cache misses -because the shared parallelism-semantic state moves between cores on a -majority of accesses. Acquisition of the master lock is slow due to -the hardware implementing the CAS instruction. - -Existing techniques will likely improve performance, such as -localizing semantic data to cores, splitting malloc across the cores, -pre-allocating slabs that are recycled, and pre-fetching. However, in -many cases, several hundred nano-seconds per task is as optimal as the -applications can benefit from. - -\begin{table}[h!tb] -\begin{minipage}{.35\textwidth} -\begin{tabular}{|l|l|l|l|} - \cline{2-4} - \multicolumn{1}{l|}{} & SSR & Vthread & VCilk\\ - \cline{2-4} - \noalign{\vskip2pt} - \hline - Design & 4 & 1 & 0.5\\ - Code & 2 & 0.5 & 0.5\\ - Test & 1 & 0.5 & 0.5\\ - L.O.C. 
& 470 & 290 & 310\\ - \hline -\end{tabular} -\caption -{Person-days to design, code, and - test each parallelism library. L.O.C. is lines of (original) C code, - excluding libraries and comments. -} -\label{tabPersonDaysLang} -\end{minipage} -\hspace{-.05\textwidth} -\begin{minipage}[t]{.7\textwidth} -\hfill -\begin{tabular}{|l|l|r|r|} -\cline{3-4} -\multicolumn{2}{l|}{} & comp only & comp + mem\\ -\cline{3-4} -\noalign{\vskip2pt} -\hline -VMS Only & \texttt{master\_loop} & 91 & 110\\ -& switch VPs & 77 & 130\\ -& (malloc) & 160 & 2300\\ -& (create VP) & 540 & 3800\\ -\hline -Language: & & & \\ -\quad SSR & plugin -- concur & 190 & 540\\ -& plugin -- all & 530 & 2200\\ -& lock & & 250\\ -\quad Vthread & plugin -- concur & 66 & 710\\ -& plugin -- all & 180 & 1500\\ -& lock & & 250\\ -\quad VCilk & plugin -- concur & 65 & 260\\ -& plugin -- all & 330 & 1800\\ -& lock & & 250\\ -\hline -\end{tabular} -\caption -{Cycles of overhead, per scheduled - slave. ``comp only'' is perfect memory, ``comp + mem'' is actual cycles. - ``Plugin-concur'' only concurrency requests, ``plugin-all'' includes - create and malloc requests. Two significant digits due to variability. -} -\label{tabOverheadCycles} -\end{minipage} - -\vskip-1.5cm -\begin{tabular}{|l|lr|} -\hline - Matrix size&Lang.&sec.\\ -[2pt]\hline - $81\times81$ & Cilk & 0.017\\ - & VCilk & 0.008\\ -\hline - $324\times324$ & Cilk & 0.13\\ - & VCilk & 0.13\\ -\hline - $648\times648$ & Cilk & 0.71\\ - & VCilk & 0.85\\ -\hline - $1296\times1296$ & Cilk & 4.8\\ - & VCilk & 6.2 \\ -[2pt]\hline -\end{tabular} -\hfill -\begin{tabular}{|l|r|r|r|r|} -\multicolumn{5}{l}{}\\ -\multicolumn{5}{l}{}\\ -\multicolumn{5}{l}{}\\ -\hline - operation & \multicolumn{2}{c|}{Vthread} & pthread & ratio\\ - & comp only & total & & \\ -[2pt]\hline - mutex\_lock & 85 & 1050 & 50,000 & 48:1\\ - mutex\_unlock & 85 & 610 & 45,000 & 74:1\\ - cond\_wait & 85 & 850 & 60,000 & 71:1\\ - cond\_signal & 90 & 650 & 60,000 & 92:1\\ -\hline -\end{tabular} -\caption -{On left, exe time in seconds for MM. To the right, overhead for - pthread vs. Vthread. First column is cycles - for perfect memory and second is total measured cycles. pthread cycles are - deduced from round-trip experiments. -} -\label{tabHeadToHead} -\end{table} - -\paragraph*{Head to head.} - -We compare our implementation of the \texttt{spawn} and \texttt{sync} -constructs against Cilk 5.4, on the top in Table~\ref{tabHeadToHead}, -which shows that the same application code has similar -performance. For large matrices, Cilk 5.4's better use of the memory -hierarchy (the workstealing algorithm) achieves 23\% better -performance. However, for small matrices, VCilk is better, with a -factor 2 lower overhead. Cilk 5.4 does not allow controlling the -number of spawn events it actually executes, and chooses to run -smaller matrices sequentially, limiting our comparison. - -When comparing to pthreads, our VMS based implementation has more than -an order of magnitude better overhead per invocation of mutex or -condition variable functionality, as seen on the bottom of Table -\ref{tabHeadToHead}. Applications that inherently have short trace -segments will synchronize often and benefit the most from Vthread. - -\section{Conclusion} -\label{secConclusion} - -We have shown an alternative to the thread model that enables -easier-to-use parallelism constructs by splitting the scheduler open, -to accept new parallelism constructs in the form of plugins. 
This -gives the language control over assigning virtual processors to -physical cores, for performance, debugging, and flexibility -benefits. Parallelism constructs of programming languages can be -implemented using sequential algorithms, within a matter of days, -while maintaining low run-time overhead. - \bibliography{../helpers/bib_for_papers} \end{document} diff -r 8cb89eed66b4 -r d6450ce874a9 0__Papers/PR/PR__Toolkit__FORTH_present__driven/figures/PR__system_component_breakdown.pdf Binary file 0__Papers/PR/PR__Toolkit__FORTH_present__driven/figures/PR__system_component_breakdown.pdf has changed diff -r 8cb89eed66b4 -r d6450ce874a9 0__Papers/PR/PR__Toolkit__FORTH_present__driven/figures/PR__system_component_breakdown.svg --- a/0__Papers/PR/PR__Toolkit__FORTH_present__driven/figures/PR__system_component_breakdown.svg Tue Dec 31 10:49:41 2013 -0800 +++ b/0__Papers/PR/PR__Toolkit__FORTH_present__driven/figures/PR__system_component_breakdown.svg Thu Jan 30 05:59:54 2014 -0800 @@ -78,8 +78,8 @@ inkscape:pageopacity="0.0" inkscape:pageshadow="2" inkscape:zoom="0.82" - inkscape:cx="317.92683" - inkscape:cy="565.89538" + inkscape:cx="127.68293" + inkscape:cy="560.60014" inkscape:document-units="px" inkscape:current-layer="layer1" showgrid="false" @@ -106,7 +106,7 @@ id="layer1"> + transform="translate(-58,40)"> module for platform 5) + transform="translate(-98,0)"> module for platform 3) + transform="translate(-138,-40)"> + transform="translate(70,6)"> + transform="translate(-118,180)"> WL and plugin for HW class 2) Language Library + id="tspan5593">Language A Library WL and plugin for HW class 1) WL and plugin for HW class 2) + id="g6693" + transform="translate(62,0)"> Language Library + style="font-size:20px;text-align:center;text-anchor:middle">Language B Library WL and plugin for HW class 1) - - - This machine is an instance ofplatform 3,which is amember ofHW Class 1 + id="tspan6835">HW Class 2 + + + diff -r 8cb89eed66b4 -r d6450ce874a9 0__Papers/PR/PR__Toolkit__FORTH_present__driven/latex/PRT__toolkit_FORTH_present_driven.pdf Binary file 0__Papers/PR/PR__Toolkit__FORTH_present__driven/latex/PRT__toolkit_FORTH_present_driven.pdf has changed diff -r 8cb89eed66b4 -r d6450ce874a9 0__Papers/PR/PR__Toolkit__FORTH_present__driven/latex/PRT__toolkit_FORTH_present_driven.tex --- a/0__Papers/PR/PR__Toolkit__FORTH_present__driven/latex/PRT__toolkit_FORTH_present_driven.tex Tue Dec 31 10:49:41 2013 -0800 +++ b/0__Papers/PR/PR__Toolkit__FORTH_present__driven/latex/PRT__toolkit_FORTH_present_driven.tex Thu Jan 30 05:59:54 2014 -0800 @@ -280,7 +280,11 @@ -We propose defining a modularization of runtimes that consists of three modules, as seen in Fig. \ref{fig:PR_three_pieces}. One module, called the \textit{proto-runtime core controller,} encapsulates the low-level hardware details involved with protecting constraint state and communicating it between CPU cores. It presents a generic interface that simplifies the language specific modules. The language implementor then supplies the top two modules. One implements the semantics of parallel constructs, and sees the constraint state as if it had exclusive access, allowing the implementation to be written as if it were sequential code. The other module embodies an algorithm for choosing which work to assign to which hardware resource. It is handed free cores by the proto-runtime and chooses which work to execute on each. These modules plug in via the interface. +This and the next two sections fill in the details of our proposal. 
We ask the reader for their patience, due to the circular nature of the concepts. We break the circle by iterating: we repeat descriptions multiple times, focusing on different aspects and inserting additional mental connections each time. Things that seem abstract and vague at first mention will be clarified once enough structure has been introduced to provide a context within which to clarify them.
+When frustrated by unclear words, the reader may trust that clarification will come further along, once the circular dependencies within the explanation have been satisfied.
+
+We propose defining a modularization of parallel runtimes that consists of three modules, as seen in Fig. \ref{fig:PR_three_pieces}. One module, called the \textit{proto-runtime core controller,} encapsulates the low-level hardware details involved with protecting constraint state, communicating it between CPU cores, and triggering the processing of the consequences of a change. It presents a generic interface that simplifies the language specific modules, which are seen at the top. The language implementor then supplies the top two modules. One implements the semantics of parallel constructs, and sees the constraint state as if it had exclusive access, allowing the implementation to be written as if it were sequential code (as explained further later). The other module embodies an algorithm for choosing which work to assign to which hardware resource. It is handed free cores by the proto-runtime and chooses which work to execute on each (also expanded on later). These modules plug in via the interface.
+

 The interface has a standard, basic, portion that remains constant, no matter which hardware platform it is implemented on top of. Extensions are added, which expose generic, simplified, aspects of hardware that the language needs to be aware of. An extension is added for distributed memory, and performance related specializations are added for certain classes of hardware. Tuning related to low level hardware variations is still hidden below the interface. On a given hardware target, all languages reuse the given implementation of the interface.

@@ -301,7 +305,7 @@
 \begin{figure}[ht]
     \centering
-    \includegraphics[width = 3.5in]{../figures/PR__system_component_breakdown.pdf}
+    \includegraphics[width = 3.3in]{../figures/PR__system_component_breakdown.pdf}
     \caption{The components that are collected under the title `proto-runtime eco system' }
     \label{fig:PR_system_components}
 \end{figure}

@@ -310,17 +314,21 @@

-To give a high level view, and to get our terminology straight, Figure \ref{fig:PR_system_components} shows all things related to the words `proto-runtime'. We call this collection the \textit{proto-runtime eco system}. It includes a formal model called the \textit{tie-point mode}l that provides the concepts used to elegantly break a runtime system into modules. Additionally, we see a toolkit, which is the collection of header files and source code that is made use of by a language implementor during development of the modules for their runtime system, and also by application developers. Next, we see many versions of the proto-runtime core controller, one for each hardware platform, each distributed as a dynamic library. Finally, we see that each language supplies a small number of versions of their runtime system implementations, one for each group of hardware that it chooses to take advantage of the performance extension. Each version is distributed as a dynamic library.
+To give a high level view, and to get our terminology straight, Figure \ref{fig:PR_system_components} shows all things related to the words `proto-runtime'. We call this collection the \textit{proto-runtime eco system}. It includes a formal model, called the \textit{tie-point model}, that provides the concepts used to elegantly break a parallel runtime system into modules. Additionally, we see a toolkit, which is the collection of header files and source code that is made use of by a language implementor during development of the modules for their runtime system, and also by application developers. Next, we see many versions of the proto-runtime core controller, one for each hardware platform, each distributed as a dynamic library. Finally, we see that each language supplies a small number of versions of its runtime system implementation, one for each group of hardware for which it chooses to take advantage of the performance extension. Each version is distributed as a dynamic library.

At the left, we see a given machine. This depicts that a particular implementation of the core controller library is installed onto the machine, as is a particular dynamic library for each language. Later, application executables will be installed, and will dynamically link to whatever core controller and language libraries they find. This is how an executable can be compiled in a generic way, then during a run automatically link up to the highly tuned runtime system for the hardware it happens to be executing on.

-Not depicted is the active system that exists during a run, called the \textit{proto-runtime system.} The proto-runtime system can be thought of as an independent operating system that runs in user space. It asks the host OS to give it control over the hardware, then internally makes its own equivalent of threads and manages suspending and resuming those, and distributing them among the cores. The proto-runtime system is started by the application, by making a call, which is implemented inside the proto-runtime toolkit. Once the system has started, a proto-process can be started inside it, which executes application code that is written in terms of proto-runtime implemented languages.
+Not depicted is the active system that exists during a run, called the \textit{proto-runtime system.} The proto-runtime system can be thought of as an independent operating system that runs in user space. It asks the host OS to give it control over the hardware, then internally makes its own equivalent of threads and manages suspending and resuming those, and distributing them among the cores.
+
+The proto-runtime system consists of internal bookkeeping state plus instances of the core controller, and whatever native OS objects are required to gain control. For a coherent shared memory multicore machine running Linux, there is an OS thread created for each core and pinned to it. A core controller instance runs in each thread, but never uses any OS thread constructs. Rather, it makes its own internal equivalent of threads and swaps out the OS thread context, as explained later.
+
+The system is started by the application, by making a call, which is implemented inside the proto-runtime toolkit. Once the system has started, a proto-process can be started inside the system, which executes application code that is written in terms of proto-runtime implemented languages.
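+As a concrete illustration of the per-core OS objects just described, the sketch below shows how one controller thread per core could be created and pinned on Linux. The names here are ours for illustration (\texttt{core\_controller\_loop} stands in for the controller's real entry point), and error handling is omitted:
\begin{verbatim}
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

extern void *core_controller_loop( void *coreNumAsPtr ); //controller entry

void create_pinned_controller_threads( void )
 { long numCores = sysconf( _SC_NPROCESSORS_ONLN );
   for( long i = 0; i < numCores; i++ )
    { pthread_t      thd;
      pthread_attr_t attr;
      cpu_set_t      cpuSet;
      CPU_ZERO( &cpuSet );
      CPU_SET( (int)i, &cpuSet );  //pin controller thread i to core i
      pthread_attr_init( &attr );
      pthread_attr_setaffinity_np( &attr, sizeof(cpuSet), &cpuSet );
      pthread_create( &thd, &attr, &core_controller_loop, (void *)i );
    }
 }
\end{verbatim}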
-A valuable benefit of this approach is that all language runtimes plug into the same core controller instances (which are created when the proto-runtime system is started) so they have a common element to their runtime systems. This fact has been exploited, allowing multiple languages to automatically inter-operate and their constructs to be intimately intermixed within application code without penalty. This proves especially valuable for parallel libraries, allowing applications to call library functions written in different languages, without the language's runtime systems fighting over the hardware resources and so hurting performance. - -Another benefit is that debugging facilities are included in the core controller and thereby inherited into the language. For example, the record and replay feature allows the chance to capture the scheduling choices that led to erroneous results and replay them, while single stepping and inspecting memory contents. This proves especially valuable for declarative languages, which are notoriously difficult to debug. This feature gives them reproducibility and the chance to single step through their parallel execution in a controlled way. +A valuable benefit of this approach is that all language runtimes plug into the same core controller instances (which are created when the proto-runtime system is started) so they have a common element to their runtime systems. This fact has been exploited, allowing multiple languages to automatically inter-operate and their constructs to be intimately intermixed within application code without penalty. This proves especially valuable for parallel libraries, allowing applications to call library functions written in different languages, without the languages' runtime systems fighting over the hardware resources and so hurting performance. + +Another benefit is that debugging facilities are included in the core controller and thereby inherited into the language. For example, the record and replay feature allows the chance to capture the scheduling choices that led to an erroneous result and replay them, while single stepping and inspecting memory contents. This proves especially valuable for declarative languages, which are notoriously difficult to debug. This feature gives them reproducibility and the chance to single step through their parallel execution in a controlled way. %In this paper, we present work that applies to coherent shared memory machines, both single chip and multiple chip. Extensions beyond this are currently in progress, to address multiple-address-space machines and hierarchical heterogeneous collections of processors, which will appear in future papers. @@ -330,7 +338,7 @@ -To use proto-runtime languages within an application, the proto-runtime system is first started, via the PR\_\_start() call. Then one or more processes are created via the PR\_\_create\_process(seedFn) call. The seedFn (seed function) is the equivalent of the familiar `main' function, which is the entry point of the proto-runtime based portion of the application. Proto-runtime creates a given process with exactly one VP in it. The VP is born running the seed function, and starts executing as soon as the process creation completes. The seed function then proceeds to use language constructs to create VPs and tasks, which execute the work of the application. +To use proto-runtime languages within an application, the proto-runtime system is first started, via the PR\_\_start() call. 
Then one or more processes are created via the PR\_\_create\_process(seedFn) call. The seedFn (seed function) is the equivalent of the familiar `main' function, which is the entry point of the proto-runtime based portion of the application. The proto-runtime system creates a given process with exactly one VP in it. The VP is born running the seed function, and starts executing as soon as the process creation completes. The seed function then proceeds to use language constructs to create VPs and tasks, which execute the work of the application. \subsection{Language organization} \label{subsec:LangOrg} @@ -338,16 +346,16 @@ -A language's runtime behavior is broken into three parts, as seen in Fig. -\ref{fig:langBreakdown}. The first contains the functions that are available to be called by application code, which are just thin wrappers, called the \textit{wrapper library}. These only package information, then -context switch the core over to the proto-runtime context and jump to the core controller. The second and third parts are the modules that execute inside the proto-runtime context. Together, these are called -the \textit{language plugin} or just plugin. This is where the behavior of the language constructs is implemented, and the algorithm for choosing which work to start next on a given free core. +A language's runtime behavior is broken into three parts, which are compiled together into a dynamic library, as seen in Fig. +\ref{fig:langBreakdown}. The first part defines the API that is available to be called by application code. It consists of functions that are just thin wrappers, called the \textit{wrapper library}. These only package information, then +context switch the core over to the proto-runtime context and jump to the core controller. The second and third parts are the modules that execute inside the proto-runtime context. Together, these two are called +the \textit{language plugin} or just plugin. This is where the behavior of the language constructs is implemented, as well as the algorithm for choosing which work to start next on a given free core. \begin{figure}[ht] \centering \includegraphics[width = 2.8in, height = 1.1in]{../figures/proto-runtime__modules_lang_breakdown.pdf} - \caption{The language's runtime system is implemented + \caption{The language's runtime behavior is implemented in three separate pieces which are compiled together into a dynamic library.} \label{fig:langBreakdown} \end{figure} @@ -396,7 +404,7 @@ For example, take a thread library, which we consider -a parallel programming model (which is essentially just the runtime system of a parallel language, stripped of the syntactic portion of the language). The thread library provides a way to explicitly create new timelines, where a thread represents an independent timeline. The library controls relative progress of timelines via the mutex acquire and release commands, which ``block'', meaning the associated timeline suspends; it stops +a parallel programming model (which is essentially just the runtime system of a parallel language, stripped of the syntactic portion of the language). The thread library provides a way to explicitly create new timelines, where a thread has an associated timeline. The library controls relative progress of timelines via the mutex acquire and release commands, which ``block'', meaning the associated timeline suspends; it stops making forward progress. The release in a different thread clears the block, which resumes the timeline. 
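+To illustrate, here is that pattern as seen from the two timelines, using illustrative call names in the style of the vthread wrapper functions that appear later:
\begin{verbatim}
//timeline A                        //timeline B
acquire_mutex( m, vpA );            acquire_mutex( m, vpB ); //A owns m,
... //critical section              //  so B's timeline suspends here
release_mutex( m, vpA );            //A's release clears the block;
...                                 //  B's timeline resumes, owning m
\end{verbatim}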
That linkage between suspend and resume of different timelines is the control the language exerts over which timelines are actively progressing. \subsection{Building up to tie-points} @@ -501,7 +509,7 @@ timeline. These visibilities are equivalent to establishing an order between events before the tied points versus those after the tied points. Both timelines agree on what events are before -versus after the tied point. } +versus after the common tied point. } \label{fig:tie-pointGuarantees} \end{figure} @@ -563,6 +571,7 @@ \subsection{Primitives required to create tied points} +\label{subsec:primitivesForTiePt} Fig \ref{fig:dualTimelineWHidden} showed how a tie-point can be generated. It was established by a combination of primitive mechanisms. These include: 1) suspend plus send; 2) a `hidden' timeline that executes @@ -593,15 +602,15 @@ \label{subsec:protoMappedToTiePoint} -The proto-runtime system treats a point on a timeline as a point within the sequence of instructions executed by a CPU core. Suspending a timeline is equivalent to saving the state of the core, including the context of calls made up to that point (the stack), and the program counter. Hence, when we say that we suspend a timeline, we mean that we perform the steps of saving the current core context into a data structure, from which it can be later restored to the active core. - -The proto-runtime system has two entities that have their own timeline associated: virtual processors and tasks. Internally, a VP or task is a data structure that holds the information required to create or restore a CPU core's context. A VP has its own stack, stack pointer, and program counter, while a task only has the parameters to place onto an externally provided stack, along with the pointer to a birth function (that is because a task has no stack context when it starts, leaves no stack context when it ends, and does not suspend). When a VP suspends, the core's registers are saved onto its stack, while the VP data structure stores the stack pointer, frame pointer, and program counter at which to resume. - -There is a proto-runtime core controller context on each core of the machine. The core alternates between being loaded with the context of an application VP (or task), versus loaded with the core controller context. - -The core controller context is what provides the hidden timeline abstraction. To get to it, the application calls a wrapper library function, which creates a request data structure and fills it with information about the wrapper call, such as the ID of a mutex, and the request handler to use for this request. The wrapper then hands the request structure to a proto-runtime primitive, which suspends the application timeline and switches to the core controller timeline (the hidden timeline). The core controller receives the request data. The controller finds the handler inside the data and calls it, passing it the request data to the handler. The handler thus runs inside the core controller's context, operating upon data from the wrapper library. - -This sequence is seen in Fig. \ref{fig:dualTimelineWHidden}, as the vertical dashed arrows. The arrows are implemented by the toolkit's ``suspend and send request'' primitive. It does a sequence of things: first, the request information that was packaged by the wrapper library is made available via setting a pointer. 
Next, the application timeline that is currently loaded into the core during the wrapper library call is suspended, by saving the core registers into the VP data structure. Lastly, the context of the core controller is swapped in, by loading the core's registers. The core controller then obtains the request structure by reading the pointer.

+The proto-runtime system treats a point on a timeline as a point within the sequence of instructions executed by a CPU core. Suspending a timeline is equivalent to saving the state of the core, including the stack, which is the context of calls made up to that point, and the program counter. Hence, when we say that we suspend a timeline, we mean that we perform the steps of saving the core's current context into a data structure, from which it can later be restored to the core.
+
+The proto-runtime system has two entities that each have their own associated timeline: virtual processors and tasks. Internally, a VP or task is a data structure that holds the information required to create or restore a core's context. A VP has its own stack, stack pointer, and program counter, while a task only has the parameters to place onto an externally provided stack, along with the pointer to a birth function (that is because a task has no stack context when it starts, leaves no stack context when it ends, and does not suspend). When a VP suspends, the core's registers are saved onto the VP's stack, while the VP data structure stores the stack pointer, frame pointer, and program counter at which to resume.
+
+There is a proto-runtime core controller context for each core of the machine. The core alternates between being loaded with the context of an application VP (or task) and being loaded with the core controller context.
+
+The core controller context is what provides the hidden timeline abstraction. To get to it, the application calls a wrapper library function, which creates a request data structure and fills it with information about the wrapper call, for example the mutex involved, and the request handler to use for the request. The wrapper then hands the request structure to a proto-runtime primitive, which suspends the application timeline and switches to the core controller timeline (the hidden timeline). The core controller receives the request data, finds the handler inside the data, and calls it, passing the request data to it. The handler thus runs inside the core controller's context, operating upon data from the wrapper library.
+
+This sequence is seen in Fig. \ref{fig:dualTimelineWHidden}, as the vertical dashed arrows. The arrows are implemented by the toolkit's ``suspend and send request'' primitive. Fig. X shows the sequence it performs: first, the request information that was packaged by the wrapper library is made available via setting a pointer. Next, the application timeline that is currently loaded into the core during the wrapper library call is suspended, by saving the core registers into the VP data structure. Lastly, the context of the core controller is swapped in, by loading the core's registers. The core controller then obtains the request structure by reading the pointer.

 While inside the proto-runtime context, on a given core, the controller uses hardware mechanisms to enforce the single hidden timeline abstraction. It executes request handlers from the language plugin inside this abstraction, so all request handlers appear to run in the same timeline.
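+As an illustration, the acquire and release mutex handlers described shortly can be sketched as below. The struct, field, and helper names are ours, not the exact toolkit API; the mutex plausibly holds an owner and a collection of waiters, matching the acquire and release behavior described later:
\begin{verbatim}
//Sketch only -- names are illustrative, not the exact toolkit API.
typedef struct { VirtProcr *ownerVP; VPQueue *waitingVPs; } Mutex;

void handle_acquire( PRReqData *req, VirtProcr *callingVP, LangEnv *env )
 { Mutex *m = lookup_mutex( env, req->mutexID );
   if( m->ownerVP == NULL )
    { m->ownerVP = callingVP;
      make_VP_ready( callingVP, env );     //got the mutex, keep going
    }
   else
      enqueue( m->waitingVPs, callingVP ); //stays suspended
 }

void handle_release( PRReqData *req, VirtProcr *callingVP, LangEnv *env )
 { Mutex *m = lookup_mutex( env, req->mutexID );
   m->ownerVP = dequeue( m->waitingVPs );  //NULL if none waiting
   if( m->ownerVP != NULL )
      make_VP_ready( m->ownerVP, env );    //ties release to acquire
   make_VP_ready( callingVP, env );        //the releaser continues
 }
\end{verbatim}
+Note that no locking appears in the handlers; they rely entirely on the hidden timeline for atomicity.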
They are sequentialized within this timeline, giving the plugin the illusion of running on a single core, and allowing it to be written using a sequential mental model. @@ -612,7 +621,6 @@ For example, the handler for acquire mutex has a mutex data structure available in the hidden timeline environment, inspects it to see whether it is currently owned, and if so, places a pointer to the VP that invoked the handler into the mutex data structure. Later, a release handler will inspect the mutex structure, see the waiting VP, and resume it. That ties the acquire to the release, forming a tie-point between their wrapper library calls. - The primitives make available from the toolkit, that relate to tie-points, include these: %It provides the primitive that suspends a timeline and then causes language plugin behavior to execute in the gap. @@ -622,19 +630,6 @@ %\subsection{More about the proto-runtime} -\begin{itemize} -\item create a virtual processor (which has a suspendible timeline) -\item create a task (which has a timeline that runs to completion) - -\item suspend a timeline, switch contexts over to the hidden timeline, and deliver a request message from the suspended timeline. The request message is processed on the hidden timeline. (suspend of a task automatically converts it to a virtual processor) -\item resume a virtual processor, which makes it ready for execution -\item start a waiting task, which makes it ready for execution -\item end a virtual processor -\item end a task (which may trigger start of dependent tasks) -\item a trigger is implemented, which offers a free core to an assigner function. The assigner chooses which ready virtual processor or task to begin execution on the offered -core. - -\end{itemize} The reason for having both VPs and tasks is a practical one, as tasks are simpler, with less overhead, and many languages have the semantics of short, atomic, units of work that @@ -679,7 +674,7 @@ already owns the mutex, the calling thread is put into a collection of waiting threads, and stops making progress. \item Release Mutex: A thread calls the construct and -provides the name of the mutex, then continues. If the mutex has waiting threads in its collection, then the next thread is taken out and given ownership of the mutex. That thread is also resumed, to once again make progress. +provides the name of the mutex, then continues. If the mutex has waiting threads in its collection, then the next thread is taken out and given ownership of the mutex. That thread is resumed, to once again make progress. \end{itemize} In our implementation, we shall represent a mutex with a data structure that has two fields: @@ -707,11 +702,11 @@ -\subsection{Low Level Example of Construct Implementation} - -To understand how to implement a language's runtime, the first thing to look at is a sample application. - -Fig \ref{fig:Main} shows the main function of an application. It calls \texttt{PR\_\_start()}, which initializes the proto-runtime core controller. Then it calls \texttt{PR\_\_create\_process}, which creates a proto-runtime process. This is analogous to creating a process in the OS. The OS process is created with a main thread that comes into existence running the \texttt{main} function. The proto-runtime process is created with a seed VP that comes into existence executing the process birth Fn. 
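+To connect such handlers back to the wrapper library, the acquire wrapper can be sketched as follows; the request struct, type tag, and primitive names are illustrative, not the exact toolkit API:
\begin{verbatim}
void vthread__acquire_mutex( VthdMutex *mutex, VirtProcr *executingVP )
 { VthdSemReq reqData;             //lives on the suspended VP's stack
   reqData.reqType   = acquire_mutex_req;
   reqData.mutex     = mutex;
   reqData.handlerFn = &handle_acquire;
   PR__send_request_and_suspend( &reqData, executingVP );
 } //returns only after a release handler has made this VP ready again
\end{verbatim}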
+
+\subsection{Hands-on Example of Construct Implementation}
+
+To gain context, we look at starting up proto-runtime, creating a process, and then calling our construct in a sample application.
+
+Fig \ref{fig:Main} shows the main function of our application. It calls \texttt{PR\_\_start()}, which initializes the proto-runtime core controller. Then it calls \texttt{PR\_\_create\_process}, which creates a proto-runtime process. This is analogous to creating a process in the OS. The OS process is created with a main thread that comes into existence running the \texttt{main} function. The proto-runtime process is created with a seed VP that comes into existence executing the process birth Fn.

@@ -726,7 +721,7 @@
    PR__start();
    process = PR__create_process( &process_birth_fn, params );
    PR__wait_for_process_to_end( process );
-
+   ... //display results, which are inside 'params'
    PR__end();
  }
 \end{verbatim}

@@ -797,12 +792,12 @@
 {\noindent
 {\footnotesize
 \begin{verbatim}
-void * thd_birth_gn( void *data, VirtProcr *executingVP )
+void * thd_birth_fn( void *data, VirtProcr *executingVP )
  { ...
    params = (MyThdParams *)data; //set in process birth fn
-
-   mutex = params->mutex; //put into params in process birth fn
+   ... //do some work
+   mutex = params->mutex; //put into params by process birth fn
    vthread__acquire_mutex( mutex, executingVP ); //synchronize
    ...
  }

@@ -936,38 +931,103 @@
 %%%%%%%%%%%%%%%%%%%%%%%%
 %%
 %%%%%%%%%%%%%%%%%%%%%%%%
-\section{Proto-runtime toolkit and core controller implementation details}
+\section{Putting it all together: implementation details of the core controller}
 \label{sec:PRImpl}

-The proto-runtime toolkit presents the full API of calls used by both language implementors and application developers (who need calls such as \texttt{PR\_\_Start()} and \texttt{PR\_\_create\_process()}). The API is a combination of calls available to the application code, calls available to the wrapper library, and calls available to request handers and the assigner. used by language implementors, it only contains implementation of some of the calls. The rest are implemented inside the core controller, which is distributed separately in to form of a dynamic library. We term the subsect of these calls that is used for creating a language's runtime system the \textit{proto-runtime primitives}. The full set of calls, including the primitives and their semantics, can be found on the proto-runtime web site \cite{}.

-These calls can be implemented via many different approaches, and many of the primitives are even suitable for direct hardware implementation. The approach that works best depends on the hardware details. Hence, there exists no definitive implementation approach for the proto-runtime core controller, nor for the primitives.

-Here, however, we describe a few possibilities then sketch the approach we used for multi-core coherent shared memory machines.

+We want to turn now to putting all the pieces together, showing the implementation in full detail, so that the concepts become clear. Before doing so, a caveat: some readers may jump to the conclusion that there is only one best implementation, and will rightly object that what we show here may be good on some hardware but a terrible solution on other hardware. So we preface this section by pointing out the difference between the value of the API and general approach, versus the value of any one particular implementation on one particular hardware platform.
The beauty of the approach is that it is purposely designed to have many different implementations, all at the same time, each one best for a small segment of hardware. Thus, criticism that the simple implementation we show here is bad on many hardware platforms is both correct and misses the point.
+
+The proto-runtime API contains calls for use within the application code, the wrapper library, request handlers, and the assigner. The implementation of these calls is split between the toolkit and the core controller, which are distributed separately. The toolkit contains header files and static libraries, which remain the same across most machines, while the core controller is a dynamic library, the implementation of which changes often between machines. The full set of calls, including the primitives and their semantics, can be found on the proto-runtime web site \cite{}.
+
+There is a subset of the API that is used for the purpose of creating a language's runtime system. We term this subset the \textit{proto-runtime primitives}. These include the tie-point creation primitives noted in Section \ref{subsec:primitivesForTiePt}. They are used inside a wrapper library to save the current context into a VP structure and switch over to the core controller, and are also used inside the request handlers to do things like make a VP ready to have its context restored to a core. We show one implementation of these in action throughout this section.
+
+The implementation of these primitives can take many different approaches, and many of the primitives are even suitable for direct hardware implementation. The approach that works best depends on the hardware details. Hence, there exists no definitive implementation approach for the proto-runtime core controller, nor for the primitives.
+
+Here, we describe several possibilities, then give the approach we used for single socket (AKA single chip) multi-core, coherent shared memory machines. Despite its simplicity, it has been measured to be the best approach seen so far on this hardware.

 \subsection{Possible architectures and approaches}

-From a perfomance point of view, the most discriminating feature of hardware is the memory system, whether it has coherent shared memory or non-coherent shared memory or distributed memory. The second most discriminating feature is the nature of the communication between cores, especially latency, and variations due to the hierarchy in the hardware.

-On a multi-core coherent shared memory machine, two main approaches to proto-runtime execution are available. The first is to designate a single core as the ``manager'' core, and exclusively execute the proto-runtime core controller on that core. The other approach is to run a separate copy of the core controller on every core, and then coordinate among them.

-For this paper, we focus on our implementation for coherent shared memory hardware. We have chosen the approach of running a separate copy of the proto-runtime core controller on every core.

+The choice of implementation approach for the core controller is determined mainly by performance considerations. From a performance standpoint, the most discriminating feature of hardware is the memory system, whether it has coherent shared memory, non-coherent shared memory, or distributed memory. The second most discriminating feature is the nature of the communication between cores, especially latency, and variations due to the hierarchy in the hardware.
+
+On a multi-core coherent shared memory machine, two main approaches to core controller execution are available. One is to designate a single core as the ``manager'' core, and exclusively execute the proto-runtime core controller on that core. The other approach is to run a separate copy of the core controller on every core, and then coordinate among them.
+
+For single socket coherent shared memory hardware, we chose to run a separate copy of the proto-runtime core controller on each core.
+

 \subsection{Walk through of execution on a multi-core machine}
 \label{sec:PRExecModel}

-At this point, it may be helpful to understand the time sequence of execution, with a walk through of a simple example.
+To understand the implementation of the primitives, it may be helpful to understand the time sequence of execution, with a walk through of an example. Each activity in the figure has a corresponding line in the sample code in Sec. \ref{sec:LangImpl}.
+
+Fig \ref{fig:TimeSeq} shows the sequence of calls and context switches. The figure starts at the left, showing application code being performed in an application VP's context. As seen, the code calls a synchronization construct, and enters the wrapper library for it. The wrapper library then calls the proto-runtime \texttt{send\_and\_suspend} primitive, which accepts request data, suspends the VP, switches to the core controller context, and hands the request data to the core controller code. The context switch is represented as the vertical arrow pointing down. The request data is communicated by setting a pointer in a fixed location.
+
+Inside the core controller code, we chose to protect shared core controller data structures with a global lock. This choice can vary between machines. This lock also protects the language environments, so that access is serialized among the cores. It turns out that protecting internal runtime state has different best practices than protecting the work data that forms application computations. A global lock implemented with an atomic instruction, in this case CAS, proves to have the best measured performance (the lowest runtime overhead) of the alternatives tried, on low core count, single chip, coherent shared memory machines. The rule of thumb that a global lock is bad for performance proves not to apply to internal runtime state on this hardware, as supported by the measurements seen in Sec \ref{sec:Measurement}.
+
+
+\begin{figure}[H]
+  \centering
+  \includegraphics[scale=1]
+  {../figures/PR__system_level_wrapper_lib_to_req_hdlr.pdf}
+  \caption{The sequence of calls and context switches, starting in application code, going into a call to a synchronization construct's wrapper library, switching over to the controller, and eventually back to application code.}
+  \label{fig:TimeSeq}
+\end{figure}
+

-Given that have chosen to place a separate core controller on each core, when a wrapper library calls the \texttt{send\_and\_suspend} primitive on a given core, the primitive switches context over to the core controller for that core. That instance then calls the request handler, and passes it the language environment. This language environment is the means by which the request handler invocations communicate with each other. So it is also the means by which the cores communicate with each other.

+After the lock is acquired, the core controller gets the request data and uses the pointer inside it to call the request handler function.
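+In code form, one pass through the controller can be sketched as below; the walk through continues in the following paragraphs, and the lock, dispatch, and assigner names here are illustrative, not the exact toolkit API:
\begin{verbatim}
//Sketch of one pass through a core controller -- names illustrative.
void core_controller_pass( int coreNum )
 { PRReqData *req = get_request_ptr( coreNum ); //set by send_and_suspend
   acquire_global_lock();               //CAS-based, serializes handlers
   req->handlerFn( req, req->callingVP, lang_env_for( req ) );
   AssignerFn assigner = pick_language_with_ready_work();
   assigner( lang_env_of_that_language() ); //puts chosen VP in fixed spot
   release_global_lock();
   switch_to_context( get_chosen_VP( coreNum ) ); //asm: resume that VP
 }
\end{verbatim}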
It finds the language environment for that request type and gives a pointer to that environment to the request handler. Because it is running inside the lock, the handler can simply use sequential reasoning to implement its functionality.
+
+The request handler accesses data in the language environment and modifies it, then returns to the core controller. Next, the controller chooses a language whose runtime has work ready, and calls that language's Assigner. It passes the Assigner the language environment for that language, which contains some data structure that holds VPs ready to be resumed.
+
+The Assigner function, supplied by the language implementor, looks in the data structure of ready VPs, which was populated by request handlers that determined these VPs are free of constraints and ready to be resumed. The Assigner picks one of these ready VPs and calls a proto-runtime primitive that inserts a pointer to the VP data structure into a fixed spot. Then the Assigner returns to the core controller.
+
+The core controller then releases the lock and calls the assembly code that performs the context switch. It passes the pointer to the VP that was chosen by the Assigner, and the assembly code switches the physical CPU over to the context stored in that VP data structure. This context switch is seen as the vertical arrow going up.
+
+The result of the context switch is that execution resumes inside the proto-runtime primitive that did the suspend. It returns back into the wrapper library code that called the primitive. That returns back into the application code, as seen at the right of the figure.
+
+
+
+\begin{figure}[H]
+  \centering
+  \includegraphics[scale=1]
+  {../figures/PR__system_level_ctxt_switching_LangEnv_comm_1.pdf}
+  \caption{Context switching between core controllers and application VPs or `idle' VPs. The vertical arrows are context switches, the horizontal sections are segments during which one of the contexts is animated, and the color indicates which physical CPU core is doing the animating. An application VP can be animated by either core, but a core controller context is one-to-one with a physical core, and is only ever animated by that core.}
+  \label{fig:CtxtSwitches1}
+\end{figure}
+
+Fig \ref{fig:CtxtSwitches1} shows an illustrative pattern of context switches on a two core machine. Red represents one CPU core, and green the other. Vertical lines represent context switches. Note that in places where both cores are inside the core controller context, one of the contexts must wait to acquire the global lock (in this particular implementation, for this hardware).
+
+Fig \ref{fig:CtxtSwitches2} adds dotted lines that represent the communication of changes made to the language environment. The changes are communicated from one core controller invocation to the next. One request handler invocation makes changes inside the language environment, which are passed to the next request handler invocation, which may be on the same core or on a different core.
+
+The language environment is the means by which the request handler invocations communicate with each other. So it is also the means by which the cores communicate changes in constraint state to each other.
+
+
+\begin{figure}[H]
+  \centering
+  \includegraphics[scale=1]
+  {../figures/PR__system_level_ctxt_switching_LangEnv_comm_2.pdf}
+  \caption{The same figure, with dotted lines added, which indicate communication of changes in the language environment, from one invocation of a core controller context to the next.
The changes are made by the request handler and the Assigner.}
+  \label{fig:CtxtSwitches2}
+\end{figure}
+
+
+
+\begin{figure}[H]
+  \centering
+  \includegraphics[scale=1]
+  {../figures/PR__system_level_ctxt_switching.pdf}
+  \caption{Shows context switches together with a blow up of one portion, where the detailed flow of calls and context switching can be seen.}
+  \label{fig:CtxtPlusTimeSeq}
+\end{figure}
+
+

 Imagine 4 cores. And imagine two tasks, task1 and task2. Task1 produces a result, and task2 takes that result as input. Therefore, task2 cannot start execution until task1 completes. This fact is represented as the contents of a data structure that in turn is inside the language environment.

-Now, task1 is executing on core1. It finishes, and invokes the ``end of work" wrapper library call. This switches context to the proto-runtime core controller on core1. That instance in turn ensures atomic ownership of the language environment, and passes the environment to the request handler for ``end of work".
- The request handler is still running on core1.
+
+Now, task1 is executing on core1. It finishes, and invokes the ``end of work" wrapper library call. This switches context to the proto-runtime core controller on core1. That instance in turn ensures atomic ownership of the language environment, and passes the environment to the request handler for ``end of work". The request handler is still running on core1.

 The request handler looks inside the language environment, sees that task1 is the predecessor of task2, and so it knows that task2 is now free to start execution. It places a pointer to task2 into a queue of ready to run tasks. The queue is inside the language environment.

@@ -977,6 +1037,18 @@
 When the Assigner returns, the proto-runtime core controller then performs a context switch, switching itself out, and switching in the work that was assigned to the core. Execution of task2 then proceeds on core3. Task2 continues there until it performs a wrapper library call, which then switches to proto-runtime, and so on.
+
+\subsection{What needs to be implemented}
+The toolkit provides an implementation of the assembly language code that saves the current context into a VP data structure and restores a saved one. Around that, the core controller implementation must:
+\begin{itemize}
+\item receive the request struct, update the counts of live work, transform a suspended task into a suspended VP, call the request handler, then pick a process and a language and call the assigner for it;
+\item provide the means to set and get language-specific info attached to a VP and a task;
+\item define the VP data structure and its fields.
+\end{itemize}
+
+\subsection{Code details}
+The core controller
+
 \subsubsection{Performance related issues}
 A few items to note about the example. First, the example describes only one of many different possible implementations of the proto-runtime interface. In particular, multiple choices are available for assuring atomicity of access to the shared language environment. Atomicity could be assured via a global lock (which itself has many possible implementations), or via transactional memory, or the language environment state could be managed using distributed style quorum techniques. In practice, each hardware platform has to be carefully measured and the details of the protection scheme adapted to it.
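+To round out the walk through, the ``end of work'' handler for the task1/task2 example might be sketched as follows; the struct and helper names are illustrative, not the exact toolkit API:
\begin{verbatim}
//Sketch of the "end of work" request handler -- names illustrative.
void handle_end_of_work( PRReqData *req, LangEnv *env )
 { Task *ended = req->endingTask;
   //scan the tasks recorded as waiting upon the ended task
   for( Task *t = first_dependent( env, ended ); t != NULL;
        t = next_dependent( env, ended, t ) )
    { t->numUnfinishedPredecessors--;         //task1 was one of these
      if( t->numUnfinishedPredecessors == 0 )
         enqueue( env->readyTaskQ, t );       //task2 becomes ready here
    }
 }
\end{verbatim}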
\subsection{What needs to be implemented}
The toolkit provides an implementation of the assembly language code that saves the current context, together with a core controller that covers the following (a sketch of the VP data structure follows this list):

-] Save the current context into the VP data structure
-] Receive the request struct, update the counts of live work, transform the suspended task into a suspended VP, call the request handler, pick a process and a language, and call the Assigner for it. Provide the means to set and get language-specific info attached to a VP and a task.
-] Show the VP data structure, explain the fields
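Pending that explanation, the following is a speculative sketch of the general shape of the VP data structure, inferred only from the fields the surrounding text implies (saved context, attached language-specific info, and the associated task). The actual proto-runtime definition will differ:

\begin{verbatim}
#include <stdint.h>

/* Speculative sketch of the VP (virtual processor) structure; the
 * field names and layout are assumptions, not the real definition. */
typedef struct VP {
    /* processor state saved by the context-switch assembly code */
    uintptr_t saved_regs[16];
    void     *stack_ptr;
    void     *program_counter;

    /* language-specific info attached to the VP, reached through
       the set/get accessors mentioned above                      */
    void     *lang_data;

    /* the task this VP is animating, if any -- a suspended task
       is transformed into a suspended VP                         */
    void     *task;
} VP;

/* Accessors of the kind the list above says are provided (names
 * assumed).                                                      */
static inline void  vp_set_lang_data(VP *vp, void *d)
    { vp->lang_data = d; }
static inline void *vp_get_lang_data(VP *vp)
    { return vp->lang_data; }
\end{verbatim}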
\subsection{Code details}
The core controller

\subsubsection{Performance related issues}
A few items to note about the example. First, the example describes only one of many possible implementations of the proto-runtime interface. In particular, multiple choices are available for assuring atomicity of access to the shared language environment: atomicity could be assured via a global lock (which itself has many possible implementations), via transactional memory, or by managing the language environment state using distributed-style quorum techniques. In practice, each hardware platform has to be carefully measured, and the details of the protection scheme adapted to it.

@@ -993,12 +1065,12 @@
%%%%%%%%%%%%%%%%%%%%%%%%
\section{Measurements}
\label{sec:Measurement}
-We provide runtime overhead measurements in \S\ref{subsec:OverheadMeas} and implementation time measurements in \S\ref{subsec:ImplTimeMeas}
+To verify the claims of high performance as well as high productivity, we provide runtime overhead measurements in \S\ref{subsec:OverheadMeas} and implementation-time measurements in \S\ref{subsec:ImplTimeMeas}.

%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Overhead Measurements}
\label{subsec:OverheadMeas}
-For the following, we use a 1 socket 4-core 2.4Ghz machine, and a 4 socket by 10 core each machine.
+For the following overhead measurements, we use a single-socket 4-core 2.4 GHz machine and a 4-socket machine with 10 cores per socket.

@@ -1040,7 +1112,7 @@
%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{VSs Versus StarSs (also known as OMPSs)}
-Figure \ref{fig:VSsPerf} shows the same experiment performed twice, once with the Nanos runtime, which is part of the official OMPSs distribution, and a second time with a proto-runtime based runtime. In an experiment, several runs are performed. The total work is the same for each run, but each run divides the work into a different number of tasks. To the left of the graph is a large number of small tasks, to the right is a small number of larger tasks. The Nanos experiment, on the left, shows that the overhead grows as task size shrinks, while the proto-runtime experiment, on the right, shows a much smaller overhead.
+Figure \ref{fig:VSsPerf} shows a given experiment performed twice: once with the Nanos runtime, which is part of the official OMPSs distribution, and a second time with a proto-runtime based runtime. In an experiment, several runs are performed; the total work is the same for each run, but each run divides that work into a different number of tasks. At the left of the graph are many small tasks; at the right, a few large tasks. The Nanos experiment, on the left, shows overhead growing as task size shrinks, while the proto-runtime experiment, on the right, shows much smaller overhead.

This indicates that the proto-runtime implementation achieves similar performance on the actual work, but has much lower overhead than the Nanos runtime. The Nanos runtime required many man-months to develop and tune, while the proto-runtime version was completed in two weeks of part-time effort.
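The structure of such an experiment is easy to reproduce. Below is a generic sketch of a fixed-total-work harness, not the benchmark actually used for Figure \ref{fig:VSsPerf}: the same total work is split into progressively more, smaller tasks, and the ratio of total execution time to pure work time exposes the runtime overhead. The loop here runs the tasks sequentially; a real harness would submit each task to the runtime under test.

\begin{verbatim}
#include <stdio.h>
#include <time.h>

#define TOTAL_WORK (1L << 26)  /* total iterations; constant per run */

static volatile long sink;     /* keeps the work from being
                                  optimized away                     */

static void do_task(long iterations) /* one unit of application work */
{
    for (long i = 0; i < iterations; i++)
        sink += i;
}

static double seconds_for(long num_tasks)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long t = 0; t < num_tasks; t++)
        do_task(TOTAL_WORK / num_tasks); /* a real harness hands
                                            this to the runtime    */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    double work_only = seconds_for(1); /* one big task: overhead is
                                          negligible                */
    for (long n = 256; n <= (1L << 20); n *= 4) {
        double total = seconds_for(n);
        printf("%8ld tasks of %8ld iterations: ratio %.3f\n",
               n, TOTAL_WORK / n, total / work_only);
    }
    return 0;
}
\end{verbatim}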
@@ -1087,7 +1159,7 @@
\label{subsec:ImplTimeMeas}
Here, in Table \ref{tabPersonHoursLang}, we summarize the time to develop the runtime systems for a variety of languages. During development, we tracked the time spent to design, code, and debug an initial working version. The results are shown in the same order in which we created them, with SSR first. As we gained experience, design and coding became more efficient. These are hours spent at the keyboard or with pen and paper, and don't include think time during other activities of the day.

-For comparison, Nanos, the runtime system for OMPSs required several man-months to implement and tune, versus two weeks part-time for proto-runtime, and that higher effort achieves slightly worse performance. The sophisticated Cilk runtime that ships with the MIT distribution required man-years of effort, versus a few hours for proto-runtime, but is much more sophisticated and has better overhead. The previous Reo runtime required 2 to 4 times more effort, but was based on Java, which in turn sits on OS threads. It consumes hundreds of times more overhead.
+For comparison, Nanos, the runtime system for OMPSs, required several man-months to implement and tune, versus two weeks of part-time effort for the proto-runtime version, and that higher effort achieves slightly worse performance. The Cilk runtime that ships with the MIT distribution required man-years of effort, versus a few hours for proto-runtime, but is much more sophisticated and does achieve lower overhead. The previous Reo runtime required roughly twice the effort, but was based on Java, which in turn sits on OS threads; it consumes dozens of times more overhead.

\begin{table}
\caption
@@ -1152,9 +1224,9 @@
\end{table}

-Using OS Threads involves implementing the target runtime system on top of the existing OS thread (or alternative such as TBB) runtime system. In this case, the target language runtime runs on top of the lower-level runtime. This places control of work placement inside the lower-level runtime, blocking the language, which hurts application-code performance, due to inability to use data locality. In addition, OS threads have operating system overhead and OS-imposed fairness requirements, which keeps runtime performance poor as seen in Section \ref{sec:VthreadVsPthread}.
-
-This forces the implementation to manage concurrency explicitly, using constructs from the lower-level runtime such as mutexes. Hence, implementation effort is only fair for this approach.
+Using OS threads involves implementing the target runtime system on top of the existing OS-thread (or alternative, such as TBB) runtime system. In this case, the target language runtime runs on top of the lower-level runtime. This places control of work placement inside the lower-level runtime, out of reach of the language, which hurts application-code performance due to the inability to exploit data locality. In addition, OS threads carry operating-system overhead and OS-imposed fairness requirements, which keep runtime performance poor, as seen in Section \ref{sec:VthreadVsPthread}.
+
+This approach also forces the implementation to manage concurrency explicitly, using constructs from the lower-level runtime such as mutexes. Hence, implementation effort rates only middling for this approach.

Porting in this approach can be low effort, but only at the expense of losing yet more performance and worsening overhead. Most language runtime systems have parameters and design choices that benefit from being tuned to the hardware specifics. Hence, this approach scores only fair on porting effort as well.

@@ -1174,14 +1246,7 @@
%%%%%%%%%%%%%%%%%%%%%%%%
\section{Conclusions and Future Work}
\label{sec:conclusion}
-The main takeaways from the paper are first, the role that the proto-runtime approach can play in simplifying the runtime aspect of implementing a large number of parallel languages across the many hardware targets.
-%The proto-runtime approach does this by modularizing the runtimes, providing reuse of centralized services, and reuse of the hardware-specific performance tuning, which is performed once per hardware, on the proto-runtime, then enjoyed by all the eDSLs. Hence, the proto-runtime approach provides a significant piece of the puzzle of providing eDSLs, to bring parallel programming into the mainstream.
-
-%[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]]
-
-Specifically, we have shown how the approach modularizes runtime code, in a way that appears applicable to any language or execution model. It isolates the hardware-specific portion from language behavior as well as from the language-driven placement of work onto resources, via a clean interface.
+We have shown how the proto-runtime approach modularizes parallel runtime code, in a way that appears applicable to any language or execution model. It isolates the hardware-specific portion from language behavior, as well as from the language-driven placement of work onto resources, via a clean interface.

The modularization reduces the effort of implementing a new language. It allows the low-level hardware portion to be reused by every language, and it simplifies the behavior implementation, by handling shared state inside the proto-runtime and exporting a sequential interface for the behavior module to use. Evidence of the effort reduction was supplied by measurements of implementation time, and the low overhead was verified by head-to-head comparisons.

@@ -1213,6 +1278,36 @@
[hunk adds blank lines only]
\section{Implementing a Language in Detail}

diff -r 8cb89eed66b4 -r d6450ce874a9 0__Papers/PR/PR__Toolkit__FORTH_present__driven/latex/pickup_artists.txt
--- a/0__Papers/PR/PR__Toolkit__FORTH_present__driven/latex/pickup_artists.txt Tue Dec 31 10:49:41 2013 -0800
+++ b/0__Papers/PR/PR__Toolkit__FORTH_present__driven/latex/pickup_artists.txt Thu Jan 30 05:59:54 2014 -0800
@@ -1,3 +1,3 @@
 neil strauss
-john sinn
\ No newline at end of file
+john sinn

diff -r 8cb89eed66b4 -r d6450ce874a9 1__Presentations/14__Ja__HiPEAC_entr_workshop/Entrepreneur_in_Europe.odp
Binary file 1__Presentations/14__Ja__HiPEAC_entr_workshop/Entrepreneur_in_Europe.odp has changed

diff -r 8cb89eed66b4 -r d6450ce874a9 1__Presentations/14__Ja__HiPEAC_poster/14_Ja__HiPEAC_poster.svg
--- a/1__Presentations/14__Ja__HiPEAC_poster/14_Ja__HiPEAC_poster.svg Tue Dec 31 10:49:41 2013 -0800
+++ b/1__Presentations/14__Ja__HiPEAC_poster/14_Ja__HiPEAC_poster.svg Thu Jan 30 05:59:54 2014 -0800
[SVG markup elided. The hunks adjust Inkscape view settings (inkscape:cy 2586.636 -> 2887.1898), redraw a vertical divider line, and delete draft note text from the poster: ``Portability / Adoptability / Productivity''; ``-] Parallelism in Everyday World: driving a car, putting on a wedding, manufacturing, a business ==> motivate languages by these''; ``-] Size of project & preferences: large project best with team-oriented features, single creative person best with flexible (dyn) language, domain expert best with language that matches domain ==> wide variation -- support all''; ``-] Priority of software segment: embedded == performance, web == productivity, enterprise == dev process ==> productivity, given priority''; ``-] Specialize to hardware: slows dev cycle if specializing during development, tools to specialize == good ==> separate specialize phase''. The hunks then reposition the chart group vms_pthread_vthread_8_32_128_512thds__o30000__perfCtrs:]
[Chart markup elided: three panels (one untitled, ``pthreads'', ``Vthread''), each with axes ``Ratio of Total Execution to Total Work'' and ``Cycles in one Task'' (tick ranges 1--10 and 0--8000), and series for 8, 32, 128, and 512 Threads.]
[Further SVG markup elided. The remaining text hunks retitle poster panels: ``Approaches Available for Implementing a Runtime System'' becomes ``Implementing a Parallel Runtime System -- Approaches Available'', and ``How to Impl a Language'' becomes ``Runtime Software Stack: where does proto-runtime fit?''. Other panel titles: ``The Tie-Point Model for Implementing Synchronization Constructs'', ``The Proto-Runtime Implementation of the Tie-Point Model'', ``Measurements on Languages Implemented via Proto-Runtime Toolkit'', ``Implementation Effort'', ``Detailed Overhead of Vthread'', ``Modularization'', ``Primitives (API)'', ``Language Organization''.]
[The final poster hunks replace the draft stack-diagram labels (``Using OS Thds'', ``Custom HW Primitives'', ``Modify Existing'', ``Modularize and Reuse: tie-point model enables modularization -] defines primitives -] defines abstraction'') with four approaches under the heading ``Approaches for Implementing a Runtime System'', each paired with a small software-stack diagram (application, parallel execution model, OS threads or proto-runtime toolkit, HW primitives) and pros and cons:

Use OS Threads:
  Pros: -] Fairly low effort
  Cons: -] One execution model implemented on top of a different execution model
        -] High overhead (poor performance)

Use HW primitives:
  Pros: -] Possible to get fastest impl.
  Cons: -] High effort to use HW primitives
        -] Repeat effort on each HW platform

Modify Existing:
  Pros: -] Possible to get fairly fast impl.
  Cons: -] High effort to modify obscure code
        -] Repeat effort on each HW platform

Use Proto-runtime Toolkit:
  Pros: -] Best of both worlds: fastest (or near) on all HW, lowest effort
  Cons: -] For some simple Exe models, can get slightly faster with HW primitives]

diff -r 8cb89eed66b4 -r d6450ce874a9 CVs/13_Dc__Industry_CV/CV_w_pubs.pdf
Binary file CVs/13_Dc__Industry_CV/CV_w_pubs.pdf has changed
diff -r 8cb89eed66b4 -r d6450ce874a9 CVs/13_Dc__Industry_CV/CV_w_pubs.tex
--- a/CVs/13_Dc__Industry_CV/CV_w_pubs.tex Tue Dec 31 10:49:41 2013 -0800
+++ b/CVs/13_Dc__Industry_CV/CV_w_pubs.tex Thu Jan 30 05:59:54 2014 -0800
@@ -157,7 +157,7 @@
 HotPar11_w_Stack,
 PStackHome,
 VMS_LCPC_11,
-VMSHome,
+ProtoRuntimeHome,
 WorkTableHome,
 BaCTiLTechRep,
 CIPTechRep,
diff -r 8cb89eed66b4 -r d6450ce874a9 CVs/13_Dc__Industry_CV/bib_for_papers_apr_2012.bib
--- a/CVs/13_Dc__Industry_CV/bib_for_papers_apr_2012.bib Tue Dec 31 10:49:41 2013 -0800
+++ b/CVs/13_Dc__Industry_CV/bib_for_papers_apr_2012.bib Thu Jan 30 05:59:54 2014 -0800
@@ -1,3 +1,8 @@
+@Misc{ProtoRuntimeHome,
+  author = {Sean Halle},
+  note = {http://opensourceresearchinstitute.org/pmwiki.php/PRT/HomePage},
+  title = {{Proto-Runtime} Home Page}
+}
 @techrep{SyncConstr_impl_w_distr_coherence_HW_Utah_96,
  author = {Carter, J. B. and Kuo, C.-C.
 and Kuramkote, R.},
 title = {A comparison of software and hardware synchronization mechanisms for distributed shared memory multiprocessors},
@@ -62,8 +67,8 @@
 }
 @Misc{WorkTableHome,
 author = {Halle, Sean},
- note = {http://musictwodotoh.com/worktable/content/refman.pdf},
- title = {The WorkTable Language Reference Manual},
+ note = {http://opensourceresearchinstitute.org/pmwiki.php/WorkTable/HomePage},
+ title = {The WorkTable Language Tutorial},
 year = 2012
 }
 @Misc{HWSimHome,
@@ -200,67 +205,67 @@
 }
 @Misc{StackTechRep_10,
 author = {Halle, Sean and Nadezhkin, Dmitry and Cohen, Albert},
- note = {http://www.soe.ucsc.edu/share/technical-reports/2010/ucsc-soe-10-02.pdf},
+ note = {http://www.soe.ucsc.edu/research/technical-reports/UCSC-SOE-10-02},
 title = {A Framework to Support Research on Portable High Performance Parallelism},
 year = 2010
 }
 @Misc{CTBigStepSemTechRep_06,
 author = {Halle, Sean},
- note = {http://www.soe.ucsc.edu/share/technical-reports/2006/ucsc-crl-06-11.pdf},
+ note = {http://www.soe.ucsc.edu/research/technical-reports/UCSC-CRL-06-11},
 title = {The Big-Step Operational Semantics of CodeTime Circuits},
 year = 2006
 }
 @Misc{MentalFrameworkTechRep_06,
 author = {Halle, Sean},
- note = {http://www.soe.ucsc.edu/share/technical-reports/2006/ucsc-crl-06-12.pdf},
+ note = {http://www.soe.ucsc.edu/research/technical-reports/UCSC-CRL-06-12},
 title = {A Mental Framework for use in Creating Hardware Independent Parallel Languages},
 year = 2006
 }
 @Misc{DKUTechRep_09,
 author = {Halle, Sean and Cohen, Albert},
- note = {http://www.soe.ucsc.edu/share/technical-reports/2009/ucsc-soe-09-06.pdf},
+ note = {http://www.soe.ucsc.edu/research/technical-reports/UCSC-SOE-09-06},
 title = {DKU Pattern for Performance Portable Parallel Software},
 year = 2009
 }
 @Misc{EQNLangTechRep,
 author = {Halle, Sean},
- note = {http://www.soe.ucsc.edu/share/technical-reports/2009/ucsc-soe-09-16.pdf},
+ note = {http://www.soe.ucsc.edu/research/technical-reports/UCSC-SOE-09-16},
 title = {An Extensible Parallel Language},
 year = 2009
 }
 @Misc{CTOSTechRep,
 author = {Halle, Sean},
- note = {http://www.soe.ucsc.edu/share/technical-reports/2009/ucsc-soe-09-15.pdf},
+ note = {http://www.soe.ucsc.edu/research/technical-reports/UCSC-SOE-09-15},
 title = {A Hardware-Independent Parallel Operating System Abstraction Layer for Parallelism},
 year = 2009
 }
 @Misc{SideEffectsTechRep,
 author = {Halle, Sean and Cohen, Albert},
- note = {http://www.soe.ucsc.edu/share/technical-reports/2009/ucsc-soe-09-14.pdf},
+ note = {http://www.soe.ucsc.edu/research/technical-reports/UCSC-SOE-09-14},
 title = {Parallel Language Extensions for Side Effects},
 year = 2009
 }
 @Misc{BaCTiLTechRep,
 author = {Halle, Sean},
- note = {http://www.soe.ucsc.edu/share/technical-reports/2006/ucsc-crl-06-08.pdf},
+ note = {http://www.soe.ucsc.edu/research/technical-reports/UCSC-CRL-06-08},
 title = {BaCTiL: Base CodeTime Language},
 year = 2006
 }
 @Misc{CTPlatformTechRep,
 author = {Halle, Sean},
- note = {http://www.soe.ucsc.edu/share/technical-reports/2006/ucsc-crl-06-09.pdf},
+ note = {http://www.soe.ucsc.edu/research/technical-reports/UCSC-CRL-06-09},
 title = {The Elements of the CodeTime Software Platform},
 year = 2006
 }
 @Misc{CTRTTechRep,
 author = {Halle, Sean},
- note = {http://www.soe.ucsc.edu/share/technical-reports/2006/ucsc-crl-06-10.pdf},
+ note = {http://www.soe.ucsc.edu/research/technical-reports/UCSC-CRL-06-10},
 title = {A Scalable and Efficient Peer-to-Peer Run-Time System for a Hardware Independent Software Platform},
 year = 2006
 }
 @Misc{CIPTechRep,
 author = {Halle, Sean},
- note = {http://www.soe.ucsc.edu/share/technical-reports/2005/ucsc-crl-05-05.pdf},
+ note = {http://www.soe.ucsc.edu/research/technical-reports/UCSC-CRL-05-05},
 title = {The Case for an Integrated Software Platform for {HEC} Illustrated Using the CodeTime Platform},
 year = 2005
 }
@@ -284,14 +289,14 @@
 year = 2008
 }
 @Misc{VMSHome,
- author = {Sean Halle and Merten Sach and Ben Juurlink and Albert Cohen},
- note = {http://virtualizedmasterslave.org},
+ author = {Sean Halle and Merten Sach and Albert Cohen},
+ note = {http://opensourceresearchinstitute.org/pmwiki.php/VMS/HomePage},
 title = {{VMS} Home Page},
 year = 2010
 }
 @Misc{PStackHome,
 author = {Sean Halle},
- note = {http://pstack.sourceforge.net},
+ note = {http://opensourceresearchinstitute.org/pmwiki.php/PStack/HomePage},
 title = {{PStack} Home Page -- A Software Stack for Performantly Portable Parallelism},
 year = 2012
 }