VMS/0__Writings/kshalle

changeset 102:15f50e49ebb7

dealing with switching disks -- commit and push as backup
author Sean Halle <seanhalle@yahoo.com>
date Tue, 17 Sep 2013 06:30:06 -0700
parents eb146c5c05a8
children 26b697944e73
files 0__Papers/Holistic_Model/Perf_Tune__long_version_for_TACO/latex/Holistic_Perf_Tuning.tex 0__Papers/PRT/PRT__formal_def/figures/PR__timeline_dual.pdf 0__Papers/PRT/PRT__formal_def/figures/PR__timeline_dual.svg 0__Papers/PRT/PRT__formal_def/figures/PR__timeline_dual_three_versions.svg 0__Papers/PRT/PRT__formal_def/figures/PR__timeline_dual_w_hidden.pdf 0__Papers/PRT/PRT__formal_def/figures/PR__timeline_dual_w_hidden.svg 0__Papers/PRT/PRT__formal_def/latex/PRT__full_w_Henning_derived_formal_def.tex 0__Papers/PRT/PRT__intro_plus_eco_contrast/helpers/07_F_26__The_Questions__blank.txt 0__Papers/PRT/PRT__intro_plus_eco_contrast/helpers/bib_for_papers.bib 0__Papers/PRT/PRT__intro_plus_eco_contrast/latex/PRT__intro_plus_eco_syst_and_contrast.tex 0__Papers/PRT/PRT__intro_plus_eco_contrast/latex/Paper_Design_2.txt 0__Papers/VMS/VMS__Foundation_Paper/VMS__Full_conference_version/figures/PR__timeline_dual_w_hidden.pdf 0__Papers/transfer_figures_from_attachment/VMS_flat.png 0__Papers/transfer_figures_from_attachment/VMS_nested.png 0__Papers/transfer_figures_from_attachment/VMS_numbers.txt 0__Papers/transfer_figures_from_attachment/nanos_flat.png 0__Papers/transfer_figures_from_attachment/nanos_nested.png 1__Presentations/13__Jy_01__DSLDI/software_stack.png 1__Presentations/13__Sp_08__DFM_workshop/Reo_plus_ProtoRuntime.odp 1__Presentations/13__Sp_08__DFM_workshop/Reo_plus_ProtoRuntime.pdf 1__Presentations/13__Sp_08__DFM_workshop/Reo_plus_ProtoRuntime.pot
diffstat 21 files changed, 5822 insertions(+), 577 deletions(-) [+]
line diff
     1.1 --- a/0__Papers/Holistic_Model/Perf_Tune__long_version_for_TACO/latex/Holistic_Perf_Tuning.tex	Sat Aug 03 19:24:22 2013 -0700
     1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune__long_version_for_TACO/latex/Holistic_Perf_Tuning.tex	Tue Sep 17 06:30:06 2013 -0700
     1.3 @@ -79,100 +79,166 @@
     1.4  
     1.5  
     1.6  \begin{abstract}
     1.7 -Performance tuning is an important aspect of parallel programming that involves understanding both sequential behavior inside work-units and scheduling related behavior arising from how those units are assigned to hardware resources. Many good tools exist for identifying hot-spots in code, and tuning the sequential aspects there, as well as the data layout. They also identify constructs that spend a  long time blocked. However, they provide less help in understanding  why  constructs block for so long, and why no other work fills in.   The answer      often arises from chains of  causal interactions that involve the scheduling choices made in combination with runtime implementation details and  hardware behaviors. Identifying such a chain   requires visualizing each source of causality.  We propose supplementing existing tools with an additional tool to identify these chains and tie the behaviors back to specific spots in the code. To make such a tool, we apply a newly discovered model of the structure of parallel computation, as a guide to gathering each step in the scheduling and execution process. The visualizations of these are superficially similar to existing tools, but include additional  features  to identify unexpected behaviors within the chain and tie them to  related parallelism constructs within the code.
     1.8 -We show how to instrument a parallel language or programming model,   with no need to modify the application.  To simplify illustration, we instrumented the runtime of our own  pi-calculus inspired programming model, called SSR, and we walk through a tuning session on a large multi-core machine, which demonstrates the improvements in  generating hypotheses for both the causes of idle cores, and how to reduce the idleness. 
     1.9 +Performance tuning is an important aspect of parallel programming. It involves tuning both sequential behavior, inside units of work, and the scheduling of those units onto hardware resources. Many good tools exist for the sequential aspects, such as finding hot spots and adjusting the data layout. Many also show statistics related to scheduling, such as periods of idleness on each core and time spent trying to acquire locks. Some exceptional tools even shift blame onto specific parallelism-construct instances in the code. However, more help is often desired in understanding the cause-and-effect among the events leading to the pattern of idleness. Such cause-and-effect tells the coder why the blamed constructs consume so much time, and so enables forming a hypothesis for how to improve the code. We propose an additional tool that visualizes the cause-and-effect cascades. It relies upon a novel theory of the structure of parallel computation, which we touch upon. The resulting visualizations appear superficially similar to those of existing tools, but visual details convey extra information, to reveal the cascades and relate them to the parallelism constructs within the code. The application is not modified; only the runtime system of the parallel language or programming model is instrumented. We present a tutorial-style introduction to the theory and the tool. We instrumented the runtime of our own pi-calculus inspired programming model, called SSR, and walk through using the tool to read the cascade of causes of idle cores and to form hypotheses for how to reduce the idleness. 
    1.10  \end{abstract}
    1.11  
    1.12  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    1.13  \section{Introduction}
    1.14  \label{sec:intro}
    1.15  
    1.16 -Performance visualization and tuning tools for parallel programs are critical to achieving good performance.
    1.17 -Large factors of improvement can be realized by using existing tools to identify hot-spots and adjust the data layout and other internals of the most time-consuming functions.  
    1.18 -
    1.19 -
    1.20 -
    1.21 -However, once such optimization has made progress, the main area available for performance improvement undergoes a shift towards the scheduling  choices made by the runtime. Existing  tools tend to provide less help here, in understanding the complicated interactions among runtime behavior, scheduling decisions, and consequent hardware behavior. The interactions often give rise to unexpected chain reactions that leave cores idle. These involve the constraints stated by the application, the way the work is divided, and the choices made during the run for which unit of work to assign to which hardware resource at which point in time. To tune such behaviors, the chains need to be identified and understood.   
    1.22 +Performance visualization and tuning tools for parallel programs are critical to achieving good performance. Existing tools can be used to achieve large factors of improvement. The first way is by identifying hot-spots and adjusting the data layout and other internals of the most time-consuming functions. This type of tuning considers mainly the sequential aspects of the code.  
    1.23 +
    1.24 +Once such optimization has made progress, further improvement shifts towards the scheduling aspects, which include the choice of which work is placed where. This involves the internals of the runtime system. Improvement here relies on understanding the complicated interactions among runtime behavior, scheduling decisions, and consequent hardware behavior. Unexpected performance losses often involve chain reactions that leave cores idle. The chains involve the constraints stated by the application, the way the work is divided, and the choices made during the run for which unit of work to assign to which hardware resource at which point in time. To tune such behaviors, the chains need to be revealed, in order to understand the causes and effects involved.   
    1.25  
    1.26  %The level of complexity is reflected by the fact that finding the optimal set of scheduling choices is known to be an  NP hard problem [].   
    1.27  
    1.28 -For example, a unit of work completes, which sends a signal to the runtime to update the state of the unit and the state of the hardware resource it occupied. This in turn causes the runtime to choose a different unit to own that hardware and sends the meta-information for that unit to the hardware. This in turn triggers communication, to move  the data consumed by the unit  to the hardware. Then the work of the new unit takes place there.
    1.29 -
    1.30 -Any one of these interactions could be individually abnormal, and an unexpected source of performance loss. For example, there may be congestion in the network that causes the data movement to be especially slow for that particular new unit. 
    1.31 -
    1.32 -Although current tools do a good job of identifying which constructs block, and which cores sit idle, they don't help much in deciphering such complex chains of causal interactions. This needs a visualization of each step in the causal chain.
    1.33 -
    1.34 -
    1.35 -
    1.36 -    
    1.37 -
    1.38 -
    1.39 -To help with this particular aspect of tuning, we propose an additional tool. It collects the causal interactions inside the runtime and hardware, and then visualizes them in a useful way. In addition, it helps to generate ideas for ways to reduce idleness by showing the scheduling-related structure of the application.
    1.40 -
    1.41 -  
    1.42 -
    1.43 -Our approach  is based on  a newly discovered model of the structure of parallel computation, which identifies each causal step. 
    1.44 -The theoretical model adds value   by indicating particular quantities to measure at specific points in the runtime system and by establishing a mental framework that simplifies hypothesis generation.
    1.45 -
    1.46 -The information used during tuning is collected by the runtime as the application executes, which makes it practical and backwards compatible. No modification of applications takes place.
    1.47 -
    1.48 -The information is used to visually link each unit to the units upon which its execution is conditional, via a chain of causal steps.   Semantics attached to the visual features  enable quickly generating the correct hypotheses for the causes of  lost computation opportunities, and quickly narrowing down what can be done, in the application code, to improve performance. In effect, the visualization serves as a map showing idle cores linked back to the sections of application code related.
    1.49 -
    1.50 -
    1.51 -To simplify seeing the value of the approach,  we walk through a session of tuning scheduling choices, in which we point out the semantics of features in the visualization, what each visual feature implies, and how these implications lead to hypotheses and fixes.   Although the views look similar to current tools, they differ in the semantics of the visual features.
    1.52 -
    1.53 -  
    1.54 -
    1.55 -We intend the  contribution of this paper to be the way the parallel model has been applied, rather than any aspects of the model itself.  The distinction is important because the space is used to convey the value gained by the  process of applying the model. Only enough of the model is stated, in \S \ref{sec:theory} and \S \ref{sec:Implementation}, to understand  where the value comes from and how to  instrument a runtime to gain it.
    1.56 -
    1.57 -This paper is organized as follows:  we introduce the semantics of features of our visualization  by walking through a tuning session in Section \ref{sec:casestudy}, and then expand on the theoretical model behind it in Section \ref{sec:theory}. Section \ref{sec:Implementation} gives  details of instrumenting a runtime to collect data according to the model. We relate the approach to other tuning tools in Section \ref{sec:related} and draw conclusions in Section \ref{sec:conclusion}.
    1.58 -
    1.59 +As an example of such a chain: a unit of work completes, which sends a signal to the runtime to update the runtime's internal state for the unit and for the hardware resource it occupied. This in turn causes the runtime to choose a different unit to own that hardware and to send the meta-information for that unit to the hardware. This in turn triggers communication, to move the data consumed by the unit to the hardware. Then the work of the new unit takes place there.
    1.60 +
    1.61 +Any one of these interactions could be individually abnormal, and an unexpected source of performance loss. In particular, there may be congestion in the network that causes the data movement to be especially slow for that new unit, or the runtime may experience contention for access to its internal state (i.e., contention for an internal lock). 
    1.62 +
    1.63 +We propose augmenting current tools with a way to visualize each link in such cause-and-effect cascades.  The augmentation focuses on just this one aspect of performance tuning.  It adds the ability to see inside the runtime system's operation, so that all links in the cause-and-effect chains can be visualized.  It also assigns a meaning to each link, to aid in understanding the patterns.  The meaning is determined according to a theory of parallel computation. We touch upon this theory just enough to explain the meaning assigned to each link.  
    1.64 +
    1.65 +
    1.66 +%To help with this particular aspect of tuning, we propose an additional tool. It collects the causal interactions inside the runtime and hardware, and then visualizes them in a useful way. In addition, it helps to generate ideas for ways to reduce idleness by showing the scheduling-related structure of the application.
    1.67 +
    1.68 +%Our approach  is based on  a newly discovered model of the structure of parallel computation, which identifies each causal step. The theoretical model adds value   by indicating particular quantities to measure at specific points in the runtime system and by establishing a mental framework that simplifies hypothesis generation.
    1.69 +
    1.70 +No modification of the application takes place.  The new tool instruments the language implementation: instrumentation is inserted at key points in the runtime system of the language.  All the information used to visualize the cause-and-effect cascades is collected by the runtime system as the application executes. 
    1.71 +
    1.72 +The information is presented visually, as blocks of time taken by units of work that were scheduled, and as links among those blocks of time.  The time counted against a work-unit is broken down into categories such as: time taken to decide where to execute the unit; time to move the data of the unit; time inside the runtime to communicate completion of units and changes in the states of hardware resources; and time to move the meta-information of the unit.  The categories are defined by the computation theory.
    1.73 +
    1.74 +%presented as  broken down , and links link each unit to the units upon which its execution is conditional, via a chain of causal steps.   Semantics attached to the visual features  enable quickly generating the correct hypotheses for the causes of  lost computation opportunities, and quickly narrowing down what can be done, in the application code, to improve performance. In effect, the visualization serves as a map showing idle cores linked back to the sections of application code related.
    1.75 +
    1.76 +This introductory paper focuses on explaining the basics of the approach in a tutorial fashion. This involves learning the categories defined by the theory as well as the meaning of the fine details within the visualization. It also covers learning how to generate hypotheses, from the visualization, about what to change in the code, to modify the cascades of cause-and-effect.
    1.77 +
    1.78 +This paper does not cover advanced uses of the approach, as space requires choosing between providing a tutorial-style understanding and illustrating advanced features without the understanding needed to follow them.  Expert readers may find fault with our choice, desiring a more in-depth comparison to the de facto tools, illustrating what this paper's tool adds beyond them.  We ask such readers to be patient, take the time to understand our approach, and allow us to present the more advanced features in a future paper.
    1.79 +
    1.80 +
    1.81 +We will present ways the proposed tool differs from other tools, namely by adding detail about where a core's time has been spent, and by providing a wider variety of cause-and-effect links, which even cover causation internal to the runtime.  At the end of the section on the basic theory, \S\ref{subsec:SCG}, we revisit how the proposed tool goes further, to supplement existing ones.
    1.82 +
    1.83 +
    1.84 +%To simplify seeing the value of the approach,  we walk through a session of tuning scheduling choices, in which we point out the semantics of features in the visualization, what each visual feature implies, and how these implications lead to hypotheses and fixes.   Although the views look similar to current tools, they differ in the semantics of the visual features.
    1.85 +
    1.86 +%The contribution of this paper to be the way the parallel model has been applied, rather than any aspects of the model itself.  The distinction is important because the space is used to convey the value gained by the  process of applying the model. Only enough of the model is stated, in \S \ref{sec:theory} and \S \ref{sec:Implementation}, to understand  where the value comes from and how to  instrument a runtime to gain it.
    1.87 +
    1.88 +This paper is organized as follows:  we introduce the basics of the theory behind the tool in \S\ref{sec:basics}, then give the features of our visualization and their meaning in \S\ref{sec:visualization}. We illustrate their use by walking through a tuning session in \S\ref{sec:casestudy}. To prepare for explaining how to implement the approach, we expand on the theory behind it in \S\ref{sec:theory}. \S\ref{sec:Implementation} then gives the details of how to instrument a runtime to collect the data dictated by the theory. We briefly relate the approach to other tuning tools in \S\ref{sec:related} and draw conclusions in \S\ref{sec:conclusion}.
    1.89  
    1.90  
    1.91  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    1.92 -\section{Illustration of Tuning Scheduling Decisions}
    1.93 -\label{sec:casestudy}
    1.94 -
    1.95 -To demonstrate the value of the approach, we create a slightly contrived example  that brings out particular features of the tool. but uses a real application running on real hardware. 
    1.96 -
    1.97 -We start  by describing the hardware being run on and structure of the program being tuned, and follow by describing the parallel language used and instrumented. We then  advance to the features of the visualization.
    1.98 -
    1.99 - A sequence of visualizations follow. In each,  we point out how the  performance loss is identified, and which visual features suggest the hypothesis of the cause of the loss.  We show how the visual features direct the user to the specific sections of code that need to be changed, and how the model suggests what changes to try. 
   1.100 -
   1.101 -\subsection{Setup: Hardware and Application}
   1.102 -
   1.103 -
   1.104 -
   1.105 -We run our example on a machine with 4 sockets by 10 cores each, for a total of 40 physical cores. They are Intel WestmereEx cores running at 3.0GHz, with TurboBoost turned off for reproducability. 
   1.106 -
   1.107 -We wish to tune scheduling on a standard program that the reader  knows well, so we chose matrix multiply, with which the reader should be familiar.  This allows concentration on the tool without distraction about the application. 
   1.108 -
   1.109 -The application is structured as follows: it  creates one entity to divide the work into a number of pieces and  creates another entity for each piece of work. How many pieces is determined by the combination of a tuning parameter in the code with the number of cores. The work is distributed across the cores  in a round-robin fashion, unless otherwise specified.
   1.110 -
   1.111 -The application also creates an entity that manages the partial-results. Each piece of work sends its contribution, which is accumulated into the overall  result. The entity that divides waits for the entity that accumulates to signal completion then the language runtime shuts down.
   1.112 -
   1.113 -\subsection{Programming Model}
   1.114 -We chose a simple language that was convenient to instrument. It is inspired by pi-calculus, and called Synchronous Send-Receive (SSR).  It implements  rendez-vous style send and receive operations made between virtual processors (VPs), where a VP is similar to a software thread. The example application uses the commands for creating and destroying VPs, two kinds of send-receive paired operations, a parallel singleton, and scheduling control constructs. 
   1.115 -
   1.116 -The first kind of send-receive pair used, and instrumented, is precise about sender and receiver. Called \emph{send\_from\_to}, it specifies both sender and receiver VPs, and is used by the results VP  to tell the divider VP that the work is complete. The second pair, \emph{send\_of\_type\_to}, specifies only a specific receiver, leaving the sender anonymous. It has increased flexibility while maintaining some control over scope. The worker VPs use this to send their partial result to the results VP. 
   1.117 -
   1.118 -The application uses the \emph{singleton} construct to reduce the amount of work done by the (sequential) divider VP. The construct designates a piece of code as to be executed only once, even though it is invoked by multiple VPs. It is employed to shift the work of copying matrix fragments out of the divider and over to the worker-pieces. The first worker-piece to use a given input-matrix fragment performs the copy, which spreads the copying across the cores.
   1.119 -
   1.120 -We control the scheduling behaviors, in order to concisely illustrate the use and features of the tool. This is done with language constructs that force which core a virtual processor is assigned to. 
   1.121 -
   1.122 -A note on terminology: We often use the term  ``work-unit'',  which we define precisely, instead of ``task'', which has acquired multiple  meanings in the literature. Work-unit  is defined as the trace-segment performed on a core, between two successive scheduling events, plus the set of datums consumed by that trace segment. The word task often maps well onto this  definition, and we use both words, but mean the precise work-unit definition when we say task.
   1.123 -
   1.124 -
   1.125 +\section{Basics of the Approach}
   1.126 +\label{sec:basics}
   1.127 +
    1.128 +This section introduces the basics of the approach: the categories of time
    1.129 +defined by the theory, the features of our visualization, and the meaning
    1.130 +of the fine details within it.
    1.131 +
   1.132 +
   1.133 +\subsection{Elements of the Theory}
    1.134 +The core purpose of the theory of parallel computation on which we base the tool is to connect the contributions of application code, language implementation, and hardware.  It first establishes a set of primitive concepts that form the basis of parallel computation, and then establishes relationships among those primitives.  The user of the theory can then look at the application code, the language implementation, and the hardware, and understand how the three interact. The interaction explains how observed performance came about, or can predict performance while consuming minimal computation. Such understanding can be useful in many ways; the tool described in this paper is just one useful way to apply the theory. 
   1.135 +
   1.136 +\subsubsection{Primitive Elements of the Theory}
    1.137 +The theory chooses different primitives than those used in the broad literature and common practice.  It doesn't consider the concept of a compiler to be well defined, nor the concept of a runtime system, nor even the concept of a programming language.  Yet these are all clearly concepts in common usage, and each plays a role in the performance of parallel code.  So the theory attempts to identify patterns within them that are indeed well defined and that remain invariant across applications, languages, and tool chains.  Such a rearrangement of basic terms comes at the price of a steep learning curve, making the theory challenging to accept.
   1.138 +
    1.139 +The primitives chosen are: unit-of-work, constraints on scheduling a unit of work, managing the constraints, and mapping free units of work onto animators that perform the work.  These primitives can then be related to compilers, programming languages, runtime systems, and so forth: each of those can be defined by the effect it has on the primitives of the theory.  We leave such definitions to other papers, while here simply asserting that this set of primitives appears to be universal, invariant, and capable of serving as a basis upon which essentially all aspects of parallel computation can be understood.  We ask the reader's patience in accepting this rather bold claim for the moment; we examine it more closely in future papers focused on such claims.
   1.140 +
    1.141 +These primitives are defined in terms of each other, so their definitions have a circular quality, but as a whole they form a consistent, interlocking set.  Formal definitions do exist, but are outside the scope of this paper.
   1.142 +
   1.143 +\begin{description}
    1.144 +\item[A unit of work] (also called a work unit, or simply a unit) is the thing that is scheduled.  It is defined as the thing about which constraints are stated, which implies that a scheduling decision must be made for each unit of work, and that a single unit of work is indivisible from the point of view of scheduling.  Examples of work units are a single firing of a dataflow node, the trace of instructions between consecutive calls to pthread constructs, and a single iteration of a parallel-for loop.
   1.145 +
    1.146 +\item[A constraint on scheduling a unit of work], loosely, is some condition that must be satisfied before the unit of work can be assigned to an animator. Examples include ``must acquire the lock variable'', ``must wait for a paired send from a different animator'', and ``comes after completion of the preceding work-unit''.
   1.147 +
    1.148 +\item[Managing constraints], in short, is part of scheduling. It is the process of communicating among the animators the changes in state of units of work and the changes in internal constraint state, and computing on those so that all constraints on scheduling work are upheld during the evolution of a computation.  Some languages place the bulk of this management inside the compiler; others place the bulk inside a runtime system.
   1.149 +
    1.150 +\item[Mapping free units of work onto animators], in short, is the assignment of work onto cores.  It takes place after constraint management has communicated that particular units are free from constraints.  Sometimes this mapping is performed statically inside the compiler, sometimes as part of application code, and sometimes inside a language's or execution model's runtime system. This is the other significant part of the scheduling process. 
   1.151 +
   1.152 +\end{description}
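The interplay of the four primitives can be sketched in code. The following is a minimal sketch under our own assumptions, not SSR and not any real runtime; all names (WorkUnit, Constraint, run) are invented for illustration. It shows units carrying constraint state, constraint management propagating completion, and mapping of free units onto animators.

```python
# Illustrative sketch only: these class and function names are our own,
# not part of SSR or any real runtime system.

class WorkUnit:
    """The indivisible thing that is scheduled."""
    def __init__(self, name, work):
        self.name = name
        self.work = work        # callable performing the unit's work
        self.blockers = set()   # unmet constraints on scheduling this unit

class Constraint:
    """One constraint kind: 'after' must wait for 'before' to complete."""
    def __init__(self, before, after):
        self.before, self.after = before, after
        after.blockers.add(self)

def run(units, constraints, num_animators=2):
    """Manage constraints, then map free units onto animators (cores)."""
    done, order = set(), []
    while len(done) < len(units):
        # mapping: pick units whose constraints are all satisfied
        free = [u for u in units if u not in done and not u.blockers]
        for u in free[:num_animators]:
            u.work()                 # animate: perform the work
            done.add(u)
            order.append(u.name)
            # constraint management: propagate the completion
            for c in constraints:
                if c.before is u:
                    c.after.blockers.discard(c)
    return order
```

For example, with units A and B and a single ``B after A'' constraint, run() animates A before B regardless of the order in which the units are listed.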
   1.153 +
   1.154 +\subsection{Organizing information related to the primitives}
   1.155 +To organize information related to the primitive elements, the theory defines two kinds of graph.  They encode information for a given application written in terms of a given language, executed on given hardware.
   1.156 +
    1.157 +The first kind of graph, called a Unit and Constraint Collection (UCC), encodes the units of work and the constraints on scheduling them.  It is independent of any particular implementation of the language and of the hardware on which the application runs.  It essentially characterizes the parallelism in the application.
   1.158 +
   1.159 +The second kind of graph, called a Scheduling Consequence Graph (SCG), also depicts the work units, but instead of constraints, it encodes the particular scheduling choices made during a given run on given hardware.  It accounts for all time taken by each processing core, dividing it among categories and charging it against particular work units.
   1.160 +
   1.161 +\subsection{UCC}
   1.162 +
    1.163 +The elements of the Unit \& Constraint Collection, or UCC, are the application's units of work and the constraints on scheduling them. Constraints can be explicitly stated in the code, such as ``acquire lock'', or implied by language constructs, such as ``parallel-for''. They limit the choices available to the runtime. We use the general term ``constraint'' instead of the specific term ``dependency'' because dependency covers only one pattern: this unit before that one. Constraints are more general; an example is a mutual exclusion on scheduling a group of units: any order, but only one at a time.
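To make the generality of constraints concrete, a UCC can be thought of as a plain record of units plus constraints of several kinds. The sketch below is a hypothetical encoding of our own (the unit names, constraint kinds, and the function violations are invented for illustration); it checks a proposed schedule against both a dependency-style ``after'' constraint and a mutual-exclusion constraint.

```python
# Hypothetical UCC encoding; all names and constraint kinds are illustrative.
ucc = {
    "units": ["divide", "work1", "work2"],
    "constraints": [
        {"kind": "after", "first": "divide", "then": "work1"},
        {"kind": "after", "first": "divide", "then": "work2"},
        # mutual exclusion: any order, but only one at a time
        {"kind": "mutex", "units": ["work1", "work2"]},
    ],
}

def violations(intervals, ucc):
    """Return the constraints broken by a schedule.

    'intervals' maps each unit name to its (start, end) time on some core.
    """
    broken = []
    for c in ucc["constraints"]:
        if c["kind"] == "after":
            # the dependent unit must not start before the first one ends
            if intervals[c["first"]][1] > intervals[c["then"]][0]:
                broken.append(c)
        elif c["kind"] == "mutex":
            (a0, a1), (b0, b1) = (intervals[u] for u in c["units"])
            if a0 < b1 and b0 < a1:   # the two intervals overlap
                broken.append(c)
    return broken
```

Running work1 and work2 at the same time violates only the mutex constraint; serializing them satisfies the whole UCC.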
   1.164 +
   1.165 +
   1.166 +\subsection{The SCG, and categories of time}
   1.167 +\label{subsec:SCG}
   1.168 +The consequence graph shows the actual scheduling choices \emph{made}. The UCC defines which choices are \emph{allowed}, and from among those, a consequence graph depicts \emph{one} set of choices taken, along with the time consequences.
   1.169 +
   1.170 +Each bit of core time is accounted to a category, then added to the total for a particular unit of work.  This is depicted as boxes, one box for each unit, with a region inside the box for each category of time.  
   1.171 +
    1.172 +There is one category for each primitive: creating a unit of work, adding constraints on a unit of work, managing the constraints, and mapping free units of work onto animators.  In addition, there are categories implied by those: performing the work, waiting idly for communication of work data, waiting idly to receive a unit to animate, and doing internal runtime activities. 
   1.173 +
    1.174 +The internal runtime activities are always related to one of the primitive elements of the theory, but the structure of the runtime shifts which activities are significant enough to warrant separate accounting.
    1.175 +Some examples include time spent communicating constraint updates from one core to another, time spent sending the meta-information about a work unit between cores, and time spent acquiring locks that protect internal runtime state, such as shared constraint state.  On a shared-memory machine, the communication times may be negligible and ``rolled into'' the management category, while on a distributed-memory machine this communication may be a significant portion of the runtime overhead.  An SCG comes together with a definition of the internal runtime activities that it accounts in separate regions.
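As a concrete sketch of the accounting, the following shows one hypothetical way to charge every bit of core time to a category of a particular unit. The category labels here are our own shorthand for the categories described above, not a fixed vocabulary of the theory.

```python
from collections import defaultdict

# Our own shorthand labels for the categories of time described above.
CATEGORIES = ("create", "constrain", "manage", "map",
              "work", "idle_data", "idle_unit", "runtime_internal")

class Accountant:
    """Charges every bit of core time to one category of one unit."""
    def __init__(self):
        # per-unit breakdown, one running total per category
        self.by_unit = defaultdict(lambda: dict.fromkeys(CATEGORIES, 0.0))

    def charge(self, unit, category, seconds):
        assert category in CATEGORIES, category
        self.by_unit[unit][category] += seconds

    def total(self, unit):
        """All core time accounted against this unit, across categories."""
        return sum(self.by_unit[unit].values())
```

In the visualization, the regions inside a unit's box correspond to the nonzero categories in its breakdown.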
   1.176 +
    1.177 +Cause-and-effect relationships exist between regions of different units.  They are depicted as arcs that link regions, or that link entire boxes.  A region may source or sink multiple causality links. Each kind of cause-and-effect is represented by a corresponding kind of link.
   1.178 +
    1.179 +Groupings of kinds of causal links are defined, for convenience: constraint-related cause-and-effect, runtime-internal cause-and-effect, and hardware-related cause-and-effect.  Constraint-related links include satisfaction of a sequential dependency in the base language, and satisfaction of a parallel constraint, such as when one unit does something to satisfy a constraint on another, causing it to be free to be scheduled. An example of an internal runtime link is when the runtime on one core releases an internal lock, causing another core to acquire it. An example of hardware causality is when one work-unit finishes on a core, freeing the core, which causes the runtime to calculate which work-unit to start there next.
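These groupings can be made concrete with a small sketch; the particular link kinds, group names, and region labels below are illustrative inventions of ours, chosen to mirror the examples just given.

```python
# Illustrative grouping of causal-link kinds; all names are our own.
GROUPS = {
    "constraint": {"seq_dependency", "parallel_constraint"},
    "runtime":    {"internal_lock_handoff"},
    "hardware":   {"core_freed"},
}

def group_of(kind):
    """Map a link kind to its convenience group."""
    for group, kinds in GROUPS.items():
        if kind in kinds:
            return group
    raise ValueError("unknown link kind: %s" % kind)

# A link connects a region of one unit's box to a region of another's,
# here written as (unit, region) pairs.
links = [
    {"kind": "parallel_constraint",
     "src": ("work1", "work"), "dst": ("work2", "manage")},
    {"kind": "core_freed",
     "src": ("work1", "work"), "dst": ("work2", "map")},
]
```

Keying each arc's rendering off its group (rather than its exact kind) is one way a visualization can keep the number of visual distinctions manageable.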
   1.180 +
   1.181 +The proposed tool is distinguished from other tools by the presence of the regions, which add detail about where a core's time has been spent, and by the cause-and-effect links, which cover even runtime-internal causations. Some tools, notably ones based on MPI, have the notion of a unit of work, depict the time spent on each, and have some categories of cause-and-effect. However, the proposed tool goes further by defining categories inside of a unit's time, making the unit a universal concept, even for languages that don't naturally expose units, and by including additional sources of cause-and-effect, down into the runtime internals, when they are important for understanding the cascades that result in idle cores.
   1.182 +
   1.183 +
   1.184 +
   1.185 +%======================
   1.186 +%%%%%%%%%%%%%%%
   1.187 +
   1.188 +%The categories are: 
   1.189 +%\begin{enumerate}
   1.190 +%\item creation of a meta-unit
   1.191 +%\item state updates that affect constraints on the unit
   1.192 +%\item the time to make the decision to animate the unit
   1.193 +%\item movement of the meta-unit plus data to physical resources that do the animation
   1.194 +%\item animation of the unit which does the work
   1.195 +%\item communication of state-updates about completion of the unit and freeing the hardware
   1.196 +%\item resulting constraint updates communicated within the runtime, possibly causing new meta-unit creations or freeing other meta-units to be chosen for animation
   1.197 +%\end{enumerate}
   1.198 +
   1.199 +%The runtime region has sub-activities, but we do not detail them here due to space. However, some will be stated in Section \ref{sec:Implementation} when we talk about instrumenting a runtime.
   1.200 +%%%%%%%%%%%%%%%
   1.201 +
   1.202 +%Arcs are gathered into groups according to the nature of the causality they represent.  The kinds of causal links are: satisfaction of a sequential dependency in the base language; satisfaction of a parallel constraint  (i.e., one unit did something to satisfy a constraint on the other, causing it to be free to be scheduled); a causal link internal to the runtime (for example, the runtime on one core releasing a shared lock, causing the other core to acquire it); and causal links in the hardware (for example, one work-unit finishes on a core, causing another work-unit to start there, modulo a choice by the runtime).
   1.203 +
   1.204 +%We will now expand on each of those kinds of causal link.
   1.205 +
   1.206 +%\paragraph{Constraint  causal link} Two entire boxes (units) are linked this way when   action by one unit contributes to satisfaction of a constraint blocking the other unit. This includes sequential dependencies from the base language (which are noted in the tool but normally not displayed).
   1.207 +
   1.208 +%Sequential dependencies may add superfluous constraints that  eliminate some otherwise allowed choices in the UCC. An example would be a sequential \texttt{for} loop that creates work-units -- no parallelism constructs cause the creations to be done in sequence, but the base C language sequentializes it nonetheless. 
   1.209 +
   1.210 +%\paragraph{Runtime internal causal link} Runtime implementation details may introduce ``extra" causalities between units. For example, the runtime we instrumented for this paper runs separately on each core and relies upon a global lock for accessing shared runtime information. This lock introduces a causal relationship  when the runtime on one core is attempting to process one unit, but must wait for the runtime on a different core to finish with its unit.
   1.211 +
   1.212 +% Normally, these are not displayed explicitly, due to clutter, but can be turned on when needed, for instance to determine the cause of a particular pattern of core usage.
   1.213 +
   1.214 +%\paragraph{Hardware causal link} The physical fact that a given resource can only be used by one work-unit at a time introduces hardware causalities. When multiple units are free to execute, but all cores are busy, then completion of a unit  on one core causes (in part) the next ready unit to run on that core. 
   1.215 +
   1.216 +%These are also not normally displayed, due to clutter, and not all hardware dependencies are directly measured. Future work will focus on using the performance counters and other instrumentation to add more information about communication paths taken as a consequence of the scheduling decisions made. It will start with the current linkage of application-code to runtime decisions, and add consequent usage of communication hardware. This gives an end-to-end linkage between runtime choices and caused behavior on the hardware. 
   1.217 +
   1.218 +%%%%%%%%%%%%%%%
   1.219 +
   1.220 +%Every unit has a meta-unit that represents it in the runtime. A  unit is defined as the work that exists after leaving the runtime, up until re-entering it. For example, the trace of instructions on a core, from the point of leaving the runtime up until the next invocation. Looking at this in more detail, every runtime has some form of internal bookkeeping state for a unit, used to track constraints on it and make decisions about when and where to execute. This exists even if that state is just a pointer to a function that sits in a queue. We call this bookkeeping state for a unit the meta-unit.
   1.221 +
   1.222 +%Each  unit also has a life-line, which progresses so:  creation of the meta-unit~\pointer~state updates that affect constraints on the unit~\pointer~the decision is made to animate the unit~ \pointer~movement of the meta-unit plus data to physical resources that do the animation~\pointer~animation of the unit, which does the work~\pointer~communication of state-update, that unit has completed, and hardware is free~\pointer~constraint updates within runtime, possibly causing new meta-unit creations or freeing other meta-units to be chosen for animation.  This repeats for each unit. Each step is part of the model.
   1.223 +
   1.224 +% Note a few implications: first, many activities internal to the runtime are part of a unit's life-line, and take place when only the meta-unit exists, before or after the work of the actual unit; second, communication that is internal to the runtime is part of the unit life-line, such as state updates; third, creation may be implied, such as in pthreads, or triggered such as in dataflow, or be by explicit command such as in StarSs. Once created, a meta-unit may languish before the unit it represents is free to be animated.
   1.225 +
   1.226 +%This explains why the visualizations remain largely the same across languages. The concepts of a meta-unit, a unit, constraints on a unit, and a unit life-line are all valid in every language.  The visualizations are based on these concepts, and so likewise largely remain the same.  In the UCC, only the constraint patterns that represent  the language's constructs change between languages. In the SCG, only which construct a line in the SCG represents changes.
   1.227 +%%%%%%%%%%%%%%
   1.228 +%==================
   1.229  
   1.230  \subsection{The Visualizations}
   1.231  \label{subsec:visualization_def}
   1.232 - The approach has two kinds of visualization, each corresponds to an aspect of the model. One focuses on just the application, conveying   its scheduling related structure. Its main value is understanding what's possible. The other focuses on how this structure interacts with the runtime and hardware. Its value is displaying the causal chains  and linking each step in a chain back to application code. 
   1.233 -
   1.234 - We refer to the first visualization  as a  Unit \& Constraint Collection, or UCC, which is explained in \S \ref{sec:UCCExpl}. Its elements are the application's units of work  and the constraints on scheduling them. Constraints can be explicitly stated in the code, or  implied by language constructs. They limit the choices available to the runtime. We use the  general term ``constraint'' instead of the specific term ``dependency'' because dependency only covers  one pattern: this unit before that one. Meanwhile  constraints are general,  such as e.g. a mutual exclusion on scheduling a group of units: any order, but only one at a time.
   1.235 -
   1.236 -We refer to the second visualization as a scheduling consequence graph (SCG), or just consequence graph, which is explained in \S \ref{sec:SCGExpl}. It depicts where the runtime assigned each of the units, how long the units executed, the sources of overhead, and the changes in constraint-state that triggered each runtime behavior. 
   1.237 -
   1.238 -In short, the UCC states the degrees of freedom enabled by the application, while the consequence graph states how those were made use of, by a particular runtime on particular hardware.
   1.239 +Here we describe the visual representations of a concrete UCC and an SCG. The UCC conveys the scheduling-related structure of the application code, to provide understanding of what schedulings are possible. The SCG focuses on how the constraints in the application interact with the runtime and hardware. It displays the cause-and-effect cascades and links each node in a cascade back to application code, runtime implementation, or hardware. 
   1.240 +
   1.241 +%The Unit \& Constraint Collection, or UCC, elements are the application's units of work  and the constraints on scheduling them. Constraints can be explicitly stated in the code, or  implied by language constructs. They limit the choices available to the runtime. We use the  general term ``constraint'' instead of the specific term ``dependency'' because dependency only covers  one pattern: this unit before that one. Meanwhile  constraints are general,  such as e.g. a mutual exclusion on scheduling a group of units: any order, but only one at a time.
   1.242 +
   1.243 +%We refer to the second visualization as a scheduling consequence graph (SCG), or just consequence graph, which is explained in \S \ref{sec:SCGExpl}. It depicts where the runtime assigned each of the units, how long the units executed, the sources of overhead, and the changes in constraint-state that triggered each runtime behavior. 
   1.244 +
   1.245 +%In short, the UCC states the degrees of freedom enabled by the application, while the consequence graph states how those were made use of, by a particular runtime on particular hardware.
   1.246  
   1.247  \subsubsection{UCC visualization} \label{sec:UCCExpl}
   1.248  
   1.249 @@ -192,7 +258,7 @@
   1.250  \subsubsection{SCG visualization}  \label{sec:SCGExpl}
   1.251  The Scheduling Consequence Graph is the main visualization used to detect chains of causal interactions between elements of the system.  As the  example from the introduction indicated, a unit may end, which sends a signal to the runtime to update the state of the unit and the state of the hardware resources it occupied. This  causes the runtime to choose a different unit to own that hardware and sends the meta-information for that unit to the hardware. This in turn triggers communication, to send the data consumed by the unit to the hardware. Then the work of the new unit takes place there.
   1.252  
   1.253 -Any one of these interactions could be individually abnormal, and an unexpected source of performance loss. The SCG allows selectively turning on visualization of each kind of causal interaction, until the culprit for the unexpected slowdown is seen.  
   1.254 +Any one of these interactions could be individually abnormal, and an unexpected source of performance loss. The SCG allows selectively turning on visualization of each kind of cause-and-effect, until the culprit for the unexpected slowdown is seen.  
   1.255  \begin{figure}[ht]
   1.256    \centering
   1.257    \includegraphics[width = 2in, height = 1.8in]{../figures/SCG_stylized_for_expl.pdf}
   1.258 @@ -201,23 +267,62 @@
   1.259  \end{figure}
   1.260  
   1.261  Fig. \ref{fig:SCG_expl} shows a consequence graph, stylized for purposes of explanation. 
   1.262 -It is composed of a number of columns, one for each core. A column represents time on the core,  increasing as one goes down, measured in clock cycles. It is broken into blocks, each representing the time accounted to one work-unit. Each block is further divided into regions, each a different color, which indicates the kind of activity the core was engaged in during that  region's time-span.
   1.263 -
   1.264 -The application code executed within a block is linked to the block. In our tool, the block is labelled with a unique unitID. This ID  is then linked to the code executed within that unit. In this way, the code of any block can be looked up, along with the parallelism constructs that mark the start and end of the block.
   1.265 -
   1.266 -The kinds of activities within a block are defined by the computation model that underlies the visualization. The first kind of activity is the actual work, plus  waiting for  cache misses. It is represented by a blue-to-red region where the color indicates  intensity of cache misses, with pure red representing at or above the maximum misses per instruction, and pure blue the minimum (the max and min are set in the tool that generates the visualization).
   1.267 -
   1.268 -  The second kind of activity is runtime overhead, represented by a gray region. This is the overhead spent on that particular work-unit. When desired by the user, it is further broken into pieces representing activities inside the runtime. The options include time spent on: constraints, when determining readiness of a work-unit; deciding which ready unit to assign to which hardware; and time spent switching from virtual processor, to the runtime, and back. In this paper, we show all runtime overhead lumped together, however in other circumstances a breakdown can be key to seeing where unexpected slowdown is taking place. 
   1.269 -
   1.270 -The other type of visual feature seen in Fig. \ref{fig:SCG_expl} is lines. Each represents a construct that influenced scheduling, where the color indicates what kind of construct.   A line represents two things: a constraint, whose satisfaction made the lower unit ready, and a decision by the runtime to start the lower unit on that core. 
   1.271 -
   1.272 -In general, lines may also be drawn that represent other kinds of interactions, which affect core usage. For example,  our runtime implementation only allows one core at a time to access shared  scheduling state. Visualization of this can be turned on, as additional lines linking the gray runtime regions of blocks (visualization of such interactions is turned off in this paper for simplicity). 
   1.273 -
   1.274 -Two work-unit blocks that appear in sequence and have no lines drawn between them often have a causal dependency, due to the semantics of the base language (visualization of these causalities is also turned off, but can be inferred via the link to the code).
   1.275 +Each column represents time on one core, increasing as one goes down, measured in clock cycles. Each box represents a unit, specifically the time accounted to it. The categories of time are represented inside a box as regions, each a different color. The color indicates the kind of activity the core was engaged in during that region's time-span.
   1.276 +
   1.277 +A box is labelled with a unique unit ID that links to the code executed within that unit, as well as the parameters passed to it. With this, the code of any box can be looked up, along with the parallelism constructs that mark the start and end of the unit.
   1.278 +
   1.279 +The first kind of activity is the actual work, combined with waiting for  cache misses. It is represented by a blue-to-red region where the color indicates  intensity of cache misses, with pure red representing at or above the maximum misses per instruction, and pure blue the minimum (the max and min are set in the tool that generates the visualization).
   1.280 +
   1.281 +  The second kind of activity is runtime overhead, represented by a gray region. This is the overhead spent on that particular work-unit. When desired by the user, it is further broken into pieces representing activities inside the runtime. The options include time spent on: constraints, when determining readiness of a work-unit; deciding which ready unit to assign to which hardware; and time spent switching from virtual processor, to the runtime, and back. 
   1.282 +
   1.283 +This paper is an introductory tutorial, so we simplify by setting the tool to show all runtime overhead lumped together. However, in other circumstances a breakdown can be key to seeing where unexpected slowdown is taking place. 
   1.284 +
   1.285 +The other type of visual feature seen in Fig. \ref{fig:SCG_expl} is lines. Each represents a cause-and-effect relationship, where the color indicates the kind of causality. In this paper, we set the tool to show only parallelism-construct cause-and-effect. Hence, a line represents two things: a constraint, whose satisfaction made the lower unit ready, and a decision by the runtime to start the lower unit on that core. The color, here, corresponds to the kind of construct.
   1.286 +
   1.287 +For more advanced tuning, lines can be turned on that represent the other kinds of cause-and-effect. For example, our SSR runtime implementation only allows one core at a time to access shared scheduling state. Visualization can be turned on that draws a separate region for the time spent waiting to acquire the lock, along with additional lines that link the region of the box that releases the lock to the lock-acquire region of the box that acquires it next. 
   1.288 +
   1.289 +Two work-unit blocks that appear in sequence and that have no lines drawn between them often have a causal dependency, due to the sequential semantics of the base language.  For this tutorial, visualization of these causalities is also turned off, but can be inferred via the link to the code.
   1.290  
   1.291  Note that many different orderings can be validly chosen by the runtime. The scheduler choices that are valid are determined by three kinds of constraints: application-code constraints, hardware constraints, and constraints imposed by the runtime implementation.
   1.292  
   1.293 -The visual features allow the user to see at a glance the total  execution time (height), idle cores during the run (empty space), cache behavior (color of work regions), degree of overhead (size of gray regions), and which units constrained which other units (lines). All consequence graphs in this paper are at the same scale, so they can be compared directly.
   1.294 +The visual features allow the user to see at a glance the total  execution time (height), idle cores during the run (empty space), cache behavior (color of work regions), degree of overhead (size of gray regions), and which units constrained which other units (lines). Note that all consequence graphs in this paper are at the same scale, so they can be compared directly.
   1.295 +
   1.296 +
   1.297 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   1.298 +\section{Illustration of Tuning Scheduling Decisions}
   1.299 +\label{sec:casestudy}
   1.300 +
   1.301 +In this section we teach how to use the visualization to generate hypotheses about what to change in the code, in order to manage the cascades of cause-and-effect. 
   1.302 +
   1.303 +Because this is a tutorial, we create a contrived example  that brings out particular features of the tool, but still uses a real application running on real hardware.  The example uses features of SSR to explicitly assign units to cores, in the application source code.  This isn't the normal way that tuning would take place, but it allows us to control the activity to illustrate the concepts in the order most conducive to learning.
   1.304 +
   1.305 +We start by describing the hardware being run on and the structure of the program being tuned, and follow by describing the parallel language used and instrumented. 
   1.306 +
   1.307 + A sequence of visualizations follows. In each, we point out how the performance loss is identified, and which visual features suggest a hypothesis about the cause of the loss. We show how the visual features direct the user to the specific sections of code that need to be changed, and how the model suggests what changes to try. 
   1.308 +
   1.309 +\subsection{Setup: Hardware and Application}
   1.310 +
   1.311 +We run our example on a machine with 4 sockets of 10 cores each, for a total of 40 physical cores. They are Intel WestmereEx cores running at 3.0GHz, with TurboBoost turned off for reproducibility.
   1.312 +
   1.313 +For this tutorial, we chose a program that the reader knows well, matrix multiply.  This allows concentration on the tool without distraction about the application. 
   1.314 +
   1.315 +The application is structured as follows: it creates one entity to divide the work into a number of pieces and creates another entity for each piece of work. The number of pieces is determined by the combination of a tuning parameter in the code and the number of cores. The work is distributed across the cores in a round-robin fashion, unless otherwise specified.
   1.316 +
   1.317 +The application also creates an entity that manages the partial results. Each piece of work sends its contribution, which is accumulated into the overall result. The entity that divides waits for the entity that accumulates to signal completion, and then the language runtime shuts down.
   1.318 +
   1.319 +\subsection{Programming Model}
   1.320 +We chose a simple language that was convenient to instrument. It is inspired by pi-calculus, and called Synchronous Send-Receive (SSR). It implements rendezvous-style send and receive operations made between virtual processors (VPs), where a VP is similar to a software thread. The example application uses the commands for creating and destroying VPs, two kinds of send-receive paired operations, a parallel singleton, and scheduling control constructs. 
   1.321 +
   1.322 +The first kind of send-receive pair is precise about sender and receiver. Called \emph{send\_from\_to}, it specifies both sender and receiver VPs, and is used by the results VP to tell the divider VP that the work is complete. The second kind, \emph{send\_of\_type\_to}, specifies only a specific receiver, leaving the sender anonymous. This gives increased flexibility while maintaining some control over scope. The worker VPs use it to send their partial result to the results VP. 
   1.323 +
   1.324 +The application uses the \emph{singleton} construct to reduce the amount of work done by the (sequential) divider VP. The construct designates a piece of code to be executed only once, even though it is invoked by multiple VPs. It is employed to shift the work of copying matrix fragments out of the divider and over to the worker-pieces. The first worker-piece to use a given input-matrix fragment performs the copy, which spreads the copying across the cores.
   1.325 +
   1.326 +For this tutorial, we control the scheduling behaviors, in order to concisely illustrate the tool. This is done with language constructs that force which core a virtual processor is assigned to. 
   1.327 +
   1.328 +A note on terminology: We often use the term  ``work-unit'',  which we define precisely, instead of ``task'', which has acquired multiple  meanings in the literature. Work-unit  is defined as the trace-segment performed on a core, between two successive scheduling events, plus the set of datums consumed by that trace segment. The word task often maps well onto this  definition, and we use both words, but mean the precise work-unit definition when we say task.
   1.329 +
   1.330 +%<snip>
   1.331 +
   1.332  \begin{figure*}[t!]
   1.333    \begin{minipage}[b]{0.25\textwidth}
   1.334          \hfill\subfloat[35.8 Gcycles\\Original]
   1.335 @@ -269,7 +374,7 @@
   1.336  \subsection{Walk-through}
   1.337  \label{subsec:walk-through}
   1.338  
   1.339 -We wish to show the visualizations in a simple  way, to enhance understanding. Hence, this walk through uses a slightly contrived example in which the application explicitly controls where each unit of work is assigned. As a result, the causal interactions of interest are all constraints stated by language constructs. We don't show causalities internal to the runtime system or hardware, although the tool is capable of turning on display of those. We chose to do so to simplify this introduction to the use of the visualizations.  
   1.340 +We wish to show the visualizations in a simple  way, to enhance understanding. Hence, this walk through uses a contrived example in which the application explicitly controls where each unit of work is assigned. As a result, the causal interactions of interest are all constraints stated by language constructs. We don't show causalities internal to the runtime system or hardware, although the tool is capable of turning on display of those. We chose to do so to simplify this tutorial on the use of the visualizations.  
   1.341  
   1.342  Fig. \ref{story} displays all of the scheduling consequence graphs generated during our tuning session. They all use the same scale, for direct comparison. All have 40 columns, one for each core, and relative height indicates relative execution time. The lines in red, orange, and green represent application-code constructs. Red is creation of a virtual processor, green is the many-to-one \texttt{send\_of\_type\_to}, and orange is the singleton construct. For better visibility, only constraints that cross cores are turned on.
   1.343  
   1.344 @@ -300,24 +405,28 @@
   1.345  
   1.346  \subsubsection{Holes in the core usage}\label{subsec:holes}
   1.347  
   1.348 -In Fig \ref{story:e}, the true value of the SCG starts to appear. In it, ``holes'' are noticeable. Inspecting these holes closer, we can see that the stalled blocks are at the ends of orange lines. This tells us definitively that they were waiting upon the completion of a singleton. Other tools may have indicated that singleton constructs have a mild time spent blocked, but they wouldn't show how a single blocked singleton had a  chain-reaction effect, holding up creation, then in turn causing many cores to sit idle. 
   1.349 -
   1.350 - Zooming in on the singletons and tracing out the state inside the runtime shows that the empty space  is a runtime implementation issue. This  analysis required all the forms of information provided by the SCG.  such as  the knowledge of precisely which units came before and after the one blocked by the singleton, in combination with  the fact that it is a singleton construct blocking.  
   1.351 +In Fig. \ref{story:e}, the true value of the SCG starts to appear. In it, ``holes'' are noticeable. Inspecting these holes closer, we can see that the stalled blocks are at the ends of orange lines. This tells us definitively that they were waiting upon the completion of a singleton. Other tools may have indicated that singleton constructs have a mild time spent blocked, but they wouldn't show how a single blocked singleton had a  chain-reaction effect, holding up creation, then in turn causing many cores to sit idle. 
   1.352 +
   1.353 + Zooming in on the singletons and tracing out the state inside the runtime shows that the empty space  is a runtime implementation issue. This  analysis required all the forms of information provided by the SCG,  such as  the knowledge of precisely which units came before and after the one blocked by the singleton, in combination with  the fact that it is a singleton construct blocking.  
   1.354  
   1.355  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   1.356 -\section{The Model Behind the Visualization}
   1.357 +\section{Advanced Application of UCC and SCG}
   1.358  \label{sec:theory}
   1.359 -The value of the visualizations comes from the causality linkages. Being able to trace causes from idle cores back through hardware, runtime implementation, and scheduling choices, provides the added insight that augments the other tools. This linkage tracing is in turn made possible by the model of the structure of computation.    In effect, the tool directly visualizes the model.
   1.360 +%The value of the visualizations comes from the causality linkages. Being able to trace causes from idle cores back through hardware, runtime implementation, and scheduling choices, provides the added insight that augments the other tools. This linkage tracing is in turn made possible by the model of the structure of computation.    In effect, the tool directly visualizes the model.
   1.361  
   1.362  As seen, the model has two parts, a \emph{Unit \&\ Constraint Collection (UCC)}, and a \emph{Scheduling Consequence Graph} (SCG or just consequence graph).  The UCC depicts the scheduling  choices  the application allows, and so shows what the programmer has control over. The consequence graph says which of those were actually taken during the run and the causal linkage of consequences within the hardware, runtime, and succeeding choices.
   1.363  
   1.364 -In this section, we give a more precise description of UCC, then consequence graph.
   1.365 -However, this paper focuses on their application to performance tuning, so we abbreviate here and give a formal definition of the full model in a different paper.
   1.366 +In this section, we convey how the UCC and SCG are applied to more advanced situations. We give a formal definition of the full model in a different paper.
   1.367 +
   1.368  \subsection{Unit \& Constraint Collection}
   1.369  \label{sec:UCC}
   1.370 -A fully specified UCC contains the units of work that get scheduled during a run, and the constraints placed on scheduling those units.  In the simple case, all units and all constraints are fully specified in the application. However, many classes of application exist, and two degrees of freedom  determine how much of the UCC is actually defined in the application vs the input data, or even in the runtime.
   1.371 -
   1.372 -An example of a simple application that has all units and all constraints fixed in the source  is matrix multiply with fixed size matrices.  For other applications, the shape of the UCC is only partially defined by the application code.  Take the matrix multiply used in Section \ref{sec:casestudy}, where an input parameter determined the number of units created. There, the UCC was different for each parameter value. At the extreme, would be parallel search  with load-driven pruning. Pruning the search space means that  the units themselves are a function  of both the input data \emph{and} the previous pruning decisions made by the runtime.
   1.373 +It turns out that for many applications, not all units can be determined just from the source code.  Instead, additional factors come into the picture, such as input data and runtime choices.  In fact, the constraints might also be determined by input data and/or runtime choices.
   1.374 +
   1.375 +For example, all units and constraints can be determined just from the application for matrix multiply when the matrix sizes and number of cores are fixed in the application code. But if the code is changed to accept the size as an input parameter, then the units are no longer known just from the source code.
   1.376 +
   1.377 +We call a UCC that contains all the units, with all constraints on them, a fully concrete UCC. Meanwhile, a UCC that requires additional information before becoming fully concrete we call an abstract UCC.
   1.378 +
   1.379 +A UCC can be abstract in the units, abstract in the constraints, or both. This means that there is more than one possible concrete UCC for an application that defines an abstract UCC.
   1.380  
   1.381  \begin{figure}[ht]
   1.382    \centering
   1.383 @@ -326,15 +435,25 @@
   1.384    \label{fig:UCC_example}
   1.385  \end{figure}
   1.386  
   1.387 -    We call a fully specified UCC a \emph{concrete} UCC.  Every run of an application eventually winds up defining a concrete UCC. The example in   Fig. \ref{fig:UCC_example} was produced for the performance tuning. Every unit that was scheduled in the SCG appears in it,  along with  the application-defined constraints on scheduling them.  The division parameter  determined the units. Hence, the application alone does not specify the concrete UCC, because the units remain unknown until parameter values are given. 
   1.388 -
   1.389 -In general, the amount of UCC made concrete by the application  falls into a two-dimensional grid. One dimension covers the units, the other the constraints, as shown in 
   1.390 -Fig. \ref{fig:UCC_Concreteness}.  The two axes each have  four kinds of  information that has to be added in  order to determine the units and constraints,  in the final concrete UCC. 
   1.391 +% We call a fully specified UCC a \emph{concrete} UCC.  
   1.392 +
   1.393 +Every run of an application eventually winds up defining a concrete UCC. The example in Fig. \ref{fig:UCC_example} was produced during performance tuning. Every unit that was scheduled in the SCG appears in it, along with the application-defined constraints on scheduling them. However, only after the division parameter was known could the units be determined.  The added information turned the original abstract UCC into a fully concrete one. 
   1.394 +
   1.395 +%A fully specified UCC contains all the units of work that get scheduled during a run, and the constraints placed on scheduling those units.  In the simple case, all units and all constraints are fully specified in the application. However, many classes of application exist, and two degrees of freedom  determine how much of the UCC is actually defined in the application vs by the input data, or even in the runtime.
   1.396 +
   1.397 +
   1.398 +
   1.399 +%An example of a simple application that has all units and all constraints fixed in the source  is matrix multiply with fixed size matrices.  For other applications, the shape of the UCC is only partially defined by the application code.  Take the matrix multiply used in Section \ref{sec:casestudy}, where an input parameter determined the number of units created. There, the UCC was different for each parameter value. At the extreme, would be parallel search  with load-driven pruning. Pruning the search space means that  the units themselves are a function  of both the input data \emph{and} the previous pruning decisions made by the runtime.
   1.400 +
   1.401 +In general, how much of the UCC the application makes concrete falls onto a two-dimensional grid. One dimension covers the units, the other the constraints, as shown in Fig. \ref{fig:UCC_Concreteness}.  Each axis has four kinds of information that determine the units and constraints of the final concrete UCC. 
   1.402  
   1.403  The UCC may change at multiple points in an application's lifecycle. The position a UCC lands on in the grid indicates how far it is from being fully concrete.  The horizontal position indicates what inputs are still needed to determine the units, and the vertical position the constraints.  0 indicates that the units (constraints) are fully determined by the application code alone; 1 means parameter values must also be known; 2 means input data values also play a role; and 3 means runtime decisions play a role in determining the units (constraints).
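The grid position can be encoded directly; the sketch below (hypothetical names; a classification aid, not part of any tool described here) pairs a level from 0 to 3 on each axis, with the origin as the fully concrete case.

```c
/* Levels 0..3 along each axis of the concreteness grid: what
 * information is still needed to fix the units (or constraints). */
typedef enum {
    CODE_ALONE  = 0,  /* application code alone suffices     */
    PARAMETERS  = 1,  /* parameter values must also be known */
    INPUT_DATA  = 2,  /* input data values also play a role  */
    RUNTIME_DEC = 3   /* runtime decisions also play a role  */
} Level;

typedef struct { Level units; Level constraints; } UCCPosition;

/* A UCC is fully concrete only at the origin of the grid. */
int is_fully_concrete(UCCPosition p) {
    return p.units == CODE_ALONE && p.constraints == CODE_ALONE;
}
```

For instance, the UCC labeled A in the figure sits at the origin, while B needs input data for its units and parameters for its constraints.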
   1.404  
   1.405  The concept of the UCC and its concreteness provides value in classifying applications and algorithms, and in predicting which types of scheduling approach and hardware they will perform best on. The concepts also help in understanding what optimizations do to code, and what changes may improve performance on a given hardware plus runtime combination. 
   1.406  
   1.410  \begin{figure}[ht]
   1.411    \centering
   1.412    \includegraphics[width = 2in, height = 1.8in]{../figures/UCC_concreteness_grid.pdf}
   1.413 @@ -342,15 +461,14 @@
   1.414    \label{fig:UCC_Concreteness}
   1.415  \end{figure}
   1.416  
   1.417 -In the concreteness grid, the closer an application-derived UCC is to the origin, the less additional information is needed to obtain a  concrete UCC descendant of it. For example, the UCC labeled A in the figure is fully concrete just from the source code alone. It represents, for example, matrix multiply with fixed size matrices and fixed division. The UCC labeled B requires the input data plus parameters to be specified before its units are concrete, but just parameters to make its constraints fully concrete. Ray-tracing with bounce depth specified as a parameter may be like this. The UCC labeled C only has variability in its constraints, which require input data. An example would be H.264 motion vectors.
   1.418 -But even the least concrete UCC, D, at the end of the diagonal, generates a concrete descendant UCC while a run of the application unfolds.
   1.419 +In the concreteness grid, the closer an application-derived UCC is to the origin, the less additional information is needed to obtain a  concrete UCC descendant of it. The UCC labeled A in the figure is fully concrete just from the source code alone. It represents, for example, matrix multiply with fixed size matrices and fixed division. The UCC labeled B requires the input data plus parameters to be specified before its units are concrete, but only needs parameters to make its constraints fully concrete. Ray-tracing with bounce depth specified as a parameter may be an example of this. The UCC labeled C only has variability in its constraints, which require input data. An example would be H.264 motion vectors.
   1.420 +The least concrete UCC, D, sits at the end of the diagonal; yet even it generates a concrete descendant UCC as the run of the application unfolds.
   1.421   
   1.422 -Bear in mind that  even a fully concrete UCC still has degrees of freedom when deciding which units to run on which hardware and in what order of execution. Those decisions determine interactions within the hardware, to yield the communication patterns and consequent performance  during the run,  visualized by the SCG. 
   1.423 -
   1.424 -
   1.425 -
   1.426 -As noted, an application has a lifecycle, spanning from editing code all the way through  the run, and its representation  may change at the different stages of life, with corresponding changes to the UCC.
   1.427 - For example, specialization may perform a static scheduling, which fixes the units, moving the UCC towards the origin. Alternatively, the toolchain may inject manipulator code for the runtime to use, which lets it divide units during the run when it needs more units. The injection of manipulator code makes the UCC less concrete, moving it further from the origin.
   1.428 +Bear in mind that even for a fully concrete UCC, the runtime still decides which units to run on which hardware, and in what order. Those decisions determine interactions within the hardware, in the form of communication patterns and consequent performance during the run, as visualized by the SCG. 
   1.429 +
   1.430 +An application's lifecycle spans from editing code, through the toolchain during build, all the way through the run. Its representation may change at each stage of life, with corresponding changes to the UCC.  For example, specialization may perform a static scheduling, which fixes the units, moving the UCC towards the origin. 
   1.431 +
   1.432 +Alternatively, the toolchain may inject manipulator code, which the runtime later uses to divide units during the run. Injecting the manipulator code makes the UCC more abstract, moving it further from the origin.
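A minimal sketch of what such manipulator code might look like (hypothetical names and unit representation; the paper does not fix the form of the injected code): a routine the runtime calls during the run to split one unit's iteration range in two, creating a unit that did not exist in the application-derived UCC.

```c
/* A work-unit as a half-open iteration range [start, end). */
typedef struct { int start; int end; } RangeUnit;

/* Injected manipulator: split u at its midpoint. The first half
 * stays in *u; the new second half is returned as a fresh unit
 * that the runtime can schedule separately. */
RangeUnit divide_unit(RangeUnit *u) {
    int mid = u->start + (u->end - u->start) / 2;
    RangeUnit second = { mid, u->end };
    u->end = mid;
    return second;
}
```

Because the runtime may or may not invoke the split, the set of units is no longer fixed before the run, which is what moves the UCC away from the origin.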
   1.433  
   1.434  The progression of UCCs has value in performance tuning  because it indicates what is inside the application programmer's control vs under control of each tool in the toolchain or  the runtime. For example, the original application-derived UCC  shows what can be done statically: the further out on the diagonal that UCC is, the less scheduling can be done statically in the toolchain.
   1.435  
   1.436 @@ -373,7 +491,7 @@
   1.437  
   1.438  \paragraph{Constraint  causal link} Two entire boxes (units) are linked this way when   action by one unit contributes to satisfaction of a constraint blocking the other unit. This includes sequential dependencies from the base language (which are noted in the tool but normally not displayed).
   1.439  
   1.440 -Sequential dependencies may add superfluous constraints that  eliminate some otherwise allowed choices in the UCC. An example would be a \texttt{for} loop that creates work-units -- no parallelism constructs cause the creations to be done in sequence, but the base C language sequentializes it nonetheless. 
   1.441 +Sequential dependencies may add superfluous constraints that  eliminate some otherwise allowed choices in the UCC. An example would be a sequential \texttt{for} loop that creates work-units -- no parallelism constructs cause the creations to be done in sequence, but the base C language sequentializes it nonetheless. 
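The superfluous constraint can be sketched as follows (hypothetical names; the creations are mutually independent, yet the base language orders them):

```c
#define N_UNITS 4

typedef struct { int id; int created; } WorkUnit;

/* The for loop imposes creation order 0,1,2,3 -- a constraint that
 * exists only because C executes statements in sequence, not because
 * any unit's creation depends on another's. */
void create_units(WorkUnit units[]) {
    for (int i = 0; i < N_UNITS; i++) {
        units[i].id = i;
        units[i].created = 1;
    }
}
```

Any permutation of the loop body's iterations would produce the same set of units, which is why the ordering is superfluous from the UCC's point of view.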
   1.442  
   1.443  \paragraph{Runtime internal causal link} Runtime implementation details may introduce ``extra'' causalities between units. For example, the runtime we instrumented for this paper runs separately on each core and relies upon a global lock for accessing shared runtime information. This lock introduces a causal relationship  when the runtime on one core is attempting to process one unit, but must wait for the runtime on a different core to finish with its unit.
   1.444  
     2.1 Binary file 0__Papers/PRT/PRT__formal_def/figures/PR__timeline_dual.pdf has changed
     3.1 --- a/0__Papers/PRT/PRT__formal_def/figures/PR__timeline_dual.svg	Sat Aug 03 19:24:22 2013 -0700
     3.2 +++ b/0__Papers/PRT/PRT__formal_def/figures/PR__timeline_dual.svg	Tue Sep 17 06:30:06 2013 -0700
     3.3 @@ -89,62 +89,52 @@
     3.4         d="m 196.98465,281.37498 c 69.82336,0 69.82336,0 69.82336,0"
     3.5         style="fill:#800000;stroke:#800000;stroke-width:1.80000007;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:none"
     3.6         inkscape:connector-curvature="0" />
     3.7 -    <g
     3.8 -       transform="translate(-32,-120)"
     3.9 -       id="g7355"
    3.10 -       style="stroke-width:1.79999995;stroke-miterlimit:4;stroke-dasharray:none">
    3.11 -      <path
    3.12 -         style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
    3.13 -         d="m 298.82881,392.82004 c 0,19.38279 0,19.38279 0,19.38279"
    3.14 -         id="path7357"
    3.15 -         inkscape:connector-curvature="0" />
    3.16 -      <text
    3.17 -         sodipodi:linespacing="100%"
    3.18 -         id="text7359"
    3.19 -         y="376.52615"
    3.20 -         x="298.7023"
    3.21 -         style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
    3.22 -         xml:space="preserve"><tspan
    3.23 -           y="376.52615"
    3.24 -           x="298.7023"
    3.25 -           id="tspan7361"
    3.26 -           sodipodi:role="line"
    3.27 -           style="font-size:10px;text-align:center;text-anchor:middle">Suspend</tspan><tspan
    3.28 -           y="385.74353"
    3.29 -           x="298.7023"
    3.30 -           sodipodi:role="line"
    3.31 -           id="tspan7363"
    3.32 -           style="font-size:9px;text-align:center;text-anchor:middle">(Point 2.S)</tspan></text>
    3.33 -    </g>
    3.34 -    <g
    3.35 -       transform="translate(-60,-120)"
    3.36 -       id="g7365"
    3.37 -       style="stroke-width:1.8;stroke-miterlimit:4;stroke-dasharray:none">
    3.38 -      <path
    3.39 -         style="fill:none;stroke:#000000;stroke-width:1.8;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;stroke-miterlimit:4;stroke-dasharray:none"
    3.40 -         d="m 378.82881,392.77746 c 0,19.15152 0,19.15152 0,19.15152"
    3.41 -         id="path7367"
    3.42 -         inkscape:connector-curvature="0" />
    3.43 -      <text
    3.44 -         sodipodi:linespacing="100%"
    3.45 -         id="text7369"
    3.46 -         y="376.52615"
    3.47 -         x="378.7023"
    3.48 -         style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
    3.49 -         xml:space="preserve"><tspan
    3.50 -           y="376.52615"
    3.51 -           x="380.20621"
    3.52 -           id="tspan7371"
    3.53 -           sodipodi:role="line"
    3.54 -           style="font-size:9px;text-align:center;text-anchor:middle"><tspan
    3.55 -             style="font-size:10px"
    3.56 -             id="tspan8087">Resume </tspan></tspan><tspan
    3.57 -           y="385.74353"
    3.58 -           x="378.7023"
    3.59 -           sodipodi:role="line"
    3.60 -           id="tspan7373"
    3.61 -           style="font-size:9px;text-align:center;text-anchor:middle">(Point 2.R)</tspan></text>
    3.62 -    </g>
    3.63 +    <path
    3.64 +       inkscape:connector-curvature="0"
    3.65 +       id="path7357"
    3.66 +       d="m 266.82881,272.82004 c 0,19.38279 0,19.38279 0,19.38279"
    3.67 +       style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none" />
    3.68 +    <text
    3.69 +       xml:space="preserve"
    3.70 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
    3.71 +       x="266.7023"
    3.72 +       y="298.52615"
    3.73 +       id="text7359"
    3.74 +       sodipodi:linespacing="100%"><tspan
    3.75 +         style="font-size:10px;text-align:center;text-anchor:middle"
    3.76 +         sodipodi:role="line"
    3.77 +         id="tspan7361"
    3.78 +         x="266.7023"
    3.79 +         y="298.52615">Suspend</tspan><tspan
    3.80 +         style="font-size:9px;text-align:center;text-anchor:middle"
    3.81 +         id="tspan7363"
    3.82 +         sodipodi:role="line"
    3.83 +         x="266.7023"
    3.84 +         y="307.74353">(Point 2.S)</tspan></text>
    3.85 +    <path
    3.86 +       inkscape:connector-curvature="0"
    3.87 +       id="path7367"
    3.88 +       d="m 318.82881,272.77746 c 0,19.15152 0,19.15152 0,19.15152"
    3.89 +       style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none" />
    3.90 +    <text
    3.91 +       xml:space="preserve"
    3.92 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
    3.93 +       x="318.7023"
    3.94 +       y="298.52615"
    3.95 +       id="text7369"
    3.96 +       sodipodi:linespacing="100%"><tspan
    3.97 +         style="font-size:9px;text-align:center;text-anchor:middle"
    3.98 +         sodipodi:role="line"
    3.99 +         id="tspan7371"
   3.100 +         x="320.20621"
   3.101 +         y="298.52615"><tspan
   3.102 +           id="tspan8087"
   3.103 +           style="font-size:10px">Resume </tspan></tspan><tspan
   3.104 +         style="font-size:9px;text-align:center;text-anchor:middle"
   3.105 +         id="tspan7373"
   3.106 +         sodipodi:role="line"
   3.107 +         x="318.7023"
   3.108 +         y="307.74353">(Point 2.R)</tspan></text>
   3.109      <text
   3.110         sodipodi:linespacing="100%"
   3.111         id="text7375"
   3.112 @@ -180,11 +170,11 @@
   3.113      <path
   3.114         inkscape:connector-curvature="0"
   3.115         style="fill:none;stroke:#422fac;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:none"
   3.116 -       d="m 195.92204,221.37498 c 33.06652,0 33.06652,0 33.06652,0"
   3.117 +       d="m 195.92204,239.37498 c 33.06652,0 33.06652,0 33.06652,0"
   3.118         id="path8095" />
   3.119      <g
   3.120         id="g8097"
   3.121 -       transform="translate(-70,-180)"
   3.122 +       transform="translate(-70,-162)"
   3.123         style="stroke-width:1.79999995;stroke-miterlimit:4;stroke-dasharray:none">
   3.124        <path
   3.125           inkscape:connector-curvature="0"
   3.126 @@ -211,13 +201,13 @@
   3.127      </g>
   3.128      <g
   3.129         id="g8107"
   3.130 -       transform="translate(-60,-180)"
   3.131 -       style="stroke-width:1.8;stroke-miterlimit:4;stroke-dasharray:none">
   3.132 +       transform="translate(-60,-162)"
   3.133 +       style="stroke-width:1.79999995;stroke-miterlimit:4;stroke-dasharray:none">
   3.134        <path
   3.135           inkscape:connector-curvature="0"
   3.136           id="path8109"
   3.137           d="m 378.82881,392.77746 c 0,19.15152 0,19.15152 0,19.15152"
   3.138 -         style="fill:none;stroke:#000000;stroke-width:1.8;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;stroke-miterlimit:4;stroke-dasharray:none" />
   3.139 +         style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none" />
   3.140        <text
   3.141           xml:space="preserve"
   3.142           style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   3.143 @@ -242,205 +232,23 @@
   3.144         xml:space="preserve"
   3.145         style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000080;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   3.146         x="352.7023"
   3.147 -       y="225.27441"
   3.148 +       y="243.27441"
   3.149         id="text8119"
   3.150         sodipodi:linespacing="100%"><tspan
   3.151           id="tspan8121"
   3.152           sodipodi:role="line"
   3.153           x="352.7023"
   3.154 -         y="225.27441">Timeline A</tspan></text>
   3.155 +         y="243.27441">Timeline A</tspan></text>
   3.156      <path
   3.157         id="path8123"
   3.158 -       d="m 320.08408,221.37498 c 27.45405,0 27.45405,0 27.45405,0"
   3.159 -       style="fill:none;stroke:#422fac;stroke-width:1.8;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-end:url(#Arrow2Mend);stroke-miterlimit:4;stroke-dasharray:none"
   3.160 +       d="m 320.08408,239.37498 c 27.45405,0 27.45405,0 27.45405,0"
   3.161 +       style="fill:none;stroke:#422fac;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)"
   3.162         inkscape:connector-curvature="0" />
   3.163      <path
   3.164 -       style="fill:none;stroke:#000000;stroke-width:0.99999994;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:2.99999998, 2.99999998;stroke-dashoffset:0;marker-end:url(#Arrow2Mend)"
   3.165 -       d="m 292.57011,280.15667 c 1.60737,-35.06333 -0.1867,-13.69014 2.41106,-33.11537 1.74808,-13.07166 19.28851,-14.93437 19.28851,-14.93437"
   3.166 +       style="fill:none;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:2.99999998, 2.99999998;stroke-dashoffset:0;marker-end:url(#Arrow2Mend)"
   3.167 +       d="m 292.57011,280.15667 c 1.60737,-29.22166 -0.1867,-11.40932 2.41106,-27.59824 1.74808,-10.89388 19.28851,-12.44626 19.28851,-12.44626"
   3.168         id="path8125"
   3.169         inkscape:connector-curvature="0"
   3.170         sodipodi:nodetypes="csc" />
   3.171 -    <path
   3.172 -       sodipodi:nodetypes="csc"
   3.173 -       inkscape:connector-curvature="0"
   3.174 -       id="path5550"
   3.175 -       d="m 239.09804,401.95213 c 23.67157,4.34238 9.24233,-0.50438 22.35648,6.51358 8.8248,4.72253 10.08233,52.10878 10.08233,52.10878"
   3.176 -       style="fill:none;stroke:#000000;stroke-width:0.99999982;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:3.00000004, 3.00000004;stroke-dashoffset:0;marker-end:url(#Arrow2Mend)" />
   3.177 -    <path
   3.178 -       inkscape:connector-curvature="0"
   3.179 -       style="fill:#800000;stroke:#800000;stroke-width:1.80000007;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:none"
   3.180 -       d="m 196.98465,461.37498 c 69.82336,0 69.82336,0 69.82336,0"
   3.181 -       id="path5552" />
   3.182 -    <g
   3.183 -       style="stroke-width:1.79999995;stroke-miterlimit:4;stroke-dasharray:none"
   3.184 -       id="g5554"
   3.185 -       transform="translate(-32,60)">
   3.186 -      <path
   3.187 -         inkscape:connector-curvature="0"
   3.188 -         id="path5556"
   3.189 -         d="m 298.82881,392.82004 c 0,19.38279 0,19.38279 0,19.38279"
   3.190 -         style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none" />
   3.191 -      <text
   3.192 -         xml:space="preserve"
   3.193 -         style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   3.194 -         x="298.7023"
   3.195 -         y="376.52615"
   3.196 -         id="text5558"
   3.197 -         sodipodi:linespacing="100%"><tspan
   3.198 -           style="font-size:10px;text-align:center;text-anchor:middle"
   3.199 -           sodipodi:role="line"
   3.200 -           id="tspan5560"
   3.201 -           x="298.7023"
   3.202 -           y="376.52615">Suspend</tspan><tspan
   3.203 -           style="font-size:9px;text-align:center;text-anchor:middle"
   3.204 -           id="tspan5562"
   3.205 -           sodipodi:role="line"
   3.206 -           x="298.7023"
   3.207 -           y="385.74353">(Point 2.S)</tspan></text>
   3.208 -    </g>
   3.209 -    <g
   3.210 -       style="stroke-width:1.79999995;stroke-miterlimit:4;stroke-dasharray:none"
   3.211 -       id="g5564"
   3.212 -       transform="translate(-60,60)">
   3.213 -      <path
   3.214 -         inkscape:connector-curvature="0"
   3.215 -         id="path5566"
   3.216 -         d="m 378.82881,392.77746 c 0,19.15152 0,19.15152 0,19.15152"
   3.217 -         style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none" />
   3.218 -      <text
   3.219 -         xml:space="preserve"
   3.220 -         style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   3.221 -         x="378.7023"
   3.222 -         y="376.52615"
   3.223 -         id="text5568"
   3.224 -         sodipodi:linespacing="100%"><tspan
   3.225 -           style="font-size:9px;text-align:center;text-anchor:middle"
   3.226 -           sodipodi:role="line"
   3.227 -           id="tspan5570"
   3.228 -           x="380.20621"
   3.229 -           y="376.52615"><tspan
   3.230 -             id="tspan5572"
   3.231 -             style="font-size:10px">Resume </tspan></tspan><tspan
   3.232 -           style="font-size:9px;text-align:center;text-anchor:middle"
   3.233 -           id="tspan5574"
   3.234 -           sodipodi:role="line"
   3.235 -           x="378.7023"
   3.236 -           y="385.74353">(Point 2.R)</tspan></text>
   3.237 -    </g>
   3.238 -    <text
   3.239 -       xml:space="preserve"
   3.240 -       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#800000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   3.241 -       x="352.7023"
   3.242 -       y="465.27441"
   3.243 -       id="text5576"
   3.244 -       sodipodi:linespacing="100%"><tspan
   3.245 -         id="tspan5578"
   3.246 -         sodipodi:role="line"
   3.247 -         x="352.7023"
   3.248 -         y="465.27441">Timeline B</tspan></text>
   3.249 -    <path
   3.250 -       id="path5580"
   3.251 -       d="m 320.08408,461.37498 c 27.45405,0 27.45405,0 27.45405,0"
   3.252 -       style="fill:none;stroke:#800000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)"
   3.253 -       inkscape:connector-curvature="0" />
   3.254 -    <path
   3.255 -       inkscape:connector-curvature="0"
   3.256 -       style="fill:#000000;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)"
   3.257 -       d="m 195.41471,497.37498 c 151.68424,0 151.68424,0 151.68424,0"
   3.258 -       id="path5582" />
   3.259 -    <text
   3.260 -       sodipodi:linespacing="100%"
   3.261 -       id="text5584"
   3.262 -       y="500.02267"
   3.263 -       x="352.7023"
   3.264 -       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   3.265 -       xml:space="preserve"><tspan
   3.266 -         y="500.02267"
   3.267 -         x="352.7023"
   3.268 -         sodipodi:role="line"
   3.269 -         id="tspan5586">Physical time</tspan></text>
   3.270 -    <path
   3.271 -       id="path5588"
   3.272 -       d="m 195.92204,401.37498 c 33.06652,0 33.06652,0 33.06652,0"
   3.273 -       style="fill:none;stroke:#422fac;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:none"
   3.274 -       inkscape:connector-curvature="0" />
   3.275 -    <g
   3.276 -       style="stroke-width:1.79999995;stroke-miterlimit:4;stroke-dasharray:none"
   3.277 -       transform="translate(-70,0)"
   3.278 -       id="g5590">
   3.279 -      <path
   3.280 -         style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
   3.281 -         d="m 298.82881,392.82004 c 0,19.38279 0,19.38279 0,19.38279"
   3.282 -         id="path5592"
   3.283 -         inkscape:connector-curvature="0" />
   3.284 -      <text
   3.285 -         sodipodi:linespacing="100%"
   3.286 -         id="text5594"
   3.287 -         y="376.52615"
   3.288 -         x="298.7023"
   3.289 -         style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   3.290 -         xml:space="preserve"><tspan
   3.291 -           y="376.52615"
   3.292 -           x="298.7023"
   3.293 -           id="tspan5596"
   3.294 -           sodipodi:role="line"
   3.295 -           style="font-size:10px;text-align:center;text-anchor:middle">Suspend</tspan><tspan
   3.296 -           y="385.74353"
   3.297 -           x="298.7023"
   3.298 -           sodipodi:role="line"
   3.299 -           id="tspan5598"
   3.300 -           style="font-size:9px;text-align:center;text-anchor:middle">(Point 1.S)</tspan></text>
   3.301 -    </g>
   3.302 -    <g
   3.303 -       style="stroke-width:1.79999995;stroke-miterlimit:4;stroke-dasharray:none"
   3.304 -       transform="translate(-60,0)"
   3.305 -       id="g5600">
   3.306 -      <path
   3.307 -         style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
   3.308 -         d="m 378.82881,392.77746 c 0,19.15152 0,19.15152 0,19.15152"
   3.309 -         id="path5602"
   3.310 -         inkscape:connector-curvature="0" />
   3.311 -      <text
   3.312 -         sodipodi:linespacing="100%"
   3.313 -         id="text5604"
   3.314 -         y="376.52615"
   3.315 -         x="378.7023"
   3.316 -         style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   3.317 -         xml:space="preserve"><tspan
   3.318 -           y="376.52615"
   3.319 -           x="380.20621"
   3.320 -           id="tspan5606"
   3.321 -           sodipodi:role="line"
   3.322 -           style="font-size:9px;text-align:center;text-anchor:middle"><tspan
   3.323 -             style="font-size:10px"
   3.324 -             id="tspan5608">Resume </tspan></tspan><tspan
   3.325 -           y="385.74353"
   3.326 -           x="378.7023"
   3.327 -           sodipodi:role="line"
   3.328 -           id="tspan5610"
   3.329 -           style="font-size:9px;text-align:center;text-anchor:middle">(Point 1.R)</tspan></text>
   3.330 -    </g>
   3.331 -    <text
   3.332 -       sodipodi:linespacing="100%"
   3.333 -       id="text5612"
   3.334 -       y="405.27441"
   3.335 -       x="352.7023"
   3.336 -       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000080;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   3.337 -       xml:space="preserve"><tspan
   3.338 -         y="405.27441"
   3.339 -         x="352.7023"
   3.340 -         sodipodi:role="line"
   3.341 -         id="tspan5614">Timeline A</tspan></text>
   3.342 -    <path
   3.343 -       inkscape:connector-curvature="0"
   3.344 -       style="fill:none;stroke:#422fac;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)"
   3.345 -       d="m 320.08408,401.37498 c 27.45405,0 27.45405,0 27.45405,0"
   3.346 -       id="path5616" />
   3.347 -    <path
   3.348 -       sodipodi:nodetypes="csc"
   3.349 -       inkscape:connector-curvature="0"
   3.350 -       id="path5618"
   3.351 -       d="m 292.57011,460.15667 c 1.60737,-35.06333 -0.1867,-13.69014 2.41106,-33.11537 1.74808,-13.07166 19.28851,-14.93437 19.28851,-14.93437"
   3.352 -       style="fill:none;stroke:#000000;stroke-width:0.99999994;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:2.99999998, 2.99999998;stroke-dashoffset:0;marker-end:url(#Arrow2Mend)" />
   3.353    </g>
   3.354  </svg>
     4.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     4.2 +++ b/0__Papers/PRT/PRT__formal_def/figures/PR__timeline_dual_three_versions.svg	Tue Sep 17 06:30:06 2013 -0700
     4.3 @@ -0,0 +1,754 @@
     4.4 +<?xml version="1.0" encoding="UTF-8" standalone="no"?>
     4.5 +<!-- Created with Inkscape (http://www.inkscape.org/) -->
     4.6 +
     4.7 +<svg
     4.8 +   xmlns:dc="http://purl.org/dc/elements/1.1/"
     4.9 +   xmlns:cc="http://creativecommons.org/ns#"
    4.10 +   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    4.11 +   xmlns:svg="http://www.w3.org/2000/svg"
    4.12 +   xmlns="http://www.w3.org/2000/svg"
    4.13 +   xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
    4.14 +   xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
    4.15 +   width="744.09448819"
    4.16 +   height="1052.3622047"
    4.17 +   id="svg2"
    4.18 +   sodipodi:version="0.32"
    4.19 +   inkscape:version="0.48.2 r9819"
    4.20 +   sodipodi:docname="PR__timeline_dual.svg"
    4.21 +   inkscape:output_extension="org.inkscape.output.svg.inkscape"
    4.22 +   version="1.1">
    4.23 +  <defs
    4.24 +     id="defs4">
    4.25 +    <marker
    4.26 +       inkscape:stockid="Arrow2Send"
    4.27 +       orient="auto"
    4.28 +       refY="0.0"
    4.29 +       refX="0.0"
    4.30 +       id="Arrow2Send"
    4.31 +       style="overflow:visible;">
    4.32 +      <path
    4.33 +         id="path4262"
    4.34 +         style="font-size:12.0;fill-rule:evenodd;stroke-width:0.62500000;stroke-linejoin:round;"
    4.35 +         d="M 8.7185878,4.0337352 L -2.2072895,0.016013256 L 8.7185884,-4.0017078 C 6.9730900,-1.6296469 6.9831476,1.6157441 8.7185878,4.0337352 z "
    4.36 +         transform="scale(0.3) rotate(180) translate(-2.3,0)" />
    4.37 +    </marker>
    4.38 +    <marker
    4.39 +       inkscape:stockid="Arrow1Mend"
    4.40 +       orient="auto"
    4.41 +       refY="0.0"
    4.42 +       refX="0.0"
    4.43 +       id="Arrow1Mend"
    4.44 +       style="overflow:visible;">
    4.45 +      <path
    4.46 +         id="path4238"
    4.47 +         d="M 0.0,0.0 L 5.0,-5.0 L -12.5,0.0 L 5.0,5.0 L 0.0,0.0 z "
    4.48 +         style="fill-rule:evenodd;stroke:#000000;stroke-width:1.0pt;marker-start:none;"
    4.49 +         transform="scale(0.4) rotate(180) translate(10,0)" />
    4.50 +    </marker>
    4.51 +    <marker
    4.52 +       inkscape:stockid="Arrow2Mend"
    4.53 +       orient="auto"
    4.54 +       refY="0.0"
    4.55 +       refX="0.0"
    4.56 +       id="Arrow2Mend"
    4.57 +       style="overflow:visible;">
    4.58 +      <path
    4.59 +         id="path4008"
    4.60 +         style="font-size:12.0;fill-rule:evenodd;stroke-width:0.62500000;stroke-linejoin:round;"
    4.61 +         d="M 8.7185878,4.0337352 L -2.2072895,0.016013256 L 8.7185884,-4.0017078 C 6.9730900,-1.6296469 6.9831476,1.6157441 8.7185878,4.0337352 z "
    4.62 +         transform="scale(0.6) rotate(180) translate(0,0)" />
    4.63 +    </marker>
    4.64 +    <inkscape:perspective
    4.65 +       sodipodi:type="inkscape:persp3d"
    4.66 +       inkscape:vp_x="0 : 526.18109 : 1"
    4.67 +       inkscape:vp_y="0 : 1000 : 0"
    4.68 +       inkscape:vp_z="744.09448 : 526.18109 : 1"
    4.69 +       inkscape:persp3d-origin="372.04724 : 350.78739 : 1"
    4.70 +       id="perspective10" />
    4.71 +    <inkscape:perspective
    4.72 +       id="perspective11923"
    4.73 +       inkscape:persp3d-origin="0.5 : 0.33333333 : 1"
    4.74 +       inkscape:vp_z="1 : 0.5 : 1"
    4.75 +       inkscape:vp_y="0 : 1000 : 0"
    4.76 +       inkscape:vp_x="0 : 0.5 : 1"
    4.77 +       sodipodi:type="inkscape:persp3d" />
    4.78 +  </defs>
    4.79 +  <sodipodi:namedview
    4.80 +     id="base"
    4.81 +     pagecolor="#ffffff"
    4.82 +     bordercolor="#666666"
    4.83 +     borderopacity="1.0"
    4.84 +     gridtolerance="10000"
    4.85 +     guidetolerance="10"
    4.86 +     objecttolerance="10"
    4.87 +     inkscape:pageopacity="0.0"
    4.88 +     inkscape:pageshadow="2"
    4.89 +     inkscape:zoom="1.3364318"
    4.90 +     inkscape:cx="214.9176"
    4.91 +     inkscape:cy="612.44308"
    4.92 +     inkscape:document-units="px"
    4.93 +     inkscape:current-layer="layer1"
    4.94 +     showgrid="false"
    4.95 +     inkscape:window-width="1317"
    4.96 +     inkscape:window-height="878"
    4.97 +     inkscape:window-x="7"
    4.98 +     inkscape:window-y="1"
    4.99 +     inkscape:window-maximized="0" />
   4.100 +  <metadata
   4.101 +     id="metadata7">
   4.102 +    <rdf:RDF>
   4.103 +      <cc:Work
   4.104 +         rdf:about="">
   4.105 +        <dc:format>image/svg+xml</dc:format>
   4.106 +        <dc:type
   4.107 +           rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
   4.108 +        <dc:title></dc:title>
   4.109 +      </cc:Work>
   4.110 +    </rdf:RDF>
   4.111 +  </metadata>
   4.112 +  <g
   4.113 +     inkscape:label="Layer 1"
   4.114 +     inkscape:groupmode="layer"
   4.115 +     id="layer1">
   4.116 +    <path
   4.117 +       id="path7353"
   4.118 +       d="m 196.98465,281.37498 c 69.82336,0 69.82336,0 69.82336,0"
   4.119 +       style="fill:#800000;stroke:#800000;stroke-width:1.80000007;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:none"
   4.120 +       inkscape:connector-curvature="0" />
   4.121 +    <g
   4.122 +       transform="translate(-32,-120)"
   4.123 +       id="g7355"
   4.124 +       style="stroke-width:1.79999995;stroke-miterlimit:4;stroke-dasharray:none">
   4.125 +      <path
   4.126 +         style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
   4.127 +         d="m 298.82881,392.82004 c 0,19.38279 0,19.38279 0,19.38279"
   4.128 +         id="path7357"
   4.129 +         inkscape:connector-curvature="0" />
   4.130 +      <text
   4.131 +         sodipodi:linespacing="100%"
   4.132 +         id="text7359"
   4.133 +         y="376.52615"
   4.134 +         x="298.7023"
   4.135 +         style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.136 +         xml:space="preserve"><tspan
   4.137 +           y="376.52615"
   4.138 +           x="298.7023"
   4.139 +           id="tspan7361"
   4.140 +           sodipodi:role="line"
   4.141 +           style="font-size:10px;text-align:center;text-anchor:middle">Suspend</tspan><tspan
   4.142 +           y="385.74353"
   4.143 +           x="298.7023"
   4.144 +           sodipodi:role="line"
   4.145 +           id="tspan7363"
   4.146 +           style="font-size:9px;text-align:center;text-anchor:middle">(Point 2.S)</tspan></text>
   4.147 +    </g>
   4.148 +    <g
   4.149 +       transform="translate(-60,-120)"
   4.150 +       id="g7365"
   4.151 +       style="stroke-width:1.8;stroke-miterlimit:4;stroke-dasharray:none">
   4.152 +      <path
   4.153 +         style="fill:none;stroke:#000000;stroke-width:1.8;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;stroke-miterlimit:4;stroke-dasharray:none"
   4.154 +         d="m 378.82881,392.77746 c 0,19.15152 0,19.15152 0,19.15152"
   4.155 +         id="path7367"
   4.156 +         inkscape:connector-curvature="0" />
   4.157 +      <text
   4.158 +         sodipodi:linespacing="100%"
   4.159 +         id="text7369"
   4.160 +         y="376.52615"
   4.161 +         x="378.7023"
   4.162 +         style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.163 +         xml:space="preserve"><tspan
   4.164 +           y="376.52615"
   4.165 +           x="380.20621"
   4.166 +           id="tspan7371"
   4.167 +           sodipodi:role="line"
   4.168 +           style="font-size:9px;text-align:center;text-anchor:middle"><tspan
   4.169 +             style="font-size:10px"
   4.170 +             id="tspan8087">Resume </tspan></tspan><tspan
   4.171 +           y="385.74353"
   4.172 +           x="378.7023"
   4.173 +           sodipodi:role="line"
   4.174 +           id="tspan7373"
   4.175 +           style="font-size:9px;text-align:center;text-anchor:middle">(Point 2.R)</tspan></text>
   4.176 +    </g>
   4.177 +    <text
   4.178 +       sodipodi:linespacing="100%"
   4.179 +       id="text7375"
   4.180 +       y="285.27441"
   4.181 +       x="352.7023"
   4.182 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#800000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.183 +       xml:space="preserve"><tspan
   4.184 +         y="285.27441"
   4.185 +         x="352.7023"
   4.186 +         sodipodi:role="line"
   4.187 +         id="tspan7379">Timeline B</tspan></text>
   4.188 +    <path
   4.189 +       inkscape:connector-curvature="0"
   4.190 +       style="fill:none;stroke:#800000;stroke-width:1.80000000000000000;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-end:url(#Arrow2Mend);stroke-miterlimit:4;stroke-dasharray:none"
   4.191 +       d="m 320.08408,281.37498 c 27.45405,0 27.45405,0 27.45405,0"
   4.192 +       id="path7387" />
   4.193 +    <path
   4.194 +       id="path8089"
   4.195 +       d="m 195.41471,317.37498 c 151.68424,0 151.68424,0 151.68424,0"
   4.196 +       style="fill:#000000;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)"
   4.197 +       inkscape:connector-curvature="0" />
   4.198 +    <text
   4.199 +       xml:space="preserve"
   4.200 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.201 +       x="352.7023"
   4.202 +       y="320.02267"
   4.203 +       id="text8091"
   4.204 +       sodipodi:linespacing="100%"><tspan
   4.205 +         id="tspan8093"
   4.206 +         sodipodi:role="line"
   4.207 +         x="352.7023"
   4.208 +         y="320.02267">Physical time</tspan></text>
   4.209 +    <path
   4.210 +       inkscape:connector-curvature="0"
   4.211 +       style="fill:none;stroke:#422fac;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:none"
   4.212 +       d="m 195.92204,221.37498 c 33.06652,0 33.06652,0 33.06652,0"
   4.213 +       id="path8095" />
   4.214 +    <g
   4.215 +       id="g8097"
   4.216 +       transform="translate(-70,-180)"
   4.217 +       style="stroke-width:1.79999995;stroke-miterlimit:4;stroke-dasharray:none">
   4.218 +      <path
   4.219 +         inkscape:connector-curvature="0"
   4.220 +         id="path8099"
   4.221 +         d="m 298.82881,392.82004 c 0,19.38279 0,19.38279 0,19.38279"
   4.222 +         style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none" />
   4.223 +      <text
   4.224 +         xml:space="preserve"
   4.225 +         style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.226 +         x="298.7023"
   4.227 +         y="376.52615"
   4.228 +         id="text8101"
   4.229 +         sodipodi:linespacing="100%"><tspan
   4.230 +           style="font-size:10px;text-align:center;text-anchor:middle"
   4.231 +           sodipodi:role="line"
   4.232 +           id="tspan8103"
   4.233 +           x="298.7023"
   4.234 +           y="376.52615">Suspend</tspan><tspan
   4.235 +           style="font-size:9px;text-align:center;text-anchor:middle"
   4.236 +           id="tspan8105"
   4.237 +           sodipodi:role="line"
   4.238 +           x="298.7023"
   4.239 +           y="385.74353">(Point 1.S)</tspan></text>
   4.240 +    </g>
   4.241 +    <g
   4.242 +       id="g8107"
   4.243 +       transform="translate(-60,-180)"
   4.244 +       style="stroke-width:1.8;stroke-miterlimit:4;stroke-dasharray:none">
   4.245 +      <path
   4.246 +         inkscape:connector-curvature="0"
   4.247 +         id="path8109"
   4.248 +         d="m 378.82881,392.77746 c 0,19.15152 0,19.15152 0,19.15152"
   4.249 +         style="fill:none;stroke:#000000;stroke-width:1.8;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;stroke-miterlimit:4;stroke-dasharray:none" />
   4.250 +      <text
   4.251 +         xml:space="preserve"
   4.252 +         style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.253 +         x="378.7023"
   4.254 +         y="376.52615"
   4.255 +         id="text8111"
   4.256 +         sodipodi:linespacing="100%"><tspan
   4.257 +           style="font-size:9px;text-align:center;text-anchor:middle"
   4.258 +           sodipodi:role="line"
   4.259 +           id="tspan8113"
   4.260 +           x="380.20621"
   4.261 +           y="376.52615"><tspan
   4.262 +             id="tspan8115"
   4.263 +             style="font-size:10px">Resume </tspan></tspan><tspan
   4.264 +           style="font-size:9px;text-align:center;text-anchor:middle"
   4.265 +           id="tspan8117"
   4.266 +           sodipodi:role="line"
   4.267 +           x="378.7023"
   4.268 +           y="385.74353">(Point 1.R)</tspan></text>
   4.269 +    </g>
   4.270 +    <text
   4.271 +       xml:space="preserve"
   4.272 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000080;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.273 +       x="352.7023"
   4.274 +       y="225.27441"
   4.275 +       id="text8119"
   4.276 +       sodipodi:linespacing="100%"><tspan
   4.277 +         id="tspan8121"
   4.278 +         sodipodi:role="line"
   4.279 +         x="352.7023"
   4.280 +         y="225.27441">Timeline A</tspan></text>
   4.281 +    <path
   4.282 +       id="path8123"
   4.283 +       d="m 320.08408,221.37498 c 27.45405,0 27.45405,0 27.45405,0"
   4.284 +       style="fill:none;stroke:#422fac;stroke-width:1.8;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-end:url(#Arrow2Mend);stroke-miterlimit:4;stroke-dasharray:none"
   4.285 +       inkscape:connector-curvature="0" />
   4.286 +    <path
   4.287 +       style="fill:none;stroke:#000000;stroke-width:0.99999994;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:2.99999998, 2.99999998;stroke-dashoffset:0;marker-end:url(#Arrow2Mend)"
   4.288 +       d="m 292.57011,280.15667 c 1.60737,-35.06333 -0.1867,-13.69014 2.41106,-33.11537 1.74808,-13.07166 19.28851,-14.93437 19.28851,-14.93437"
   4.289 +       id="path8125"
   4.290 +       inkscape:connector-curvature="0"
   4.291 +       sodipodi:nodetypes="csc" />
   4.292 +    <path
   4.293 +       inkscape:connector-curvature="0"
   4.294 +       style="fill:#800000;stroke:#800000;stroke-width:1.80000007;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:none"
   4.295 +       d="m 195.48813,523.37498 c 69.82336,0 69.82336,0 69.82336,0"
   4.296 +       id="path5552" />
   4.297 +    <path
   4.298 +       style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
   4.299 +       d="m 266.82881,514.82004 c 0,19.38279 0,19.38279 0,19.38279"
   4.300 +       id="path5556"
   4.301 +       inkscape:connector-curvature="0" />
   4.302 +    <text
   4.303 +       sodipodi:linespacing="100%"
   4.304 +       id="text5558"
   4.305 +       y="540.52612"
   4.306 +       x="264.7023"
   4.307 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.308 +       xml:space="preserve"><tspan
   4.309 +         y="540.52612"
   4.310 +         x="264.7023"
   4.311 +         id="tspan5560"
   4.312 +         sodipodi:role="line"
   4.313 +         style="font-size:10px;text-align:center;text-anchor:middle">Suspend</tspan><tspan
   4.314 +         y="549.74353"
   4.315 +         x="264.7023"
   4.316 +         sodipodi:role="line"
   4.317 +         id="tspan5562"
   4.318 +         style="font-size:9px;text-align:center;text-anchor:middle">(Point 2.S)</tspan></text>
   4.319 +    <path
   4.320 +       style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
   4.321 +       d="m 318.82881,514.77746 c 0,19.15152 0,19.15152 0,19.15152"
   4.322 +       id="path5566"
   4.323 +       inkscape:connector-curvature="0" />
   4.324 +    <text
   4.325 +       sodipodi:linespacing="100%"
   4.326 +       id="text5568"
   4.327 +       y="540.52612"
   4.328 +       x="320.7023"
   4.329 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.330 +       xml:space="preserve"><tspan
   4.331 +         y="540.52612"
   4.332 +         x="322.20621"
   4.333 +         id="tspan5570"
   4.334 +         sodipodi:role="line"
   4.335 +         style="font-size:9px;text-align:center;text-anchor:middle"><tspan
   4.336 +           style="font-size:10px"
   4.337 +           id="tspan5572">Resume </tspan></tspan><tspan
   4.338 +         y="549.74353"
   4.339 +         x="320.7023"
   4.340 +         sodipodi:role="line"
   4.341 +         id="tspan5574"
   4.342 +         style="font-size:9px;text-align:center;text-anchor:middle">(Point 2.R)</tspan></text>
   4.343 +    <text
   4.344 +       xml:space="preserve"
   4.345 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#800000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.346 +       x="354.7023"
   4.347 +       y="527.27441"
   4.348 +       id="text5576"
   4.349 +       sodipodi:linespacing="100%"><tspan
   4.350 +         id="tspan5578"
   4.351 +         sodipodi:role="line"
   4.352 +         x="354.7023"
   4.353 +         y="527.27441">Timeline B</tspan></text>
   4.354 +    <path
   4.355 +       id="path5580"
   4.356 +       d="m 320.08409,523.37498 c 28.16395,0 28.16395,0 28.16395,0"
   4.357 +       style="fill:none;stroke:#800000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)"
   4.358 +       inkscape:connector-curvature="0" />
   4.359 +    <path
   4.360 +       inkscape:connector-curvature="0"
   4.361 +       style="fill:#000000;stroke:#000000;stroke-width:1.80000007;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)"
   4.362 +       d="m 195.41472,559.37498 c 153.16627,0 153.16627,0 153.16627,0"
   4.363 +       id="path5582" />
   4.364 +    <text
   4.365 +       sodipodi:linespacing="100%"
   4.366 +       id="text5584"
   4.367 +       y="562.02271"
   4.368 +       x="354.05777"
   4.369 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.370 +       xml:space="preserve"><tspan
   4.371 +         y="562.02271"
   4.372 +         x="354.05777"
   4.373 +         sodipodi:role="line"
   4.374 +         id="tspan5586">Physical time</tspan></text>
   4.375 +    <path
   4.376 +       id="path5588"
   4.377 +       d="m 195.17378,437.37498 c 33.06652,0 33.06652,0 33.06652,0"
   4.378 +       style="fill:none;stroke:#422fac;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:none"
   4.379 +       inkscape:connector-curvature="0" />
   4.380 +    <g
   4.381 +       style="stroke-width:1.79999995;stroke-miterlimit:4;stroke-dasharray:none"
   4.382 +       transform="translate(-70,36)"
   4.383 +       id="g5590">
   4.384 +      <path
   4.385 +         style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
   4.386 +         d="m 298.82881,392.82004 c 0,19.38279 0,19.38279 0,19.38279"
   4.387 +         id="path5592"
   4.388 +         inkscape:connector-curvature="0" />
   4.389 +      <text
   4.390 +         sodipodi:linespacing="100%"
   4.391 +         id="text5594"
   4.392 +         y="376.52615"
   4.393 +         x="298.7023"
   4.394 +         style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.395 +         xml:space="preserve"><tspan
   4.396 +           y="376.52615"
   4.397 +           x="298.7023"
   4.398 +           id="tspan5596"
   4.399 +           sodipodi:role="line"
   4.400 +           style="font-size:10px;text-align:center;text-anchor:middle">Suspend</tspan><tspan
   4.401 +           y="385.74353"
   4.402 +           x="298.7023"
   4.403 +           sodipodi:role="line"
   4.404 +           id="tspan5598"
   4.405 +           style="font-size:9px;text-align:center;text-anchor:middle">(Point 1.S)</tspan></text>
   4.406 +    </g>
   4.407 +    <g
   4.408 +       style="stroke-width:1.79999995;stroke-miterlimit:4;stroke-dasharray:none"
   4.409 +       transform="translate(-60,36)"
   4.410 +       id="g5600">
   4.411 +      <path
   4.412 +         style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
   4.413 +         d="m 378.82881,392.77746 c 0,19.15152 0,19.15152 0,19.15152"
   4.414 +         id="path5602"
   4.415 +         inkscape:connector-curvature="0" />
   4.416 +      <text
   4.417 +         sodipodi:linespacing="100%"
   4.418 +         id="text5604"
   4.419 +         y="376.52615"
   4.420 +         x="378.7023"
   4.421 +         style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.422 +         xml:space="preserve"><tspan
   4.423 +           y="376.52615"
   4.424 +           x="380.20621"
   4.425 +           id="tspan5606"
   4.426 +           sodipodi:role="line"
   4.427 +           style="font-size:9px;text-align:center;text-anchor:middle"><tspan
   4.428 +             style="font-size:10px"
   4.429 +             id="tspan5608">Resume </tspan></tspan><tspan
   4.430 +           y="385.74353"
   4.431 +           x="378.7023"
   4.432 +           sodipodi:role="line"
   4.433 +           id="tspan5610"
   4.434 +           style="font-size:9px;text-align:center;text-anchor:middle">(Point 1.R)</tspan></text>
   4.435 +    </g>
   4.436 +    <text
   4.437 +       sodipodi:linespacing="100%"
   4.438 +       id="text5612"
   4.439 +       y="441.27441"
   4.440 +       x="354.7023"
   4.441 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000080;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.442 +       xml:space="preserve"><tspan
   4.443 +         y="441.27441"
   4.444 +         x="354.7023"
   4.445 +         sodipodi:role="line"
   4.446 +         id="tspan5614">Timeline A</tspan></text>
   4.447 +    <path
   4.448 +       inkscape:connector-curvature="0"
   4.449 +       style="fill:none;stroke:#422fac;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)"
   4.450 +       d="m 320.08409,437.37498 c 28.16395,0 28.16395,0 28.16395,0"
   4.451 +       id="path5616" />
   4.452 +    <path
   4.453 +       inkscape:connector-curvature="0"
   4.454 +       style="fill:#ff0000;stroke:#ff0000;stroke-width:1.80000007;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:3.60000016, 3.60000016;stroke-dashoffset:0;marker-end:url(#Arrow2Mend)"
   4.455 +       d="m 196.11806,483.37498 c 152.64336,0 152.64336,0 152.64336,0"
   4.456 +       id="path3063" />
   4.457 +    <path
   4.458 +       style="fill:none;stroke:#000000;stroke-width:1.80000007;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:3.60000001, 3.60000001;stroke-dashoffset:0;marker-end:url(#Arrow2Send)"
   4.459 +       d="m 228.82881,449.32353 c 0,29.78359 0,29.78359 0,29.78359"
   4.460 +       id="path3086"
   4.461 +       inkscape:connector-curvature="0" />
   4.462 +    <path
   4.463 +       inkscape:connector-curvature="0"
   4.464 +       id="path5044"
   4.465 +       d="m 266.82881,516.24027 c 0,-29.74405 0,-29.74405 0,-29.74405"
   4.466 +       style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:3.60000002, 3.60000002;stroke-dashoffset:0;marker-end:url(#Arrow2Send)" />
   4.467 +    <path
   4.468 +       style="fill:none;stroke:#000000;stroke-width:1.5;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)"
   4.469 +       d="m 293.31837,481.43892 c 3.87039,-15.03735 4.2342,-21.56492 7.28321,-26.28454 5.73916,-8.88373 15.91289,-10.38025 15.91289,-10.38025"
   4.470 +       id="path5048"
   4.471 +       inkscape:connector-curvature="0"
   4.472 +       sodipodi:nodetypes="csc" />
   4.473 +    <path
   4.474 +       sodipodi:nodetypes="csc"
   4.475 +       inkscape:connector-curvature="0"
   4.476 +       id="path5608"
   4.477 +       d="m 301.54925,484.53107 c 2.49703,15.03735 2.73174,21.56492 4.69884,26.28454 3.70269,8.88373 10.26639,10.38025 10.26639,10.38025"
   4.478 +       style="fill:none;stroke:#000000;stroke-width:1.5;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)" />
   4.479 +    <path
   4.480 +       id="path5610"
   4.481 +       d="m 196.98465,751.37498 c 69.82336,0 69.82336,0 69.82336,0"
   4.482 +       style="fill:#800000;stroke:#800000;stroke-width:1.80000007;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:none"
   4.483 +       inkscape:connector-curvature="0" />
   4.484 +    <path
   4.485 +       inkscape:connector-curvature="0"
   4.486 +       id="path5612"
   4.487 +       d="m 266.82881,742.82004 c 0,19.38279 0,19.38279 0,19.38279"
   4.488 +       style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none" />
   4.489 +    <text
   4.490 +       xml:space="preserve"
   4.491 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.492 +       x="264.7023"
   4.493 +       y="768.52612"
   4.494 +       id="text5614"
   4.495 +       sodipodi:linespacing="100%"><tspan
   4.496 +         style="font-size:10px;text-align:center;text-anchor:middle"
   4.497 +         sodipodi:role="line"
   4.498 +         id="tspan5616"
   4.499 +         x="264.7023"
   4.500 +         y="768.52612">Suspend</tspan><tspan
   4.501 +         style="font-size:9px;text-align:center;text-anchor:middle"
   4.502 +         id="tspan5618"
   4.503 +         sodipodi:role="line"
   4.504 +         x="264.7023"
   4.505 +         y="777.74353">(Point 2.S)</tspan></text>
   4.506 +    <path
   4.507 +       inkscape:connector-curvature="0"
   4.508 +       id="path5620"
   4.509 +       d="m 318.82881,742.77746 c 0,19.15152 0,19.15152 0,19.15152"
   4.510 +       style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none" />
   4.511 +    <text
   4.512 +       xml:space="preserve"
   4.513 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.514 +       x="320.7023"
   4.515 +       y="768.52612"
   4.516 +       id="text5622"
   4.517 +       sodipodi:linespacing="100%"><tspan
   4.518 +         style="font-size:9px;text-align:center;text-anchor:middle"
   4.519 +         sodipodi:role="line"
   4.520 +         id="tspan5624"
   4.521 +         x="322.20621"
   4.522 +         y="768.52612"><tspan
   4.523 +           id="tspan5626"
   4.524 +           style="font-size:10px">Resume </tspan></tspan><tspan
   4.525 +         style="font-size:9px;text-align:center;text-anchor:middle"
   4.526 +         id="tspan5628"
   4.527 +         sodipodi:role="line"
   4.528 +         x="320.7023"
   4.529 +         y="777.74353">(Point 2.R)</tspan></text>
   4.530 +    <text
   4.531 +       sodipodi:linespacing="100%"
   4.532 +       id="text5630"
   4.533 +       y="755.27441"
   4.534 +       x="352.7023"
   4.535 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#800000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.536 +       xml:space="preserve"><tspan
   4.537 +         y="755.27441"
   4.538 +         x="352.7023"
   4.539 +         sodipodi:role="line"
   4.540 +         id="tspan5632">Timeline B</tspan></text>
   4.541 +    <path
   4.542 +       inkscape:connector-curvature="0"
   4.543 +       style="fill:none;stroke:#800000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)"
   4.544 +       d="m 320.08408,751.37498 c 27.45405,0 27.45405,0 27.45405,0"
   4.545 +       id="path5634" />
   4.546 +    <path
   4.547 +       id="path5636"
   4.548 +       d="m 195.41471,787.37498 c 151.68424,0 151.68424,0 151.68424,0"
   4.549 +       style="fill:#000000;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)"
   4.550 +       inkscape:connector-curvature="0" />
   4.551 +    <text
   4.552 +       xml:space="preserve"
   4.553 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.554 +       x="352.7023"
   4.555 +       y="790.02271"
   4.556 +       id="text5638"
   4.557 +       sodipodi:linespacing="100%"><tspan
   4.558 +         id="tspan5640"
   4.559 +         sodipodi:role="line"
   4.560 +         x="352.7023"
   4.561 +         y="790.02271">Physical time</tspan></text>
   4.562 +    <path
   4.563 +       inkscape:connector-curvature="0"
   4.564 +       style="fill:none;stroke:#422fac;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:none"
   4.565 +       d="m 195.92204,665.37498 c 33.06652,0 33.06652,0 33.06652,0"
   4.566 +       id="path5642" />
   4.567 +    <g
   4.568 +       id="g5644"
   4.569 +       transform="translate(-70,264)"
   4.570 +       style="stroke-width:1.79999995;stroke-miterlimit:4;stroke-dasharray:none">
   4.571 +      <path
   4.572 +         inkscape:connector-curvature="0"
   4.573 +         id="path5646"
   4.574 +         d="m 298.82881,392.82004 c 0,19.38279 0,19.38279 0,19.38279"
   4.575 +         style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none" />
   4.576 +      <text
   4.577 +         xml:space="preserve"
   4.578 +         style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.579 +         x="298.7023"
   4.580 +         y="376.52615"
   4.581 +         id="text5648"
   4.582 +         sodipodi:linespacing="100%"><tspan
   4.583 +           style="font-size:10px;text-align:center;text-anchor:middle"
   4.584 +           sodipodi:role="line"
   4.585 +           id="tspan5650"
   4.586 +           x="298.7023"
   4.587 +           y="376.52615">Suspend</tspan><tspan
   4.588 +           style="font-size:9px;text-align:center;text-anchor:middle"
   4.589 +           id="tspan5652"
   4.590 +           sodipodi:role="line"
   4.591 +           x="298.7023"
   4.592 +           y="385.74353">(Point 1.S)</tspan></text>
   4.593 +    </g>
   4.594 +    <g
   4.595 +       id="g5654"
   4.596 +       transform="translate(-60,264)"
   4.597 +       style="stroke-width:1.79999995;stroke-miterlimit:4;stroke-dasharray:none">
   4.598 +      <path
   4.599 +         inkscape:connector-curvature="0"
   4.600 +         id="path5656"
   4.601 +         d="m 378.82881,392.77746 c 0,19.15152 0,19.15152 0,19.15152"
   4.602 +         style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none" />
   4.603 +      <text
   4.604 +         xml:space="preserve"
   4.605 +         style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.606 +         x="378.7023"
   4.607 +         y="376.52615"
   4.608 +         id="text5658"
   4.609 +         sodipodi:linespacing="100%"><tspan
   4.610 +           style="font-size:9px;text-align:center;text-anchor:middle"
   4.611 +           sodipodi:role="line"
   4.612 +           id="tspan5660"
   4.613 +           x="380.20621"
   4.614 +           y="376.52615"><tspan
   4.615 +             id="tspan5662"
   4.616 +             style="font-size:10px">Resume </tspan></tspan><tspan
   4.617 +           style="font-size:9px;text-align:center;text-anchor:middle"
   4.618 +           id="tspan5664"
   4.619 +           sodipodi:role="line"
   4.620 +           x="378.7023"
   4.621 +           y="385.74353">(Point 1.R)</tspan></text>
   4.622 +    </g>
   4.623 +    <text
   4.624 +       xml:space="preserve"
   4.625 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000080;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.626 +       x="352.7023"
   4.627 +       y="669.27441"
   4.628 +       id="text5666"
   4.629 +       sodipodi:linespacing="100%"><tspan
   4.630 +         id="tspan5668"
   4.631 +         sodipodi:role="line"
   4.632 +         x="352.7023"
   4.633 +         y="669.27441">Timeline A</tspan></text>
   4.634 +    <path
   4.635 +       id="path5670"
   4.636 +       d="m 320.08408,665.37498 c 27.45405,0 27.45405,0 27.45405,0"
   4.637 +       style="fill:none;stroke:#422fac;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)"
   4.638 +       inkscape:connector-curvature="0" />
   4.639 +    <path
   4.640 +       id="path5672"
   4.641 +       d="m 227.92204,711.37498 c 15.62732,0 15.62732,0 15.62732,0"
   4.642 +       style="fill:#ff0000;stroke:#ff0000;stroke-width:1.80000007;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:3.60000014, 3.60000014;stroke-dashoffset:0;marker-end:none"
   4.643 +       inkscape:connector-curvature="0" />
   4.644 +    <path
   4.645 +       style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
   4.646 +       d="m 228.82881,701.32352 c 0,19.38279 0,19.38279 0,19.38279"
   4.647 +       id="path5674"
   4.648 +       inkscape:connector-curvature="0" />
   4.649 +    <path
   4.650 +       inkscape:connector-curvature="0"
   4.651 +       id="path5676"
   4.652 +       d="m 242.82881,701.32352 c 0,19.38279 0,19.38279 0,19.38279"
   4.653 +       style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none" />
   4.654 +    <path
   4.655 +       inkscape:connector-curvature="0"
   4.656 +       style="fill:#ff0000;stroke:#ff0000;stroke-width:1.80000007;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:3.6000001, 3.6000001;stroke-dashoffset:0;marker-end:none"
   4.657 +       d="m 265.92203,711.37498 c 28.40046,0 28.40046,0 28.40046,0"
   4.658 +       id="path5678" />
   4.659 +    <path
   4.660 +       inkscape:connector-curvature="0"
   4.661 +       id="path5680"
   4.662 +       d="m 266.82881,701.32352 c 0,19.38279 0,19.38279 0,19.38279"
   4.663 +       style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none" />
   4.664 +    <path
   4.665 +       style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
   4.666 +       d="m 294.82881,701.32352 c 0,19.38279 0,19.38279 0,19.38279"
   4.667 +       id="path5682"
   4.668 +       inkscape:connector-curvature="0" />
   4.669 +    <path
   4.670 +       inkscape:connector-curvature="0"
   4.671 +       id="path5684"
   4.672 +       d="m 228.82881,677.32352 c 0,19.38279 0,19.38279 0,19.38279"
   4.673 +       style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:3.6, 3.6;stroke-dashoffset:0;marker-end:url(#Arrow2Send)" />
   4.674 +    <path
   4.675 +       style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:3.6, 3.6;stroke-dashoffset:0;marker-end:url(#Arrow2Send)"
   4.676 +       d="m 266.82881,744.24025 c 0,-19.38279 0,-19.38279 0,-19.38279"
   4.677 +       id="path5686"
   4.678 +       inkscape:connector-curvature="0" />
   4.679 +    <path
   4.680 +       sodipodi:nodetypes="csc"
   4.681 +       inkscape:connector-curvature="0"
   4.682 +       id="path5688"
   4.683 +       d="m 273.86358,709.43892 c 7.11652,-15.03735 7.78546,-21.56492 13.39171,-26.28454 10.55265,-8.88373 29.25918,-10.38025 29.25918,-10.38025"
   4.684 +       style="fill:none;stroke:#000000;stroke-width:1.5;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)" />
   4.685 +    <path
   4.686 +       style="fill:none;stroke:#000000;stroke-width:1.5;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)"
   4.687 +       d="m 284.33924,712.53107 c 5.3686,15.03735 5.87324,21.56492 10.10251,26.28454 7.96078,8.88373 22.07272,10.38025 22.07272,10.38025"
   4.688 +       id="path5690"
   4.689 +       inkscape:connector-curvature="0"
   4.690 +       sodipodi:nodetypes="csc" />
   4.691 +    <text
   4.692 +       xml:space="preserve"
   4.693 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000080;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.694 +       x="354.7023"
   4.695 +       y="481.27441"
   4.696 +       id="text5880"
   4.697 +       sodipodi:linespacing="100%"><tspan
   4.698 +         id="tspan5882"
   4.699 +         sodipodi:role="line"
   4.700 +         x="354.7023"
   4.701 +         y="481.27441"
   4.702 +         style="fill:#ff0000">Hidden</tspan><tspan
   4.703 +         sodipodi:role="line"
   4.704 +         x="354.7023"
   4.705 +         y="491.27441"
   4.706 +         id="tspan5884"
   4.707 +         style="fill:#ff0000">Timeline</tspan></text>
   4.708 +    <text
   4.709 +       xml:space="preserve"
   4.710 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.711 +       x="248.7023"
   4.712 +       y="502.52612"
   4.713 +       id="text5886"
   4.714 +       sodipodi:linespacing="100%"><tspan
   4.715 +         style="font-size:10px;text-align:center;text-anchor:middle"
   4.716 +         id="tspan5890"
   4.717 +         sodipodi:role="line"
   4.718 +         x="248.7023"
   4.719 +         y="502.52612">comm</tspan></text>
   4.720 +    <text
   4.721 +       sodipodi:linespacing="100%"
   4.722 +       id="text5894"
   4.723 +       y="466.52612"
   4.724 +       x="244.7023"
   4.725 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.726 +       xml:space="preserve"><tspan
   4.727 +         y="466.52612"
   4.728 +         x="244.7023"
   4.729 +         sodipodi:role="line"
   4.730 +         id="tspan5896"
   4.731 +         style="font-size:10px;text-align:center;text-anchor:middle">comm</tspan></text>
   4.732 +    <text
   4.733 +       xml:space="preserve"
   4.734 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.735 +       x="314.7023"
   4.736 +       y="464.52612"
   4.737 +       id="text5898"
   4.738 +       sodipodi:linespacing="100%"><tspan
   4.739 +         style="font-size:10px;text-align:center;text-anchor:middle"
   4.740 +         id="tspan5900"
   4.741 +         sodipodi:role="line"
   4.742 +         x="314.7023"
   4.743 +         y="464.52612">control</tspan></text>
   4.744 +    <text
   4.745 +       sodipodi:linespacing="100%"
   4.746 +       id="text5902"
   4.747 +       y="506.52612"
   4.748 +       x="320.7023"
   4.749 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   4.750 +       xml:space="preserve"><tspan
   4.751 +         y="506.52612"
   4.752 +         x="320.7023"
   4.753 +         sodipodi:role="line"
   4.754 +         id="tspan5904"
   4.755 +         style="font-size:10px;text-align:center;text-anchor:middle">control</tspan></text>
   4.756 +  </g>
   4.757 +</svg>
     5.1 Binary file 0__Papers/PRT/PRT__formal_def/figures/PR__timeline_dual_w_hidden.pdf has changed
     6.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     6.2 +++ b/0__Papers/PRT/PRT__formal_def/figures/PR__timeline_dual_w_hidden.svg	Tue Sep 17 06:30:06 2013 -0700
     6.3 @@ -0,0 +1,366 @@
     6.4 +<?xml version="1.0" encoding="UTF-8" standalone="no"?>
     6.5 +<!-- Created with Inkscape (http://www.inkscape.org/) -->
     6.6 +
     6.7 +<svg
     6.8 +   xmlns:dc="http://purl.org/dc/elements/1.1/"
     6.9 +   xmlns:cc="http://creativecommons.org/ns#"
    6.10 +   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    6.11 +   xmlns:svg="http://www.w3.org/2000/svg"
    6.12 +   xmlns="http://www.w3.org/2000/svg"
    6.13 +   xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
    6.14 +   xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
    6.15 +   width="744.09448819"
    6.16 +   height="1052.3622047"
    6.17 +   id="svg2"
    6.18 +   sodipodi:version="0.32"
    6.19 +   inkscape:version="0.48.2 r9819"
    6.20 +   sodipodi:docname="PR__timeline_dual_w_hidden.svg"
    6.21 +   inkscape:output_extension="org.inkscape.output.svg.inkscape"
    6.22 +   version="1.1">
    6.23 +  <defs
    6.24 +     id="defs4">
    6.25 +    <marker
    6.26 +       inkscape:stockid="Arrow2Send"
    6.27 +       orient="auto"
    6.28 +       refY="0.0"
    6.29 +       refX="0.0"
    6.30 +       id="Arrow2Send"
    6.31 +       style="overflow:visible;">
    6.32 +      <path
    6.33 +         id="path4262"
    6.34 +         style="font-size:12.0;fill-rule:evenodd;stroke-width:0.62500000;stroke-linejoin:round;"
    6.35 +         d="M 8.7185878,4.0337352 L -2.2072895,0.016013256 L 8.7185884,-4.0017078 C 6.9730900,-1.6296469 6.9831476,1.6157441 8.7185878,4.0337352 z "
    6.36 +         transform="scale(0.3) rotate(180) translate(-2.3,0)" />
    6.37 +    </marker>
    6.38 +    <marker
    6.39 +       inkscape:stockid="Arrow1Mend"
    6.40 +       orient="auto"
    6.41 +       refY="0.0"
    6.42 +       refX="0.0"
    6.43 +       id="Arrow1Mend"
    6.44 +       style="overflow:visible;">
    6.45 +      <path
    6.46 +         id="path4238"
    6.47 +         d="M 0.0,0.0 L 5.0,-5.0 L -12.5,0.0 L 5.0,5.0 L 0.0,0.0 z "
    6.48 +         style="fill-rule:evenodd;stroke:#000000;stroke-width:1.0pt;marker-start:none;"
    6.49 +         transform="scale(0.4) rotate(180) translate(10,0)" />
    6.50 +    </marker>
    6.51 +    <marker
    6.52 +       inkscape:stockid="Arrow2Mend"
    6.53 +       orient="auto"
    6.54 +       refY="0.0"
    6.55 +       refX="0.0"
    6.56 +       id="Arrow2Mend"
    6.57 +       style="overflow:visible;">
    6.58 +      <path
    6.59 +         id="path4008"
    6.60 +         style="font-size:12.0;fill-rule:evenodd;stroke-width:0.62500000;stroke-linejoin:round;"
    6.61 +         d="M 8.7185878,4.0337352 L -2.2072895,0.016013256 L 8.7185884,-4.0017078 C 6.9730900,-1.6296469 6.9831476,1.6157441 8.7185878,4.0337352 z "
    6.62 +         transform="scale(0.6) rotate(180) translate(0,0)" />
    6.63 +    </marker>
    6.64 +    <inkscape:perspective
    6.65 +       sodipodi:type="inkscape:persp3d"
    6.66 +       inkscape:vp_x="0 : 526.18109 : 1"
    6.67 +       inkscape:vp_y="0 : 1000 : 0"
    6.68 +       inkscape:vp_z="744.09448 : 526.18109 : 1"
    6.69 +       inkscape:persp3d-origin="372.04724 : 350.78739 : 1"
    6.70 +       id="perspective10" />
    6.71 +    <inkscape:perspective
    6.72 +       id="perspective11923"
    6.73 +       inkscape:persp3d-origin="0.5 : 0.33333333 : 1"
    6.74 +       inkscape:vp_z="1 : 0.5 : 1"
    6.75 +       inkscape:vp_y="0 : 1000 : 0"
    6.76 +       inkscape:vp_x="0 : 0.5 : 1"
    6.77 +       sodipodi:type="inkscape:persp3d" />
    6.78 +  </defs>
    6.79 +  <sodipodi:namedview
    6.80 +     id="base"
    6.81 +     pagecolor="#ffffff"
    6.82 +     bordercolor="#666666"
    6.83 +     borderopacity="1.0"
    6.84 +     gridtolerance="10000"
    6.85 +     guidetolerance="10"
    6.86 +     objecttolerance="10"
    6.87 +     inkscape:pageopacity="0.0"
    6.88 +     inkscape:pageshadow="2"
    6.89 +     inkscape:zoom="1.3364318"
    6.90 +     inkscape:cx="214.9176"
    6.91 +     inkscape:cy="603.68563"
    6.92 +     inkscape:document-units="px"
    6.93 +     inkscape:current-layer="layer1"
    6.94 +     showgrid="false"
    6.95 +     inkscape:window-width="1317"
    6.96 +     inkscape:window-height="878"
    6.97 +     inkscape:window-x="7"
    6.98 +     inkscape:window-y="1"
    6.99 +     inkscape:window-maximized="0" />
   6.100 +  <metadata
   6.101 +     id="metadata7">
   6.102 +    <rdf:RDF>
   6.103 +      <cc:Work
   6.104 +         rdf:about="">
   6.105 +        <dc:format>image/svg+xml</dc:format>
   6.106 +        <dc:type
   6.107 +           rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
   6.108 +        <dc:title></dc:title>
   6.109 +      </cc:Work>
   6.110 +    </rdf:RDF>
   6.111 +  </metadata>
   6.112 +  <g
   6.113 +     inkscape:label="Layer 1"
   6.114 +     inkscape:groupmode="layer"
   6.115 +     id="layer1">
   6.116 +    <path
   6.117 +       inkscape:connector-curvature="0"
   6.118 +       style="fill:#800000;stroke:#800000;stroke-width:1.80000007;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:none"
   6.119 +       d="m 195.48813,523.37498 c 69.82336,0 69.82336,0 69.82336,0"
   6.120 +       id="path5552" />
   6.121 +    <path
   6.122 +       style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
   6.123 +       d="m 266.82881,514.82004 c 0,19.38279 0,19.38279 0,19.38279"
   6.124 +       id="path5556"
   6.125 +       inkscape:connector-curvature="0" />
   6.126 +    <text
   6.127 +       sodipodi:linespacing="100%"
   6.128 +       id="text5558"
   6.129 +       y="540.52612"
   6.130 +       x="264.7023"
   6.131 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   6.132 +       xml:space="preserve"><tspan
   6.133 +         y="540.52612"
   6.134 +         x="264.7023"
   6.135 +         id="tspan5560"
   6.136 +         sodipodi:role="line"
   6.137 +         style="font-size:10px;text-align:center;text-anchor:middle">Suspend</tspan><tspan
   6.138 +         y="549.74353"
   6.139 +         x="264.7023"
   6.140 +         sodipodi:role="line"
   6.141 +         id="tspan5562"
   6.142 +         style="font-size:9px;text-align:center;text-anchor:middle">(Point 2.S)</tspan></text>
   6.143 +    <path
   6.144 +       style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
   6.145 +       d="m 318.82881,514.77746 c 0,19.15152 0,19.15152 0,19.15152"
   6.146 +       id="path5566"
   6.147 +       inkscape:connector-curvature="0" />
   6.148 +    <text
   6.149 +       sodipodi:linespacing="100%"
   6.150 +       id="text5568"
   6.151 +       y="540.52612"
   6.152 +       x="320.7023"
   6.153 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   6.154 +       xml:space="preserve"><tspan
   6.155 +         y="540.52612"
   6.156 +         x="322.20621"
   6.157 +         id="tspan5570"
   6.158 +         sodipodi:role="line"
   6.159 +         style="font-size:9px;text-align:center;text-anchor:middle"><tspan
   6.160 +           style="font-size:10px"
   6.161 +           id="tspan5572">Resume </tspan></tspan><tspan
   6.162 +         y="549.74353"
   6.163 +         x="320.7023"
   6.164 +         sodipodi:role="line"
   6.165 +         id="tspan5574"
   6.166 +         style="font-size:9px;text-align:center;text-anchor:middle">(Point 2.R)</tspan></text>
   6.167 +    <text
   6.168 +       xml:space="preserve"
   6.169 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#800000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   6.170 +       x="354.7023"
   6.171 +       y="527.27441"
   6.172 +       id="text5576"
   6.173 +       sodipodi:linespacing="100%"><tspan
   6.174 +         id="tspan5578"
   6.175 +         sodipodi:role="line"
   6.176 +         x="354.7023"
   6.177 +         y="527.27441">Timeline B</tspan></text>
   6.178 +    <path
   6.179 +       id="path5580"
   6.180 +       d="m 320.08409,523.37498 c 28.16395,0 28.16395,0 28.16395,0"
   6.181 +       style="fill:none;stroke:#800000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)"
   6.182 +       inkscape:connector-curvature="0" />
   6.183 +    <path
   6.184 +       inkscape:connector-curvature="0"
   6.185 +       style="fill:#000000;stroke:#000000;stroke-width:1.80000007;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)"
   6.186 +       d="m 195.41472,559.37498 c 153.16627,0 153.16627,0 153.16627,0"
   6.187 +       id="path5582" />
   6.188 +    <text
   6.189 +       sodipodi:linespacing="100%"
   6.190 +       id="text5584"
   6.191 +       y="562.02271"
   6.192 +       x="354.05777"
   6.193 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   6.194 +       xml:space="preserve"><tspan
   6.195 +         y="562.02271"
   6.196 +         x="354.05777"
   6.197 +         sodipodi:role="line"
   6.198 +         id="tspan5586">Physical time</tspan></text>
   6.199 +    <path
   6.200 +       id="path5588"
   6.201 +       d="m 195.17378,437.37498 c 33.06652,0 33.06652,0 33.06652,0"
   6.202 +       style="fill:none;stroke:#422fac;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:none"
   6.203 +       inkscape:connector-curvature="0" />
   6.204 +    <g
   6.205 +       style="stroke-width:1.79999995;stroke-miterlimit:4;stroke-dasharray:none"
   6.206 +       transform="translate(-70,36)"
   6.207 +       id="g5590">
   6.208 +      <path
   6.209 +         style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
   6.210 +         d="m 298.82881,392.82004 c 0,19.38279 0,19.38279 0,19.38279"
   6.211 +         id="path5592"
   6.212 +         inkscape:connector-curvature="0" />
   6.213 +      <text
   6.214 +         sodipodi:linespacing="100%"
   6.215 +         id="text5594"
   6.216 +         y="376.52615"
   6.217 +         x="298.7023"
   6.218 +         style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   6.219 +         xml:space="preserve"><tspan
   6.220 +           y="376.52615"
   6.221 +           x="298.7023"
   6.222 +           id="tspan5596"
   6.223 +           sodipodi:role="line"
   6.224 +           style="font-size:10px;text-align:center;text-anchor:middle">Suspend</tspan><tspan
   6.225 +           y="385.74353"
   6.226 +           x="298.7023"
   6.227 +           sodipodi:role="line"
   6.228 +           id="tspan5598"
   6.229 +           style="font-size:9px;text-align:center;text-anchor:middle">(Point 1.S)</tspan></text>
   6.230 +    </g>
   6.231 +    <g
   6.232 +       style="stroke-width:1.79999995;stroke-miterlimit:4;stroke-dasharray:none"
   6.233 +       transform="translate(-60,36)"
   6.234 +       id="g5600">
   6.235 +      <path
   6.236 +         style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
   6.237 +         d="m 378.82881,392.77746 c 0,19.15152 0,19.15152 0,19.15152"
   6.238 +         id="path5602"
   6.239 +         inkscape:connector-curvature="0" />
   6.240 +      <text
   6.241 +         sodipodi:linespacing="100%"
   6.242 +         id="text5604"
   6.243 +         y="376.52615"
   6.244 +         x="378.7023"
   6.245 +         style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   6.246 +         xml:space="preserve"><tspan
   6.247 +           y="376.52615"
   6.248 +           x="380.20621"
   6.249 +           id="tspan5606"
   6.250 +           sodipodi:role="line"
   6.251 +           style="font-size:9px;text-align:center;text-anchor:middle"><tspan
   6.252 +             style="font-size:10px"
   6.253 +             id="tspan5608">Resume </tspan></tspan><tspan
   6.254 +           y="385.74353"
   6.255 +           x="378.7023"
   6.256 +           sodipodi:role="line"
   6.257 +           id="tspan5610"
   6.258 +           style="font-size:9px;text-align:center;text-anchor:middle">(Point 1.R)</tspan></text>
   6.259 +    </g>
   6.260 +    <text
   6.261 +       sodipodi:linespacing="100%"
   6.262 +       id="text5612"
   6.263 +       y="441.27441"
   6.264 +       x="354.7023"
   6.265 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000080;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   6.266 +       xml:space="preserve"><tspan
   6.267 +         y="441.27441"
   6.268 +         x="354.7023"
   6.269 +         sodipodi:role="line"
   6.270 +         id="tspan5614">Timeline A</tspan></text>
   6.271 +    <path
   6.272 +       inkscape:connector-curvature="0"
   6.273 +       style="fill:none;stroke:#422fac;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)"
   6.274 +       d="m 320.08409,437.37498 c 28.16395,0 28.16395,0 28.16395,0"
   6.275 +       id="path5616" />
   6.276 +    <path
   6.277 +       inkscape:connector-curvature="0"
   6.278 +       style="fill:#ff0000;stroke:#ff0000;stroke-width:1.80000007;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:3.60000016, 3.60000016;stroke-dashoffset:0;marker-end:url(#Arrow2Mend)"
   6.279 +       d="m 196.11806,483.37498 c 152.64336,0 152.64336,0 152.64336,0"
   6.280 +       id="path3063" />
   6.281 +    <path
   6.282 +       style="fill:none;stroke:#000000;stroke-width:1.80000007;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:3.60000001, 3.60000001;stroke-dashoffset:0;marker-end:url(#Arrow2Send)"
   6.283 +       d="m 228.82881,449.32353 c 0,29.78359 0,29.78359 0,29.78359"
   6.284 +       id="path3086"
   6.285 +       inkscape:connector-curvature="0" />
   6.286 +    <path
   6.287 +       inkscape:connector-curvature="0"
   6.288 +       id="path5044"
   6.289 +       d="m 266.82881,516.24027 c 0,-29.74405 0,-29.74405 0,-29.74405"
   6.290 +       style="fill:none;stroke:#000000;stroke-width:1.79999995;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:3.60000002, 3.60000002;stroke-dashoffset:0;marker-end:url(#Arrow2Send)" />
   6.291 +    <path
   6.292 +       style="fill:none;stroke:#000000;stroke-width:1.5;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)"
   6.293 +       d="m 293.31837,481.43892 c 3.87039,-15.03735 4.2342,-21.56492 7.28321,-26.28454 5.73916,-8.88373 15.91289,-10.38025 15.91289,-10.38025"
   6.294 +       id="path5048"
   6.295 +       inkscape:connector-curvature="0"
   6.296 +       sodipodi:nodetypes="csc" />
   6.297 +    <path
   6.298 +       sodipodi:nodetypes="csc"
   6.299 +       inkscape:connector-curvature="0"
   6.300 +       id="path5608"
   6.301 +       d="m 301.54925,484.53107 c 2.49703,15.03735 2.73174,21.56492 4.69884,26.28454 3.70269,8.88373 10.26639,10.38025 10.26639,10.38025"
   6.302 +       style="fill:none;stroke:#000000;stroke-width:1.5;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;marker-end:url(#Arrow2Mend)" />
   6.303 +    <text
   6.304 +       xml:space="preserve"
   6.305 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000080;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   6.306 +       x="354.7023"
   6.307 +       y="481.27441"
   6.308 +       id="text5880"
   6.309 +       sodipodi:linespacing="100%"><tspan
   6.310 +         id="tspan5882"
   6.311 +         sodipodi:role="line"
   6.312 +         x="354.7023"
   6.313 +         y="481.27441"
   6.314 +         style="fill:#ff0000">Hidden</tspan><tspan
   6.315 +         sodipodi:role="line"
   6.316 +         x="354.7023"
   6.317 +         y="491.27441"
   6.318 +         id="tspan5884"
   6.319 +         style="fill:#ff0000">Timeline</tspan></text>
   6.320 +    <text
   6.321 +       xml:space="preserve"
   6.322 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   6.323 +       x="248.7023"
   6.324 +       y="502.52612"
   6.325 +       id="text5886"
   6.326 +       sodipodi:linespacing="100%"><tspan
   6.327 +         style="font-size:10px;text-align:center;text-anchor:middle"
   6.328 +         id="tspan5890"
   6.329 +         sodipodi:role="line"
   6.330 +         x="248.7023"
   6.331 +         y="502.52612">comm</tspan></text>
   6.332 +    <text
   6.333 +       sodipodi:linespacing="100%"
   6.334 +       id="text5894"
   6.335 +       y="466.52612"
   6.336 +       x="244.7023"
   6.337 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   6.338 +       xml:space="preserve"><tspan
   6.339 +         y="466.52612"
   6.340 +         x="244.7023"
   6.341 +         sodipodi:role="line"
   6.342 +         id="tspan5896"
   6.343 +         style="font-size:10px;text-align:center;text-anchor:middle">comm</tspan></text>
   6.344 +    <text
   6.345 +       xml:space="preserve"
   6.346 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   6.347 +       x="314.7023"
   6.348 +       y="464.52612"
   6.349 +       id="text5898"
   6.350 +       sodipodi:linespacing="100%"><tspan
   6.351 +         style="font-size:10px;text-align:center;text-anchor:middle"
   6.352 +         id="tspan5900"
   6.353 +         sodipodi:role="line"
   6.354 +         x="314.7023"
   6.355 +         y="464.52612">control</tspan></text>
   6.356 +    <text
   6.357 +       sodipodi:linespacing="100%"
   6.358 +       id="text5902"
   6.359 +       y="506.52612"
   6.360 +       x="320.7023"
   6.361 +       style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS"
   6.362 +       xml:space="preserve"><tspan
   6.363 +         y="506.52612"
   6.364 +         x="320.7023"
   6.365 +         sodipodi:role="line"
   6.366 +         id="tspan5904"
   6.367 +         style="font-size:10px;text-align:center;text-anchor:middle">control</tspan></text>
   6.368 +  </g>
   6.369 +</svg>
     7.1 --- a/0__Papers/PRT/PRT__formal_def/latex/PRT__full_w_Henning_derived_formal_def.tex	Sat Aug 03 19:24:22 2013 -0700
     7.2 +++ b/0__Papers/PRT/PRT__formal_def/latex/PRT__full_w_Henning_derived_formal_def.tex	Tue Sep 17 06:30:06 2013 -0700
     7.3 @@ -49,7 +49,10 @@
     7.4  \preprintfooter{short description of paper}   % 'preprint' option specified.
     7.5  
     7.6  
     7.7 -\title{A Proto-Runtime Approach to Domain Specific Languages}
     7.8 +\title{The Proto-Runtime Abstraction for Construction
     7.9 +of Parallel Language Runtime Systems}
    7.12  
    7.13  
    7.14  \authorinfo{Sean Halle}
    7.15 @@ -68,17 +71,22 @@
    7.16  
    7.17  \begin{abstract}
    7.18   
    7.19 +
    7.20 +
    7.21 +Domain Specific Languages that are embedded into a base language promise productivity, performance portability, and wide adoption for parallel programming. However, such languages have too few users to support the large effort required to create them and port them across hardware platforms, resulting in low adoption of the method.
    7.22 +As one step to ameliorate this, we apply the proto-runtime approach, which reduces the effort to create and port the runtime systems of parallel languages. It modularizes the creation of runtime systems and the parallelism constructs they implement, by providing an interface
    7.23 +that separates the language-construct and scheduling logic from the low-level runtime details, including concurrency, memory consistency, and runtime-performance aspects.
    7.24 +As a result, new parallel constructs are written using sequential reasoning,  multiple languages can be mixed within
    7.25 +the same program, and reusable services such as performance
    7.26 +tuning and debugging
    7.27 +support are available. In addition, scheduling of work onto hardware is under language and application control, without interference from an underlying thread package scheduler. This enables higher quality scheduling decisions for higher application performance.
    7.28 +We present measurements of the time taken to develop runtimes for new languages, as well as the time to re-implement existing ones, which averages a few days each.  In addition, we measure the performance of implementations
    7.29 +based on proto-runtime, going head-to-head with the standard distributions of Cilk, StarSs (OMPSs), and POSIX threads, showing that the proto-runtime matches or outperforms them on large servers in all cases.
    7.30 +
    7.31  ?
    7.32 -replace lang-specific with interface, centralize services, minimize effort to create, give language control over hardware assignment..  side benefits: multi-lang, perf-tuning, debugging
    7.33 -
    7.34 -?
    7.35 -
    7.36 -Domain Specific Languages that are embedded into a base language have promise to provide productivity, performant-portability and wide adoption for parallel programming. However such languages have too few users to support the large effort required to create them and port them across hardware platforms, resulting in low adoption of the method.
    7.37 -To solve this, we introduce a proto-runtime approach, which reduces the effort to create and port domain specific languages. It modularizes the creation of runtime systems and the parallelism constructs they implement, by separating the language-construct  and scheduling logic away from the low-level runtime details, including concurrency, memory consistency, and runtime-performance aspects.
    7.38 -As a result, new parallel constructs are written using sequential reasoning, and multiple languages can be mixed within
    7.39 -the same program. In addition, scheduling of work onto hardware is under language and application control, without interference from an underlying thread package scheduler. This enables higher quality scheduling decisions for higher application performance.
    7.40 -We present measurements of the time taken to develop runtimes for  new languages, as well as time to re-implement existing ones,  which average  a few days each.  In addition, we measure performance of proto-runtime based implementations going head-to-head with the standard distributions of Cilk, StarSs (OMPSs), and posix threads, showing that the proto-runtime matches or outperforms on large servers in all cases.
    7.41 -\end{abstract}
    7.42 +
    7.43 +
    7.44 +replace lang-specific with interface, centralize services, minimize effort to create, give language control over hardware assignment..  side benefits: multi-lang, perf-tuning, debugging\end{abstract}
    7.45  
    7.46  
    7.47  
    7.48 @@ -90,15 +98,22 @@
    7.49  
    7.50  [Note to reviewers: this paper's style and structure follow the official PPoPP guide to writing style, which is linked to the PPoPP website. We are taking on faith that the approach has been communicated effectively to reviewers and that we won't be penalized for following its recommended structure and approach.]
    7.51  
    7.52 -Programming in the past has been overwhelmingly sequential, with the applications being run on sequential hardware.  But the laws of physics have forced the hardware to become parallel, which will force nearly all future programming to  become parallel programming.  However,  the transition from sequential to parallel programming has been slow due to  the difficulty of the traditional parallel programming methods. 
    7.53 -
    7.54 -The difficulties with parallel programming fall into three main categories: 1)  difficult mental model, 2) extra effort to rewrite the code for each hardware target to get acceptable performance and 3) disruption to existing practices, including steep learning curve, changes to the tools used, and changes in design practices. 
    7.55 -
    7.56 -Many believe that these can be overcome with the use of embedded style Domain-Specific Languages (eDSLs) []. eDSL language
    7.57 +The degree of parallelism in hardware steadily increases, but programming has not kept pace, instead relying
    7.58 +upon band-aid measures to make use of relatively
    7.59 +coarse-grained multi-cores.  Pressure continues to mount to
    7.60 +integrate parallelism into every aspect of programming.
    7.61 +However, the transition has been slow due to difficulties
    7.62 +with the traditional parallel programming methods. 
    7.63 +
    7.64 +The main difficulties with those parallel programming methods are: 1) a difficult mental model, which reduces productivity, 2) the additional effort to rewrite the code for each hardware target to get acceptable performance, and 3) disruption to existing practices, including a steep learning curve, changes to the tools used, and changes in work practices. 
    7.65 +
    7.66 +New languages and tools are being investigated to mitigate
    7.67 +these problems. Many  believe that one promising approach
    7.68 +is embedded-style parallel Domain-Specific Languages (epDSLs) []. epDSL language
    7.69  constructs match the mental model of the domain, while
    7.70  they internally imply parallelism. For example, a simulation
    7.71 -eDSL called HWSim[] has only 10 constructs, which match
    7.72 -the actions taken during a simulation
    7.73 +epDSL called HWSim[] has only 10 constructs, which match
    7.74 +the actions taken during simulation
    7.75  of interacting objects.  They are mixed into sequential C code and take
    7.76  only a couple of hours to learn.  Yet they encapsulate subtle
    7.77  and complex dependencies that relate simulated time
    7.78 @@ -108,63 +123,83 @@
    7.79  
    7.80  
    7.81  
    7.82 - Despite this, such languages have been slow to adopt, we believe due to the cost to create them and to port them across hardware targets. The small number of users of each language, which is specific to a narrow domain, makes this cost impractical.
    7.83 -
    7.84 -We propose that a method that makes Domain Specific Languages (DSLs) low cost to produce as well as to port across hardware targets will allow them to fulfill their promise, and we introduce what we call a proto-runtime to help towards this goal.  
    7.85 -
    7.86 -The proto-runtime approach is a normal, full, runtime, but with two key pieces replaced by an interface. One  piece replaced is the logic of language constructs, and the other is logic for choosing which core to assign work onto. The remaining proto-runtime piece handles the  low-level hardware details of the runtime. 
    7.87 -
    7.88 -The decomposition into a proto-runtime, plus  plugged-in  language behaviors, modularizes the construction of runtimes.  The proto-runtime is one module, which  embodies runtime internals, which are hardware oriented and independent of language. The plugged-in portions form the two other modules, which are language specific. The interface between them   occurs at a natural boundary, which separates   the hardware oriented portion of a runtime from the language oriented portion. 
    7.89 + Despite this, the adoption of such languages has been slow; we believe this is due to the cost of creating them and of porting them across hardware targets. The small number of users of each language, which is specific to a narrow domain, makes this cost impractical.
    7.90 +
    7.91 +We propose that a method that makes epDSLs less costly to produce, as well as to port across hardware targets, will allow them to fulfill their promise. We discuss
    7.92 +the proto-runtime approach and show
    7.93 +how applying it helps toward this goal.  
    7.94 +
    7.95 +In this approach, a language's runtime system is built
    7.96 +as a plugin that is plugged into a proto-runtime instance that was separately installed on the given hardware. Together, the plugin
    7.97 +plus proto-runtime instance form the runtime system
    7.98 +of the language. The proto-runtime instance itself acts as the infrastructure of a runtime system, and
    7.99 +encapsulates most of the hardware-specific details,
   7.100 +while providing a number of services for use by the
   7.101 +plugged in language module. 
   7.102 +
   7.103 +A proto-runtime instance is essentially a full runtime, but with two key pieces replaced by an interface. One  piece replaced is the logic of language constructs, and the other is logic for choosing which core to assign work onto. The proto-runtime instance then supplies
   7.104 +the rest of the runtime system. 
   7.105 +
   7.106 +The decomposition into a proto-runtime plus plugged-in language behaviors modularizes the construction of runtimes.  The proto-runtime is one module, which embodies the runtime internals that are hardware oriented and independent of language. The plugged-in portions form the two other modules, which are language specific. The interface between them occurs at a natural boundary, which separates the hardware-oriented portion of a runtime from the language-oriented portion. 
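To make the separation concrete, the two plugged-in pieces can be sketched as callbacks behind an interface. This is only an illustrative sketch under assumed names: the types and functions below are invented and are not the actual proto-runtime API.

```c
#include <assert.h>

/* Hypothetical sketch of the two language-specific pieces that
 * plug into a proto-runtime instance (names invented here). */
typedef struct {
    /* language-construct logic: updates language-visible state
     * when a construct (e.g. spawn, send, wait) is invoked */
    void (*handle_construct)(int construct_id, void *lang_state);
    /* assigner: chooses which core receives a ready work unit */
    int  (*assign_work)(int work_unit, int num_cores);
} LangPlugin;

/* Stand-in for the proto-runtime's dispatch loop: it owns the
 * low-level details and serializes calls into the plugin, so the
 * plugin itself can be written with sequential reasoning. */
static int dispatch(const LangPlugin *p, int construct_id,
                    void *lang_state, int work_unit, int num_cores) {
    p->handle_construct(construct_id, lang_state);
    return p->assign_work(work_unit, num_cores);
}

/* A trivial "language module": count constructs, round-robin work. */
static void count_construct(int id, void *state) { (void)id; ++*(int *)state; }
static int  round_robin(int work_unit, int num_cores) { return work_unit % num_cores; }
```

The point of the sketch is only the shape of the boundary: everything above `dispatch` is hardware oriented and language independent, while the two small callbacks carry all language-specific behavior.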
   7.107  
   7.108  We claim the following benefits of the proto-runtime approach, each of which is  supported in the indicated section of  the paper:
   7.109  
   7.110  \begin{itemize}
   7.111  
   7.112 -\item The proto-runtime approach should reliably apply to future languages and hardware.  because the patterns underlying it appear to be fundamental and so should hold for future languages and architectures (\S\ref{subsec:TiePoints},
   7.113 -\S\ref{subsec:Example}).
   7.114 -
   7.115  \item The proto-runtime approach modularizes the runtime (\S\ref{sec:Proposal}).
   7.116  
   7.117  %\item The modularization  is consistent with patterns that appear to be fundamental to parallel computation and runtimes (\S\ ). 
   7.118  
   7.119 -\item The modularization  cleanly separates runtime internals from the language-specific logic (\S\ref{sec:Proposal},
   7.120 +\item The modularization cleanly separates the
   7.121 +hardware-related runtime internals from the language-specific logic (\S\ref{sec:Proposal},
   7.122  \S\ref{subsec:Example}). 
   7.123  
   7.124  \item The modularization gives the language control
   7.125  over timing and placement of executing work (\S\ref{sec:Proposal}).
   7.126  
   7.127 +
   7.128 +\item
   7.129 +
   7.130 +The modularization  selectively exposes hardware aspects relevant to placement of work. If the language takes advantage of this, it  can result in reduced communication between cores and increased application performance  (\S\ ).
   7.131 +
   7.132 +\begin{itemize}
   7.133 +
   7.134 +\item Similar control over hardware is not possible when the language is   built on top of a package like Posix threads or TBB, which has its own work-to-hardware assignment   (\S\ref{sec:Related}).
   7.135 +
   7.136 +\end{itemize}
   7.137 +
   7.138 +
   7.139  \item The modularization results in reduced time to implement a new language's behavior, and in reduced time to port a language to new hardware (\S\ref{sec:Proposal},
   7.140  \S\ref{subsec:ImplTimeMeas}).
   7.141  
   7.142  \begin{itemize}
   7.143  
   7.144  
   7.145 -\item  Part of the time reduction is due to the proto-runtime providing a centralized location for services for all languages to use, so the language doesn't have to provide them separately.  Such services include debugging facilities, automated verification, concurrency handling, hardware performance information gathering, and so on  (\S\ ).
   7.146 -
   7.147 -\item Part of the time reduction is due to encapsulation of hardware aspects inside the hardware-oriented module (\S \ref{sec:intro}).
   7.148 -
   7.149 -\item Part of the time reduction is due to  reuse of the performance-tuning effort for runtime internals (\S ).  
   7.150 -
   7.151 -\item  Part of the time reduction is due to using sequential thinking when implementing the language logic, enabled by  the proto-runtime protecting shared internal runtime state and exporting an interface that presents a sequential model  (\S\ref{subsec:Example}). 
   7.152 +\item  Part of the time reduction is due to the proto-runtime providing common services for all languages to (re)use.  Such services include debugging facilities, automated verification, concurrency handling, dynamic performance measurements for use in assignment and auto-tuning, and so on  (\S\ ).
   7.153 +
   7.154 +\item Part is due to hiding the
   7.155 +low-level hardware aspects inside the proto-runtime module,
   7.156 +independent of the language (\S \ref{sec:intro}).
   7.157 +
   7.158 +\item Part is due to reuse of the effort of performance-tuning the runtime internals (\S ).  
   7.159 +
   7.160 +\item  Part is due to using sequential thinking when implementing the language logic, enabled by  the proto-runtime protecting shared internal runtime state and exporting an interface that presents a sequential model  (\S\ref{subsec:Example}). 
   7.161  
   7.162  
   7.163  \end{itemize}
   7.164  
   7.165 -\item
   7.166 -
   7.167 -The modularization also selectively exposes hardware aspects relevant to placement of work, giving the language  control over placement of work onto the hardware. If the language takes advantage of this, it  can result in reduced communication between cores and increased application performance  (\S\ ).
   7.168 -
   7.169 -\begin{itemize}
   7.170 -
   7.171 -\item Similar control over hardware is not possible when the language is   built on top of a package like Posix threads or TBB, which has its own work-to-hardware assignment   (\S\ref{sec:Related}).
   7.172 +\item Modularization with similar benefits does not appear possible when using a package such as Posix threads or TBB,  unless the package itself is modified and then used  according to the proto-runtime pattern  (\S\ref{sec:Related}).
   7.173 +
   7.174 +
   7.175 +\item The proto-runtime approach appears to future-proof language
   7.176 +runtime
   7.177 +construction, because the patterns underlying proto-runtime appear to be fundamental (\S\ref{subsec:TiePoints},
   7.178 +\S\ref{subsec:Example}) and so should hold for future architectures. Plugins can be reused on those architectures, although performance-related updates to the
   7.179 +plugins may be desired.
   7.180  
   7.181  \end{itemize}
   7.182  
   7.183 -\item Modularization with similar benefits does not appear possible when using a package such as Posix threads or TBB,  unless the package itself is modified and then used  according to the proto-runtime pattern  (\S\ref{sec:Related}).
   7.184 -
   7.185 -\end{itemize}
   7.186 -
   7.187 -The paper is organized as follows: We first expand on the value of embedded style DSLs (eDSL), and where the effort goes when creating one (\S\ref{subsec:eDSLEffort}). We focus on the role that  runtime implementation effort plays in the adoption of eDSLs, which motivates the value of the  savings provided by the proto-runtime approach. We then move on to the details of the proto-runtime approach (\S\ref{sec:Proposal}), and tie them to how a runtime is modularized (\S\ref{subsec:Modules}), covering how each claimed benefit is provided. 
   7.188 +The paper is organized as follows: We first expand on the value of embedded style parallel DSLs (epDSLs), and where the effort goes when creating one (\S\ref{subsec:eDSLEffort}). We focus on the role that  runtime implementation effort plays in the adoption of epDSLs, which motivates the value of the  savings provided by the proto-runtime approach. We then move on to the details of the proto-runtime approach (\S\ref{sec:Proposal}), and tie them to how a runtime is modularized (\S\ref{subsec:Modules}), covering how each claimed benefit is provided. 
   7.189  We then show overhead measurements (\S\ref{subsec:OverheadMeas}) and implementation time measurements (\S\ref{subsec:ImplTimeMeas} ), which indicate that the proto-runtime approach is performance competitive while significantly reducing implementation and porting effort.
   7.190  With that  understanding in hand, we then discuss  how the approach compares to related work (\S\ref{sec:Related}), and finally, we highlight the main conclusions drawn from the research (\S\ref{sec:Conclusion}).
   7.191  
   7.192 @@ -174,7 +209,7 @@
   7.193  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   7.194  %
   7.195  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   7.196 -\section{Background: The eDSL Hypothesis}
   7.197 +\section{Background: The epDSL Hypothesis}
   7.198  
   7.199  %[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   7.200  
   7.201 @@ -186,10 +221,10 @@
   7.202  
   7.203  Domain Specific Languages have been around for a while [], and recently have been suggested as a good approach for parallel programming[][stanford PPL].
   7.204  
   7.205 -In essence, a DSL, or just Domain Language, captures patterns that are common in a particular domain of expertise, such as user interfaces, simulations of physical phenomena, bio-informatics, cosmology, and so on.  Each domain has a particular set of mental models, common types of computation, and common kinds of data structures. A  DSL captures these common elements in custom syntax.
   7.206 +In essence, a DSL, or just Domain Language, captures patterns that are common in a particular domain of expertise, such as user interfaces, simulations of physical systems, bio-informatics,  and so on.  Each domain has a particular set of mental models, common types of computation, and common kinds of data structures. A  DSL captures these common elements in custom syntax.
   7.207   
   7.208  
   7.209 -The custom syntax can capture parallelism information while simultaneously being natural to think about. In practice, multiple aspects of domains provide opportunities for parallelism. For example, the custom data structures seen by the coder can be internally implemented with distributed algorithms; common operations in the domain can be internally implemented with parallel algorithms; and, the domain constructs often imply dependencies. All of these are gained without the programmer being aware of this implied parallelism. 
   7.210 +The custom syntax can capture parallelism information while simultaneously being natural to think about. In practice, multiple aspects of domains provide opportunities for parallelism. For example, the custom data structures seen by the coder can be internally implemented with distributed algorithms; common operations in the domain can be internally implemented with parallel algorithms; and, the domain constructs often imply dependencies. All of these are gained without the programmer being aware of this implied parallelism; they just follow simple language usage rules. 
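As a toy illustration of a common domain operation hiding its parallel implementation (every name here is invented; this is not a real epDSL's API), the coder writes one sequential-looking call while the language implementation is free to split the work across cores:

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical domain operation "sum a field": the application sees
 * a plain function call; the parallelism lives inside the language
 * implementation, behind the construct boundary. */
typedef struct { const double *in; size_t lo, hi; double part; } Slice;

static void *sum_slice(void *arg) {
    Slice *s = (Slice *)arg;
    s->part = 0.0;
    for (size_t i = s->lo; i < s->hi; i++) s->part += s->in[i];
    return NULL;
}

/* what the coder calls -- sequential semantics, parallel inside */
double domain_sum(const double *in, size_t n) {
    Slice a = { in, 0, n / 2, 0.0 }, b = { in, n / 2, n, 0.0 };
    pthread_t t;
    pthread_create(&t, NULL, sum_slice, &a);  /* one half elsewhere */
    sum_slice(&b);                            /* other half here */
    pthread_join(t, NULL);
    return a.part + b.part;
}
```

The caller never sees threads, slices, or joins; that is the sense in which the parallelism is implied by the construct rather than expressed by the programmer.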
   7.211  
   7.212  
   7.213  
   7.214 @@ -198,62 +233,68 @@
   7.215  A style of domain language, which we feel has good adoption potential, is the so-called \textit{embedded} style of DSL (eDSL) [] [metaborg][stanford ppl]. In this variation, a program is written in a mix of a base sequential language plus domain language constructs. The syntax of the two is intermixed. A preprocessing step then translates the domain syntax into the base syntax, and includes calls to the domain language's runtime.
   7.216  
   7.217  
   7.218 -For example, use C (or Java) as the base language for an application, and mix-in custom syntax for constructs from a user-interface eDSL.  To test the code, the developer modifies the build process to first perform the translation step, then pass the resulting source through the normal C (or Java) compiler. The resulting executable contains calls to a dynamic (or shared) runtime library that becomes linked, at run time, to an implementation that has been tuned to the hardware it is running on.
   7.219 +For example, use C (or Java) as the base language for an application, then mix in custom syntax from a user-interface eDSL.  To test the code, the developer modifies the build process to first perform the translation step, then pass the resulting source through the normal compiler. The resulting executable contains calls to a runtime library that becomes linked, at run time, to an implementation that has been tuned to the hardware.
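The translation step can be pictured with a toy example. The custom syntax, the widget id, and the runtime-library call below are all invented for illustration; a real eDSL preprocessor would generate analogous but language-specific code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical epDSL source line (custom syntax, not valid C):
 *
 *     whenClicked(saveButton) { update_model(); }
 *
 * A preprocessor rewrites this into plain C plus a call into the
 * language's runtime library, roughly as sketched below. */

typedef void (*ClickHandler)(void);

/* stand-in for the eDSL's runtime library (invented name) */
static ClickHandler installed_handler = NULL;
static void dsl_on_click(int widget_id, ClickHandler h) {
    (void)widget_id;          /* a real runtime would route by widget */
    installed_handler = h;
}

static int model_updates = 0;
static void update_model(void) { model_updates++; }  /* application code */

/* what the generated (translated) source looks like: the custom
 * construct has become an ordinary C call into the runtime */
static void generated_setup(void) {
    dsl_on_click(42, update_model);  /* was: whenClicked(saveButton){...} */
}
```

Because the output is ordinary base-language source, the unmodified C (or Java) toolchain compiles it, which is what keeps the approach low-disruption.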
   7.220  
   7.221  As with HWSim, the number of such embedded
   7.222  constructs tends to be low, easy to learn, and significantly
   7.223  reduce the complexity of the code written. All while
   7.224  implicitly specifying parallelism. 
   7.225  
   7.226 -Additionally, eDSLs have more than just a syntactic advantage over libraries.  The language has a toolchain that provides build-time optimization and can take advantage of relationships among distinct constructs within the code.  The relationship information allows derivation of communication patterns that inform the choice of placement of work, which is critical to performance on parallel hardware.
   7.227 +Additionally, parallel versions, or epDSLs, have more than just a syntactic advantage over libraries.  The language has a toolchain that provides build-time optimization and can take advantage of relationships among distinct constructs within the code.  The relationship information allows derivation of communication patterns that inform the choice of placement of work, which is critical to performance on parallel hardware.
   7.228  \subsection{Low learning curve, high productivity, and portability}
   7.229 -eDSLs are generally quick to learn because the domain experts are already familiar with the concepts expressed by the custom syntax, and the number of constructs
   7.230 -tends to be low for an embedded DSL. This is especially valuable for  those who are \textit{not} expert programmers. Embedded style DSLs further reduce learning curve because they  require no new development tools nor development procedures. Together, these address the goal of  a low learning curve for switching to parallel software development.
   7.231 -
   7.232 -Productivity has been shown to be enhanced by a well designed DSL, with studies commonly measuring
   7.233 -10x reduction in development time [].  Factors
   7.234 -include simplifying the application code, modularizing it, and encapsulating  performance aspects inside the language.  Simplifying reduces the amount of code and the amount of mental effort. Modularizing separates concerns within the code and isolates aspects, which improves productivity. Encapsulating performance inside the DSL constructs removes them from the application programmer's concerns, which also improves productivity.
   7.235 + eDSLs tend to have a low learning curve because domain experts are already familiar with the concepts behind the language constructs, and there are relatively few constructs
   7.236 +for an embedded DSL. This is especially valuable for  those who are \textit{not} expert programmers. Embedded style DSLs further reduce learning curve because they  require no new development tools nor development procedures. Together, these address the goal of  a low learning curve for switching to parallel software development.
   7.237 +
   7.238 +Productivity has been shown to be enhanced by a well-designed DSL, with studies measuring
   7.239 +a 10x reduction in development time [][][].  Factors
   7.240 +behind this include simplifying the application code, modularizing it, and encapsulating  performance aspects inside the language.  Simplifying reduces the amount of code and the amount of mental effort. Modularizing separates concerns within the code and isolates aspects, which improves productivity. Encapsulating performance inside the DSL constructs removes them from the application programmer's concerns, which also improves productivity.
   7.241  
   7.242  Perhaps the most important productivity enhancement comes from hiding parallelism aspects inside the  DSL constructs. The language takes advantage of the domain patterns to present a familiar mental model, and then attaches synchronization, work-division, and communication implications to those constructs, without the programmer having to be aware of them.    Combining the simplicity, modularization, performance encapsulation, and parallelism hiding,  with congruence with the mental model of the domain,  together work towards the goal of high productivity.
   7.243   
   7.244 -Portability is aided by the encapsulation of performance aspects inside the DSL constructs. This means that the elements of the problem  that require large amounts of computation are often pulled into the language, which isolates the application code from hardware performance concerns.  Only the language implementation must adapt to new hardware in order to get high performance. Although such isolation cannot always be fully achieved, Domain Languages hold promise for making significant strides towards it.
   7.245 +Portability is aided by the encapsulation of performance aspects inside the DSL constructs. The aspects that require large amounts of computation are often pulled into the language, so only the language implementation must adapt to new hardware. Although fully achieving such isolation isn't always possible, epDSLs hold promise for making significant strides towards it.
   7.246  
   7.247  \subsection{Low disruption and easy adoption} 
   7.248  
   7.249 -Using an eDSL tends to have low disruption because the base language remains the same, along with most of the development tools and practices.
   7.250 - Constructs from the eDSL can be mixed into existing sequential code, incrementally replacing the high computation sections, while continuing with the same development  practices.
   7.251 +Using an epDSL tends to have low disruption because the base language remains the same, along with most of the development tools and practices.
   7.252 + Constructs from the epDSL can be mixed into existing sequential code, incrementally replacing the high computation sections, while continuing with the same development  practices.
   7.253   
   7.254   \subsection{Few users means the effort of epDSLs must be low} \label{subsec:eDSLEffort}
   7.255  
   7.256 -What appears to be holding eDSLs back from addressing the challenges of parallel programming would be mainly the time, expertise, and cost needed to develop an eDSL.  Because the number of users is small,  the economic model of the past doesn't apply.  For sequential languages, the potential user-base is in the millions, but for a parallel Domain Language, the user base may be only a few hundred developers who will use the language.
   7.257 -
   7.258 -As such, the effort to create a usable eDSL needs to be reduced to the point that it is viable for that size of user base.  
   7.259 -
   7.260 -The effort to be reduced falls into three categories:
   7.261 +What appears to be holding epDSLs back from widespread
   7.262 +adoption is mainly the time, expertise, and cost to develop an epDSL.  The effort to create a usable epDSL needs to be reduced to the point that it is viable for a user base of only a few hundred.  
   7.263 +
   7.264 +The effort  falls into three categories:
   7.265  
   7.266  \begin{enumerate}
   7.267 -\item effort to explore  language design and create the eDSL syntax
   7.268 -\item effort to create the runtime that produces the eDSL behavior
   7.269 -\item effort to performance tune the eDSL on particular hardware
   7.270 +\item effort to explore  language design and create the epDSL syntax
   7.271 +\item effort to create the runtime that produces the epDSL behavior
   7.272 +\item effort to performance tune the epDSL on particular hardware
   7.273  \end{enumerate}    
   7.274  
   7.275  
   7.276 -\subsection{Critical areas of effort in the big picture}
   7.277 -
   7.278 -Across the industry as a whole, when eDSLs become successful, there will be hundreds of Domain Languages, and likewise hundreds of different hardware platforms that each language must run efficiently on.  That multiplicative effect must be reduced in order to make the eDSL approach economically viable.
   7.279 +\subsection{The big picture}
   7.280 +
   7.281 +Across the industry as a whole, when epDSLs become successful, there may be thousands of epDSLs, each
   7.282 +of which must be mapped onto hundreds of different hardware platforms.  That multiplicative effect must be reduced in order to make the epDSL approach economically viable.
   7.283  
   7.284  The first category of epDSL effort is creating the front-end translation of custom syntax into the base language. This is a one-time effort that does not repeat when new hardware is added. 
   7.285  
   7.286 -The effort that has to be expended on each platform is the runtime implementation, which includes hardware-specific low-level tuning, and the tuning of the domain construct implementation.
   7.287 -
   7.288 -Luckily, hardware platforms cluster into groups with similar performance-related features. This opens the door to an approach that can present a common abstraction for all platforms in a cluster.  Examples of clusters include:
   7.289 +The effort that has to be expended on each platform is the runtime implementation and toolchain optimizations.
   7.290 +Runtime implementation includes hardware-specific low-level tuning and tuning how work is mapped onto cores.
   7.291 +
   7.292 +This is where leveraging the proto-runtime approach
   7.293 +pays off. Hardware platforms cluster into groups with similar performance-related features.  Proto-runtime
   7.294 +presents a common abstraction for all hardware
   7.295 +platforms, but a portion of the interface supplies performance related
   7.296 +information specific to the hardware. This portion is  specialized for each
   7.297 +cluster. Examples of clusters include:
   7.298  
   7.299  \begin{itemize}
   7.300 -\item shared coherent memory multi-core single-chip machine
   7.301 -\item shared coherent memory multi-core multi-chip machine
   7.302 -\item independent address space coprocessor (GPU)
   7.303 -\item a network of nodes of the above categories
   7.304 -\item a machine with a hierarchy of sub-networks
   7.305 +\item single chip shared coherent memory
   7.306 +\item multi-chip shared coherent memory (NUMA)
   7.307 +\item coprocessor with independent address space (GPGPU)
   7.308 +\item a network among nodes of the above categories (distributed)
   7.309 +\item a hierarchy of sub-networks
   7.310  \end{itemize}
   7.311  
   7.312  
   7.313 @@ -262,9 +303,12 @@
   7.314  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   7.315  \section{Our Proposal} \label{sec:Proposal}
   7.316  
   7.317 -We propose addressing the runtime effort by defining a modularization of runtimes, as seen in Fig X.  The low-level hardware details are collected into one module, which presents a common interface. The language supplies
   7.318 -the top two modules, which plug in via the interface. The hardware specific module presents the same interface
   7.319 -for all platforms sharing similar performance related features.  This module only has to be implement once for a given platform, then reused by  the languages.  
   7.320 +We propose addressing the runtime effort portion of creating
   7.321 +an epDSL by defining a modularization of runtimes, as seen in Fig. \ref{fig:PR_three_pieces}.  The low-level hardware details are collected into one module, which presents a common interface, called the \textit{proto-runtime
   7.322 +instance}. The language supplies
   7.323 +the top two modules, which plug in via the interface. The hardware specific module  (proto-runtime instance) presents the same interface
   7.324 +for all platforms, with a specialization for each category
   7.325 +of platform sharing similar performance-related features.  The proto-runtime module only has to be implemented once for a given platform, and is then reused by all the languages.  
   7.326  
   7.327  \begin{figure}[ht]
   7.328    \centering
   7.329 @@ -274,16 +318,16 @@
   7.330  \end{figure}
   7.331  
   7.332  
   7.333 -Thus, a given language doesn't have to re-implement its runtime for every platform.  Instead, it has a much lower effort requirement, of implementing for each category.
   7.334 -
   7.335 -The language effort is further reduced because the language doesn't consider the low-level details of making the runtime itself run fast. It only has to consider the level of hardware feature that is exposed by the interface. This is a higher level of abstraction, which simplifies the task for the language implementer.
   7.336 -
   7.337 -One additional benefit is giving control to the language, to choose when and where it wishes work to execute.
   7.338 -This feature simplifies implementation of languages
   7.339 -that have features related to scheduling behavior.
   7.340 +Because of the modularization, a language has a much lower effort requirement: it implements its runtime modules once per category of hardware rather than once per platform.
   7.341 +
   7.342 +The higher level of abstraction also simplifies the task for the language implementer.
   7.343 +The language doesn't deal with the low-level details of making the runtime itself run fast; it only has to consider the level of hardware feature that is exposed by the interface. 
   7.344 +
   7.345 +One additional benefit is that the assignment module
   7.346 +gives control to the language, to choose when and where it wishes work to execute.
   7.347 +This  simplifies implementation of language  features related to scheduling behavior.
   7.348  It also enables the language implementor to use sophisticated
   7.349 -methods for choosing placement of virtual processors
   7.350 -(threads) and tasks, which can significantly impact
   7.351 +methods for choosing placement of work, which can significantly impact
   7.352  application performance.  
   7.353  
   7.354  In this paper, we present work that applies to coherent
   7.355 @@ -292,10 +336,10 @@
   7.356  
   7.357  \subsection{Breakdown of the modules} \label{subsec:Modules}
   7.358  
   7.359 -The language is broken into two parts, as seen in Fig
   7.360 -X. One is a thin wrapper library that
   7.361 -invokes the runtime and the other is a set of modules that are part of the runtime.
   7.362 -
   7.363 +The language is broken into two parts, as seen in Fig.
   7.364 +\ref{fig:langBreakdown}. One is a thin wrapper library that
   7.365 +invokes the runtime, and the other is a set of modules that are part of that invoked runtime. This set is called
   7.366 +the \textit{language plugin}, or just the plugin. 
   7.367  
   7.368  
   7.369  \begin{figure}[ht]
   7.370 @@ -308,26 +352,19 @@
   7.371    \label{fig:langBreakdown}
   7.372  \end{figure}
   7.373    
   7.374 -The runtime itself consists of three modules connected via
   7.375 -an interface, as was seen back in Fig X. One encapsulates the low-level hardware details, and presents an interface to the language modules. We call
   7.376 -this the \textit{proto-runtime}.
   7.377 -It's job is to enforce the interface that the language modules see.
   7.378 -
   7.379 -
   7.380 -The language has two modules, both of which are collected in what we call the \textit{language plugin}.  One module encodes the behavior of language
   7.381 -constructs, the other module provides logic for choosing which work to execute on
   7.382 -which hardware resource.
   7.383 -
   7.384 -A non-changing application executable is able to invoke hardware specific plugin code, which changes between machines, because the plugin collects the two language modules into a dynamic library. The library is implemented, compiled,  distributed and installed separately from  applications.  The application executable contains only symbols of plugin functions, and during the run those are dynamically linked to machine-specific implementations.
   7.385 -
   7.386 -
   7.387 -In order to provide such modularization, we rely upon a model for specifying synchronization constructs that we call the tie-point model. The low-level nature of a tie-point places them below the level of  constructs such as a mutex. Instead, a mutex is specified in terms
   7.388 +
   7.389 +
   7.390 +Thus, a non-changing application executable is able to invoke hardware-specific plugin code, which changes between machines. The plugin collects the two language modules into a dynamic library. The library is implemented, compiled, distributed, and installed separately from applications.  The application executable contains only symbols of plugin functions, and during the run those are dynamically linked to machine-specific implementations.
   7.391 +
   7.392 +
   7.393 +In order to provide such modularization, we rely upon a model for specifying synchronization constructs that we call the tie-point model. The low-level nature of tie-points places them below the level of constructs,
   7.394 +even a construct as simple as a mutex. Instead, a mutex is specified in terms
   7.395  of the primitives in the tie-point model. In turn,
   7.396 -the proto-runtime
   7.397 - implements the primitives of the tie-point model.
   7.398 +the tie-point primitives are implemented
   7.399 +by the proto-runtime.
   7.400  
   7.401   This places all parallel constructs on the same level in the software stack, be they complex like the AND-OR parallelism of Prolog, or the wild-card matching
   7.402 -channels in coordination languages,  or ultra-simple acquire and release mutex constructs. All are implemented in terms of the same tie-point primitives provided by the proto-runtime.
   7.403 +channels in coordination languages,  or ultra-simple acquire and release mutex constructs. All are implemented in terms of the same tie-point primitives provided by the proto-runtime instance.
   7.404  
   7.405  We have reached a point in the paper, now, where the order of explanation can take one of two paths: either
   7.406  start with the abstract model of tie-points and explain how this affects the modularization of the runtime, or start with implementation details and work upwards towards the abstract model of tie-points.  We have chosen to start with the abstract tie-point model, but the reader is invited to skip to the section after it, which starts with code examples and ties code details to the abstract tie-point model.   
   7.407 @@ -338,14 +375,14 @@
   7.408  
   7.409  
   7.410  \subsection{Timelines}
   7.411 -A tie-point relates timelines, so we talk a bit, first, about timelines. A timeline is the primitive in parallelism.  If you look at any parallel language, it involves a number of independent timelines. It then controls which timelines are actively progressing relative to the others.
   7.412 +A tie-point relates timelines, so we talk a bit, first, about timelines. A timeline is the common element in parallelism.  If you look at any parallel language, it involves a number of independent timelines. It then controls which timelines are actively progressing relative to the others.
   7.413  
   7.414  For example, take a thread library, which we consider
   7.415  a parallel language.  It provides a command to create a thread, where that thread represents an independent timeline. The library also provides the mutex acquire and release commands, which control which of those timelines advance relative to each other. When an acquire executes, it can cause the thread to block, which means the associated timeline suspends; it stops
   7.416  making forward progress. The release in a different thread clears the block, which resumes the timeline. That linkage between suspend and resume of different timelines is the control the language exerts over which timelines are actively progressing.
   7.417  
   7.418  To build up to tie-points, we look at the nature of points on
   7.419 -a single timeline, by reviewing mutex behavior in detail. We see the timeline shown in Fig \ref{fig:singleTimeline}.  Thread A, which is timeline A, tries to acquire the mutex, M,
   7.420 +a single timeline, by reviewing mutex behavior in detail. See the timeline shown in Fig \ref{fig:singleTimeline}.  Thread A, which is timeline A, tries to acquire the mutex, M,
   7.421  by executing the acquire command. Timeline A stops, at point 1.S, then something external to it happens, and the timeline starts again at point 1.R.  The gap between is not seen by the code executed within the thread.  Rather, from the code-execution viewpoint, the acquire command is a single command, and hence the gap between 1.S and 1.R collapses to a single point on the timeline.
   7.422  
   7.423  
   7.424 @@ -359,36 +396,74 @@
   7.425  \end{figure}
   7.426  
   7.427  
   7.428 -Now, a tie-point is seen as the linkage between such collapsed points on
   7.429 -two timelines. In Fig \ref{fig:dualTimeline}, timeline A is still there, suspends still at 1.S and resumes at 1.R.  However, now there is a second timeline, timeline B.  It executes the release command at point 2.S, which suspends timeline B, performs the behavior of the release command
   7.430 -inside the gap, then resumes timeline B at 2.R. The behavior of the release
   7.431 -command causes the end of suspend in the first timeline.  That causality ties the two collapsed points in the two timelines together.
   7.432 -
   7.433 + Fig. \ref{fig:dualTimeline} shows two timelines: timeline A executing acquire and timeline B executing release. The release still suspends its timeline, but
   7.434 +it quickly resumes again because it is not blocked.
   7.435 +The release also causes timeline A to resume. The
   7.436 +release on one timeline has thus caused the end of the acquire on the other, which makes
   7.437 +the two collapsed points become what we term \textit{tied together} into a \textit{tie-point}.
   7.438  
   7.439  \begin{figure}[ht]
   7.440    \centering
   7.441 -  \includegraphics[width = 2.8in, height = 1.35in]
   7.442 +  \includegraphics[width = 2.8in, height = 1.2in]
   7.443    {../figures/PR__timeline_dual.pdf}
   7.444 -  \caption{Two  timelines with a causal relationship.
   7.445 -Activity that takes place during the gap in timeline
   7.446 -B causes resume of timeline A. This ties point 2 on
   7.447 -timeline B to point 1 on timeline A.}
   7.448 +  \caption{Two  timelines with tied together ``collapsed''
   7.449 +points.
   7.450 +Point 1 on timeline A forms a tie-point with point
   7.451 +2 on timeline B.
   7.452 +It is hidden activity that takes place inside the gaps that
   7.453 +establishes a causal relationship that ties them together.}
   7.454    \label{fig:dualTimeline}
   7.455  \end{figure}
   7.456  
   7.457 -
   7.458 -
   7.459 -We call this connection between the collapsed suspensions a tie-point.  What it provides is a guarantee about visibility of events between the tied timelines. The
   7.460 -guarantee makes both agree on the order of events,\textit{
   7.461 -relative to the mutual tied point}. 
   7.462 -The guarantees  are what defines a tie-point. 
   7.463 -
   7.464 -Fig \ref{fig:tie-pointGuarantees} shows the ordering guarantees in terms of visibility of operations between
   7.465 -the timelines.  If these visibility constraints are
   7.466 -satisfied, then the timelines share a tie-point. Note that the ordering
   7.467 - guarantees are equivalent to the constraints on visibility of operations. Operations that execute  in
   7.468 -the first timeline before the tie-point must be visible
   7.469 -in the second after the tie point, and vice versa. Likewise, operations that execute in one timeline after the tie-point must not be visible in the other timeline before the tie-point. 
   7.470 +Fig. \ref{fig:dualTimelineWHidden} adds detail about
   7.471 +how the release goes about causing the end of the block
   7.472 +on the acquire. It reveals
   7.473 +a hidden timeline, which is what performs the behavior of the
   7.474 +acquire and release constructs.  As seen, acquire starts
   7.475 +with a suspend, which is accompanied by a communication
   7.476 +sent to the hidden timeline.  The hidden timeline then
   7.477 +checks whether the mutex is free, sees that it isn't,
   7.478 +and leaves timeline A suspended. Later, timeline
   7.479 +B performs release, which suspends it and sends a communication
   7.480 +to the same hidden timeline. The hidden timeline sees that timeline
   7.481 +A is waiting for the release and performs a special
   7.482 +control action that resumes timeline A, then repeats
   7.483 +the control action to resume timeline B.
   7.484 + It is inside the hidden timeline that the acquire
   7.485 +gets linked to the release, tying the constructs together.   
   7.486 +
   7.487 +
   7.488 +\begin{figure}[ht]
   7.489 +  \centering
   7.490 +  \includegraphics[width = 2.8in, height = 1.9in]
   7.491 +  {../figures/PR__timeline_dual_w_hidden.pdf}
   7.492 +  \caption{Two  timelines with tied together ``collapsed''
   7.493 +points  showing the detail of a hidden timeline that
   7.494 +performs the behavior that ties the points together.
   7.495 +Vertical dashed lines represent communication sent
   7.496 +as part of the suspend action, and the curvy arrows
   7.497 +represent special control that causes resume of the
   7.498 +target timelines. During the gaps in timelines A and
   7.499 +B, activity takes place in the hidden timeline, which
   7.500 +calculates that the timelines should be resumed, then
   7.501 +exercises control to make resume happen.}
   7.502 +  \label{fig:dualTimelineWHidden}
   7.503 +\end{figure}
   7.504 +
   7.505 +
   7.506 +
   7.507 +We show in \S\ref{sec:FormalTiePoint} that the pattern
   7.508 +of communications to and from the hidden timeline establishes
   7.509 +an ordering relationship between events before and
   7.510 +after the tied points. That implies a relation on
   7.511 +the visibility of events. 
   7.512 +
   7.513 +Fig \ref{fig:tie-pointGuarantees} shows the ordering relationship and the implied visibility of operations between
   7.514 +the timelines. Operations that execute  in
   7.515 +the first timeline before the tie-point are visible
   7.516 +in the second after the tie point, and vice versa. Likewise, operations that execute in one timeline after the tie-point are not  visible in the other timeline before the tie-point. Such an ordering satisfies
   7.517 +the requirements
   7.518 +of a synchronization construct. 
   7.519  
   7.520  
   7.521  
   7.522 @@ -397,31 +472,31 @@
   7.523    \includegraphics[width = 2.8in, height = 1.25in]
   7.524    {../figures/PR__timeline_tie_point_ordering.pdf}
   7.525    \caption{The
   7.526 -guarantees that a tie-point enforces. Shows which
   7.527 - operations performed on one timeline are visible to the other
   7.528 -timeline. These visibilities must be true for a tie-point.
   7.529 -Note that all events are divided into two groups, those
   7.530 -before the tied points versus those after the tied
   7.531 -points.  Both timelines see the same before group and
   7.532 -the same after group. }
   7.533 +visibility guarantees that result from a tie-point. Shows which
   7.534 + operations, such as writes,  performed on one timeline can be seen by the other
   7.535 +timeline. These visibilities are equivalent to establishing
   7.536 +an order between events before the tied points versus those after the tied
   7.537 +points.  Both timelines agree on what events are before
   7.538 +versus after the tied point.  }
   7.539    \label{fig:tie-pointGuarantees}
   7.540  \end{figure}
   7.541  
   7.542  
   7.543 -\subsection{Formal definition of tie-point}
   7.544 +\subsection{Formal definition of tie-point} \label{sec:FormalTiePoint}
   7.545  In a moment we will show how any and all synchronization constructs
   7.546  can be defined in terms of tie-points. Before getting
   7.547 -there, we provide a formal definition of tie-point,
   7.548 -which we will then use to show that a tie point
   7.549 -can satisfy the conditions of any synchronization
   7.550 +there, we must choose an (unavoidably arguable) definition of synchronization
   7.551 +construct. We then provide a formal definition of tie-point
   7.552 +and use it to show that a tie point
   7.553 +satisfies the conditions of any
   7.554 +such synchronization
   7.555  construct.
   7.556 -
   7.557 + 
   7.558  Our formalism defines timelines, communication between
   7.559  timelines, and suspend and resume of a timeline. It then shows a particular pattern, which is the characteristic pattern that defines a tie-point. We then show that when that characteristic pattern exists, then relations exist between timelines that have certain properties.
   7.560  We conclude by showing a few classical definitions
   7.561  of synchronization and show that those definitions
   7.562 -are upheld when a relation with the derived properties
   7.563 -exists among the timelines. Hence, those classical definitions can be satisfied via creation of a tie-point. 
   7.564 +are upheld when  the tie-point pattern is present. Hence, those classical definitions can be satisfied via creation of a tie-point. 
   7.565  
   7.566  \subsubsection{}
   7.567  
   7.568 @@ -438,7 +513,7 @@
   7.569  from any timeline that code executes in).  
   7.570  
   7.571  \item[event:] 
   7.572 -\(E =\{c_{0,t},c_{1,t}, ..\} \cup \{s_{\alpha ,t}\} \cup \{r_{\beta , t}\}
   7.573 +\(E =\{c_{0,t},c_{1,t}, ..\} \cup \{s_{n,\alpha ,t}\} \cup \{r_{n,\beta , t}\}
   7.574  \cup \{z_{\gamma ,t} \} \). There are four kinds of event
   7.575  that can happen on a timeline, namely $c$, a step of computation,
   7.576  which modifies the memory local to the timeline; $s$, a
   7.577 @@ -452,23 +527,26 @@
   7.578  $z\_s_{\gamma ,t}$ while resume is denoted $z\_r_{\gamma
   7.579  ,t}$ where $s$
   7.580  and $r$ are literal while $\gamma$ denotes the position
   7.581 -on the timeline and $t$ is the timeline the suspend
   7.582 -happens on. 
   7.583 +on the timeline and $t$ is the timeline that executes
   7.584 +the synchronization construct. 
   7.585  \item[communication:]
   7.586  \(C = \{s,r\}, s < r\).  A communication is a set of
   7.587  one send event from one timeline plus one or more receive events
   7.588  from different timelines, with the send
   7.589 -event ordered before the receive event(s), denoted $s_{n,t}\mapsto
   7.590 -r_{n,t}$ where $n$ distinguishes the communication
   7.591 -set and $t$ denotes the timeline the event is on.  A communication
   7.592 +event ordered before the receive event(s), denoted $s_{n,\alpha, t}\mapsto
   7.593 +r_{n,\beta,t}$ where $n$ distinguishes the communication
   7.594 +set, $\alpha$ and $\beta$ are the ordering upon the
   7.595 +timeline and $t$ denotes the timeline the event is on.  A communication
   7.596  orders events on one timeline relative to events on another.
   7.597 -However, the ordering is only between two points, in
   7.598 +However, the ordering is only between two points. In
   7.599  particular for two sends from timeline 1 to timeline
   7.600 -2, if \(s_{1,1} < s_{2.1}\) on timeline 1, then on
   7.601 -timeline 2, both \(r_{1,2} < r_{2,2}\) and \(r_{2,2} < r_{1,2}\) are valid. However, $s_{1,1} \mapsto r_{1,2}$
   7.602 -followed by $s_{2,2} \mapsto r_{2,1}$ where $r_{1,2}
   7.603 -< s_{2,2}$
   7.604 -  implies that $s_{1,1} < r_{2,1}$ always.  
   7.605 +2, if \(s_{1,\_,1} < s_{2,\_,1}\) on timeline 1, then on
   7.606 +timeline 2, both \(r_{1,\_,2} < r_{2,\_,2}\) and \(r_{2,\_,2} < r_{1,\_,2}\) are valid, where ``$\_$'' in the position
   7.607 +of the ordering integer represents a wild
   7.608 +card. However, $s_{1,\_,1} \mapsto r_{1,\_,2}$
   7.609 +followed by $s_{2,\_,2} \mapsto r_{2,\_,1}$ where $r_{1,\_,2}
   7.610 +< s_{2,\_,2}$
   7.611 +  implies that $s_{1,\_,1} < r_{2,\_,1}$ always.  
   7.612  
   7.613  \item[hidden timeline:] We define a special kind of  "hidden" timeline that is not
   7.614  seen by application code. It has an additional
   7.615 @@ -480,12 +558,13 @@
   7.616  event is on. Additionally, a suspend event on an application
   7.617  visible timeline implies a send from that timeline
   7.618  to a hidden timeline. Hence $z\_s_{\gamma,t} \Rightarrow
   7.619 -s_{\gamma,h}$  
   7.620 +s_{n,\gamma,t} \mapsto r_{n,\_,h}$  
   7.621  
   7.622  \item[tie-point:] Now, we define a tie-point as a set of two or more
   7.623  synchronization points from different timelines which
   7.624  are related by a particular pattern of communications.
   7.625 -As a result of the pattern, the set satisfies particular criteria. The pattern is that communications from the suspend synchronization events must converge on a common hidden timeline and that timeline must then emit a subsequent resume event for each of the suspended timelines. 
   7.626 +As a result of the pattern, the set satisfies particular criteria. The pattern is that communications from the suspend synchronization events must converge on a common hidden timeline and that timeline must then emit a subsequent resume event for each of the suspended timelines,
   7.627 +as shown back in Fig. \ref{fig:dualTimelineWHidden}. 
   7.628  
   7.629  \end{description}
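Using the events and communications defined above, the characteristic pattern for a two-timeline tie-point can be spelled out explicitly. This is only a restatement in the notation already introduced, for timelines $A$ and $B$ converging on hidden timeline $h$:

```latex
\begin{align*}
z\_s_{\alpha,A} &\Rightarrow s_{1,\alpha,A} \mapsto r_{1,\_,h}, &
z\_s_{\beta,B}  &\Rightarrow s_{2,\beta,B}  \mapsto r_{2,\_,h},\\
s_{3,\_,h} &\mapsto r_{3,\_,A} \text{ causing } z\_r_{\alpha,A}, &
s_{4,\_,h} &\mapsto r_{4,\_,B} \text{ causing } z\_r_{\beta,B}.
\end{align*}
```

That is, each suspend implies a send that converges on the common hidden timeline $h$, and $h$ subsequently emits a resume for each suspended timeline.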
   7.630  
   7.631 @@ -518,7 +597,7 @@
   7.632  to be a synchronization construct.  It is only in the
   7.633  hardware that a synchronization construct is assembled
   7.634  from pieces.  We further claim that the hardware implements
   7.635 -according to the pattern described in our formal definition.
   7.636 +according to the tie-point pattern described in our formal definition.
   7.637  
   7.638  What we consider to be a tie-point is any point that
   7.639  has this pattern, independent of the semantics added.
   7.640 @@ -579,6 +658,56 @@
   7.641  sync constructs..  but they can't be used in a distributed
   7.642  memory system to make distributed memory things.
   7.643  
   7.644 +Unless use communication to implement shared memory
   7.645 +on top of distributed memory.. things like that.. It's
   7.646 +a question of what's fair game in the comparison --
   7.647 +proto-runtime the behavior is in the hidden timeline,
   7.648 +which is "inside" the construct, in a sense..  but using sync constructs to implement others, you lose
   7.649 +that "inside" notion..  it just becomes application
   7.650 +code that uses sync constructs..  with the app code
   7.651 +running in an application timeline..  so..  need to
   7.652 +get at that notion of animator, which has the "hidden"
   7.653 +timeline, versus function call.. 
   7.654 +
   7.655 +What about this.. it's a matter of constructing from
   7.656 +equally powerful versus from less powerful.. mmmm want
   7.657 +that notion of animator in there..  and want to get
   7.658 +at when an arrangement qualifies as having "switched
   7.659 +over to the animator" -- does implementing mutex from
   7.660 +just memory ops qualify as switching over to the animator
   7.661 +just by entering the code that implements the mutex?
   7.662 +Say, place that code in-line in the application code
   7.663 +everywhere it's used..
   7.664 +
   7.665 +Hmmmm.. could use the relation model to show that the
   7.666 +pure memory based implementation contains a tie-point,
   7.667 +which is how the more-primitive operations are able
   7.668 +to construct the more powerful mutex.     That might
   7.669 +be a more fruitful, easier to gain acceptance, approach..
   7.670 +show that things that have no time-related semantics,
   7.671 +only simple one-way communication, are able to construct
   7.672 +the time-related semantics.. and it is the presence
   7.673 +of the tie-point convergence pattern that does it.
   7.674 +
   7.675 +In fact, might take the Dijkstra original mutex from
   7.676 +just memory implementation and show the tie-point pattern
   7.677 +within it..  then also show the tie-point pattern within lock-free implementations..  the point being that all
   7.678 +you have to show is the presence of the tie-point pattern,
   7.679 +in order to prove synchronization properties..  where
   7.680 +"synchronization properties" is the existence of the ordering relation.. which is equivalent to agreement of before vs after.. which is equivalent to the visibility
   7.681 +relation, which is what a programmer cares about..
   7.682 +the visibility is what a programmer requires in a "mutual
   7.683 +exclusion".  
   7.684 +
   7.685 +These visibility guarantees are how it can be guaranteed that
   7.686 +those that are still "before" the mutex cannot influence
   7.687 +the one "after" the mutex, which is inside the critical section.  And also require vice versa,
   7.688 +that the one "after" the mutex, inside the critical
   7.689 +section, cannot take actions
   7.690 +that influence any "before" it..  similarly at the
   7.691 +end of the critical section, need the same isolation.
   7.692 +  
   7.693 +
   7.694  Let's see..  the relation model said that something
   7.695  with synchronization constraints can be created from
   7.696  just communication plus hidden timeline..  as long
   7.697 @@ -624,17 +753,18 @@
   7.698  
   7.699  The other part of the story is: the proto-runtime cannot
   7.700  be used by itself.  It requires addition before it
   7.701 -can be used.  That is, have to add the M->M, to arrive
   7.702 -at the TxM->M, then can use the TxM->M..  but can't
   7.703 -use just the Tx by itself -- that's non-sensical. 
   7.704 -So, provides a (M->M, f) that is used to get the TxM->M,
   7.705 -but can't use the f inside an application.. it doesn't
   7.706 +can be used.  That is, have to add the $M\mapsto M$, to arrive
   7.707 +at the $T\times M\mapsto M$, then can use the $T\times
   7.708 +M\mapsto M$..  but can't
   7.709 +use just the $T\times$ by itself; that's nonsensical. 
   7.710 +So, provides a $(M\mapsto M, f)$ that is used to get the $T\times M\mapsto M$,
   7.711 +but can't use the $f$ inside an application.. it doesn't
   7.712  do anything other than add the $T\times$..  so it doesn't
   7.713  accomplish any steps of computation, nor does it provide
   7.714 -Tx to any application code..  the (M->M, f) is outside
   7.715 +$T\times$ to any application code..  the $(M\mapsto M, f)$ is outside
   7.716  of any language -- that's what CREATES a language.
   7.717  
   7.718 -*****Can't define (M->M, f) as part of its own language,
   7.719 +*****Can't define $(M\mapsto M, f)$ as part of its own language,
   7.720  because it doesn't do anything.  No computation is
   7.721  performed by it. ****  (so, what's the definition of
   7.722  computation, then?)
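The construction sketched above can be restated compactly. This is our notation, a sketch rather than a formal definition:

```latex
% Sketch, our notation: m is the added meaning-to-meaning behavior, and
% f lifts it to a full language by supplying the T\times component:
\[
  f : (M \mapsto M) \;\longrightarrow\; (T \times M \mapsto M),
  \qquad L = f(m)
\]
% Neither f nor T\times alone performs any computation; only the
% composed L : T \times M \mapsto M is usable, which is why the pair
% (M \mapsto M, f) necessarily sits outside the language it creates.
```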
   7.723 @@ -680,7 +810,7 @@
   7.724   A sync construct is a full tie-point.  
   7.725  
   7.726  
   7.727 -========================================================
   7.728 +================================================
   7.729  
   7.730  
   7.731  \subsubsection{Lifeline, Timeline, and Projection}
   7.732 @@ -1179,7 +1309,7 @@
   7.733  Handlers.   Gaps in the timelines are caused by suspension,
   7.734  which is effected by primitives within the proto-runtime
   7.735  code module.}
   7.736 -  \label{fig:langBreakdown}
   7.737 +  \label{fig:physTimeSeq}
   7.738  \end{figure*}
   7.739  
   7.740  
   7.741 @@ -1545,7 +1675,7 @@
   7.742  
   7.743  
   7.744  \subsubsection{Vthread Versus Highly Tuned Posix Threads}
   7.745 -
   7.746 +\label{sec:VthreadVsPthread}
   7.747 Measurements indicate that the proto-runtime approach has far lower overhead than even the current highly tuned Linux thread implementation.  We also discuss why equivalent user-level M-to-N thread packages haven't been pursued, leaving no viable user-level libraries to compare against.  
   7.748  \subsubsection{VCilk Versus Cilk 5.4}
   7.749  In \S we give numbers that indicate that the proto-runtime approach is also competitive with Cilk
   7.750 @@ -1559,22 +1689,23 @@
   7.751  %%
   7.752  %%%%%%%%%%%%%%%%%%%%%%%%
   7.753  \subsection{Development Time Measurements}\label{subsec:ImplTimeMeas}
   7.754 -Here we summarize the time to develop each of the eDSLs and copy-cat languages created so far. As a control, we estimate how long the equivalent functionality required, using the traditional approach, based on anecdotal evidence.
   7.755 -
   7.756 -Summarized in Table \ref{tabPersonHoursLang}, we measured the time we spent to design, code, and get an initial version working for each of the languages we created.  The results are shown in the same order we created them, with SSR the first. As we gained experience,  design and coding became more efficient. Not shown is the 7 hours required to take the send-receive code from SSR and adapt it to work with tasks in VSs.  In addition, 11 hours was spent importing the DKU constructs into VSs.  These are hours spent at the keyboard or with pen and paper, and don't include think time during other activities in the day.
   7.757 +Here we summarize the time to develop each of the epDSLs and copy-cat languages created so far. As a control, we estimate, based on anecdotal evidence, the time required to create the equivalent functionality using the traditional approach.
   7.758 +
   7.759 +Table \ref{tabPersonHoursLang} summarizes measurements
   7.760 +of the time we spent to design, code, and debug an initial working version of each of the languages we created.  The results are shown in the same order we created them, with SSR first. As we gained experience, design and coding became more efficient.  These are hours spent at the keyboard or with pen and paper, and don't include think time during other activities in the day.
   7.761   
   7.762  
   7.763  \begin{centering}
   7.764 -\begin{tabular}{|l|r|r|r|r|r|r|}
   7.765 -  \cline{2-7}
   7.766 -  \multicolumn{1}{r|}{} & SSR & Vthread & VCilk & HWSim & VOMP & VSs\\
   7.767 -  \cline{2-7}
   7.768 +\begin{tabular}{|l|r|r|r|r|r|r|r|}
   7.769 +  \cline{2-8}
   7.770 +  \multicolumn{1}{r|}{} & SSR & Vthread & VCilk & HWSim & VOMP & VSs & Reo\\
   7.771 +  \cline{2-8}
   7.772    \noalign{\vskip2pt}
   7.773    \hline
   7.774 -  Design & 19 & 6 & 3 & 52 & 18& 6\\
   7.775 -  Code & 13 & 3 & 3& 32 & 9& 12\\
   7.776 -  Test & 7 & 2 & 2& 12 & 8& 5\\
   7.777 -  L.O.C. & 470 & 290 & 310& 3000 & 690 & 780\\
   7.778 +  Design & 19 & 6 & 3 & 52 & 18& 6 & 14\\
   7.779 +  Code & 13 & 3 & 3& 32 & 9& 12 & 18\\
   7.780 +  Test & 7 & 2 & 2& 12 & 8& 5 & 10\\
   7.781 +  L.O.C. & 470 & 290 & 310& 3000 & 690 & 780 & 920\\
   7.782    \hline
   7.783  \end{tabular}
   7.784  \caption
   7.785 @@ -1583,8 +1714,8 @@
   7.786  \end{centering}
   7.787  \label{tabPersonHoursLang}
   7.788  
   7.789 -\subsubsection{Comparison of Design Approaches}
   7.790 -We give the bigger picture of the difference in design methods between traditional approaches and the proto-runtime implementations, discussing OpenMP versus VOMP, Cilk 5.4 vs VCilk, pthread vs Vthread, and OMPSs vs VSs.  These discussions attempt to give the two design philosophies and paint a picture of the development process in the two competing approaches.  The goal is to illustrate how the proto-runtime approach maintains many of the language features, through its centralized services, while significantly reducing implementation time, through reuse of the services, elimination of concurrency concerns in design and debugging, and in the simplifications in design and implementation caused by the clean modularization of the proto-runtime approach, and the regularization of implementation from one language to another.
   7.791 +%\subsubsection{Comparison of Design Approaches}
   7.792 +%We give the bigger picture of the difference in  approach for each language, between the proto-runtime implementation and the distributed implementation.  The goal is to illustrate how the proto-runtime  centralized services, while significantly reducing implementation time, through reuse of the services, elimination of concurrency concerns in design and debugging, and in the simplifications in design and implementation caused by the clean modularization of the proto-runtime approach, and the regularization of implementation from one language to another.
   7.793  
   7.794  
   7.795  %%%%%%%%%%%%%%%%%%%%%%%%
   7.796 @@ -1592,37 +1723,43 @@
   7.797  %%%%%%%%%%%%%%%%%%%%%%%%
   7.798  \section{Related Work} \label{sec:Related}
   7.799  
   7.800 -With the full understanding of the proto-runtime approach in hand, we discuss  how it compares to other approaches for implementing the runtimes of domain specific languages.  The criteria are: level of effort to implement the runtime, effort to port the runtime, runtime performance, and support for application performance. The main alternative implementation approaches are: posix threads, user-level threads, TBB, modifying libGomp, and using hardware primitives to make a custom runtime.
   7.801 -
   7.802 -We first talk about each of these approaches, then summarize the conclusions in Table \ref{tab:CriteriaVsApproach}.
   7.803 -
   7.804 -The first three methods involve building the DSL runtime on top of OS threads, user threads, or TBB, all of which are languages in their own right. So the DSL runtime runs on top of the runtime for that lower-level language.  This places control of work placement inside the lower-level runtime, blocking the DSL runtime, which hurts application-code performance, due to inability to use data locality. In addition, OS threads have operating system overhead and OS-imposed fairness requirements, which keeps runtime performance poor.
   7.805 -
   7.806 -All three also force the DSL implementation to manage concurrency explicitly, using language primitives such as locks.  TBB may have a slight advantage due to its task-scheduling commands, but only for task-based languages. Hence, implementation effort is poor for these approaches.  
   7.807 -
   7.808 -For the same reason, for these three, the runtime code needs to be rewritten and tuned for each hardware platform for each language, or else some form of hardware-abstraction placed into the runtime.  But putting in a hardware abstraction is essentially an alternative way of implementing half of the proto-runtime approach, but without the centralization, reuse, and modularization benefits.
   7.809 -
   7.810 -Many language researchers use libGomp (based on informal discussions) because of its very simple structure, which makes it relatively easy to modify, especially for simple languages. However, it provides no services such as debugging or performance tuning, and it has no modularization or reuse across languages benefits.  As the price of the simplicity, performance suffers, as seen in the experiments [].  Also, re-writes of the DSL runtime are required for each platform in order to tune it to hardware characteristics. However, because the runtime is directly modified, the language gains control over placement of work, enabling good application performance.
   7.811 -
   7.812 -Lastly, we consider the alternative of writing a custom runtime from scratch, using hardware primitives such as the Compare And Swap (CAS) instruction, or similar atomic read-modify-write instructions.  This approach requires the highest degree of implementation effort, and the worst portability across hardware.  However, if sufficient effort is expended on tuning, it can achieve the best runtime performance and equal the best performance of application code. So far, the gap has proven small between highly tuned language-specific custom runtime performance and that of our proto-runtime, but we only have the CILK implementation as a comparison point. 
   7.813 - 
   7.814 -Putting this all together, Table \ref{tab:CriteriaVsApproach} shows that the proto-runtime approach is the only one that scores high in all the mesures. It makes initial language implementation fast, as well as reduces porting effort, while keeping runtime performance high and enabling high application performance. 
   7.815 +We discuss how the proto-runtime approach compares to other approaches to implementing the runtimes of domain specific languages.  The criteria for comparison are: level of effort to implement the runtime, effort to port the runtime, runtime performance, and support for application performance. The main alternative implementation approaches are: POSIX threads, TBB, modifying libGomp, and using hardware primitives to make a custom runtime.
   7.816 +
   7.817 +We summarize the conclusions in Table \ref{tab:CriteriaVsApproach}.
   7.818 +
   7.819  
   7.820  \begin{center}
   7.821 +\caption{How well each approach scores on the measures important to implementors of runtimes for DSLs. Implementation approaches are on the left; measures are at the top. Each cell gives the approach's score on
   7.822 +that measure. One plus is the lowest score, indicating the implementation approach is undesirable; five plusses indicates the highest desirability.  The reasons for the scores are discussed in the text. } \label{tab:CriteriaVsApproach}
   7.823 +
   7.824  \begin{tabular}{|c|c|c|c|c|}\hline
   7.825  Runtime Creation  & \textbf{impl.}& \textbf{porting} & \textbf{runtime} & \textbf{application} \\
   7.826  \textbf{} & \textbf{ease} & \textbf{ease} & \textbf{perf.} & \textbf{perf.}\\\hline
   7.827  \textbf{OS Threads} & ++ & ++ & + & + \\\hline
   7.828 -\textbf{User Threads} & ++& ++ & ++ & + \\\hline
   7.829 +%\textbf{User Threads} & ++& ++ & ++ & + \\\hline
   7.830  \textbf{TBB} & ++ & ++ & ++ & + \\\hline
   7.831  \textbf{libGomp} & +++ & ++ & +++ & ++++ \\\hline
   7.832  \textbf{HW primitives} & + & + & +++++ & +++++ \\\hline
   7.833  \textbf{Proto-runtime} & +++++ & +++++ & ++++ & +++++\\\hline
   7.834  \end{tabular}
   7.835  \end{center}
   7.836 -\caption{The table shows how well each approach scores in each measure important to the implementor of a runtime for a DSL. On the left are the approaches that can be used to write the runtime. At the top are the measures an implementor may care about. For all measures, one plus is the lowest score, indicating the implementation approach is undesirable, 5 indicates the highest desirability.  The scores are based on reasons  discussed in the text. }
   7.837 -\label{tab:CriteriaVsApproach}
   7.838 -
   7.839 +
   7.840 +
   7.841 +
   7.842 +The first two methods have poor runtime and application
   7.843 +performance. They involve building the DSL runtime on top of OS threads or TBB, both of which have runtimes in their own right. So the DSL runtime runs on top of the lower-level runtime.  This places control of work placement inside the lower-level runtime, out of the DSL runtime's control, which hurts application-code performance due to the inability to exploit data locality. In addition, OS threads have operating system overhead and OS-imposed fairness requirements, which keep runtime performance poor, as seen in Section \ref{sec:VthreadVsPthread}.
   7.844 +
   7.845 +Both also force the DSL implementation to manage concurrency explicitly, using lower-level runtime constructs such as locks.  TBB may have a slight advantage due to its task-scheduling commands, but only for task-based languages. Hence, both approaches score poorly on implementation effort.  
   7.846 +
   7.847 +For the same reason, porting is poor for these two
   7.848 +approaches. The DSL's runtime code needs to be rewritten and tuned for each hardware platform, or else some form of hardware abstraction must be placed into the runtime.  But putting in a hardware abstraction is essentially an alternative way of implementing half of the proto-runtime approach, without the centralization, reuse, and modularization benefits.
   7.849 +
   7.850 +Moving on to libGomp: some language researchers use it (based on informal discussions) because of its very simple structure, which makes it relatively easy to modify, especially for simple languages. However, it provides no services such as debugging or performance tuning, and it has no modularization or reuse-across-languages benefits.  As the price of the simplicity, performance suffers, as seen in the experiments [].  Also, rewrites of the DSL runtime are required for each platform in order to tune it to hardware characteristics. However, because the runtime is directly modified, the language gains control over placement of work, enabling good application performance if the extra
   7.851 +effort is expended to take advantage of it.
   7.852 +
   7.853 +Lastly, we consider the alternative of writing a custom runtime from scratch, using hardware primitives such as the Compare And Swap (CAS) instruction, or similar atomic read-modify-write instructions.  This approach requires the highest degree of implementation effort, and has the worst portability across hardware.  However, if sufficient effort is expended on tuning, it can achieve the best runtime performance and equal the best performance of application code. So far, the gap between highly tuned language-specific custom runtime performance and that of our proto-runtime has proven small, but we only have the Cilk implementation as a comparison point. 
   7.854 + 
   7.855 +Putting this all together, Table \ref{tab:CriteriaVsApproach} shows that the proto-runtime approach is the only one that scores high in all of the measures. It makes initial language implementation fast, as well as reduces porting effort, while keeping runtime performance high and enabling high application performance. 
   7.856  
   7.857  
   7.858  
   7.859 @@ -1656,6 +1793,7 @@
   7.860  \end{itemize}
   7.861  
   7.862  
   7.863 +\end{document}
   7.864  =============================================
   7.865  ==
   7.866  ==
     8.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     8.2 +++ b/0__Papers/PRT/PRT__intro_plus_eco_contrast/helpers/07_F_26__The_Questions__blank.txt	Tue Sep 17 06:30:06 2013 -0700
     8.3 @@ -0,0 +1,100 @@
     8.4 +
     8.5 +
     8.6 +1) What are the problems the authors are trying to solve? 
     8.7 +    When done, for each problem, how does one decide the value of a proposed solution?  Suggest a priority domain for deciding whether to use a proposed solution.
     8.8 +
     8.9 +The problem is 
    8.10 +
    8.11 +A priority domain for deciding the value of some proposed solution to this problem is
    8.12 +
    8.13 +The value of this solution is determined by
    8.14 +
    8.15 +
    8.16 +
    8.17 +2) What "things" does the proposed solution to this problem enable?
     8.18 +     What benefit to the reader is bought by each "thing", and what about the "thing" gives the benefit.
    8.19 +     What details are unique about the proposed solution that enables the thing that gives benefit?
    8.20 +     How does that uniqueness enable or achieve the thing?
    8.21 +
    8.22 +It enables
    8.23 +
    8.24 +The benefit to me is
    8.25 +
     8.26 +Unique details of the solution that enable the thing that gives benefit are 
    8.27 +
    8.28 +The uniqueness enables the thing that gives benefit by
    8.29 +
    8.30 +
    8.31 +
    8.32 +3) What are the fundamentals underlying the problem?  
    8.33 +     What makes this problem hard? 
    8.34 +     What are the basic elements and forces of the problem that the proposed solution has to be in terms of, avoid, use to advantage? ie: gravity, invariant relationships, market forces, human capacity (avg level of real programmers, hubris, legacy is held onto, barriers to adoption), and so on
    8.35 +How does the proposed solution work within/relate to/address/take advantage of/deal with the fundamentals underlying the problem?
    8.36 +
    8.37 +The fundamentals are
    8.38 +
    8.39 +The hard part is
    8.40 +
    8.41 +The basic elements are
    8.42 +
    8.43 +The proposed solution
    8.44 +
    8.45 +
    8.46 +
    8.47 +4) What are other approaches and conventional wisdom to solving these problems?
    8.48 +    What benefits enabled by the proposed solution are not enabled by other work, and vice versa?
    8.49 +    How does each approach address something the others miss?
    8.50 +    Try to suggest groupings or categories for the various approaches.  
    8.51 +    Try to suggest ways multiple approaches may be combined to get more pros with fewer cons.
    8.52 +
    8.53 +Other approaches are
    8.54 +
     8.55 +A benefit enabled by the proposed solution that is not enabled by other work is
    8.56 +
    8.57 +Categories:
    8.58 +
    8.59 +Combining:
    8.60 +
    8.61 +
    8.62 +
    8.63 +5) What is/are the unique main "things" that enable what the proposed solution does?
    8.64 +    Sketch the details of each of these "things".  
    8.65 +    Did you detect any drawbacks, not stated in the paper, from the details?
    8.66 +    Did you see any really cool techniques?
    8.67 +
    8.68 +Unique main "things" are
    8.69 +
    8.70 +Drawbacks from details:
    8.71 +
    8.72 +Idea of
    8.73 +
    8.74 +
    8.75 +
    8.76 +6) What aspects of the implementation/proof/design need results given in order to convince you that the proposed solution delivers the stated benefits?
    8.77 +
    8.78 +They have to show
    8.79 +
    8.80 +
    8.81 +
    8.82 +7) What results did they show?
    8.83 +       Did they show results in all the needed aspects (which were left out)?
    8.84 +       Were the testing method and results shown good enough to convince you?
    8.85 +       Did you detect any cons, not stated in the paper, from the results?
    8.86 +
    8.87 +They showed 
    8.88 +
    8.89 +Con..  
    8.90 +
    8.91 +
    8.92 +
    8.93 +8) How do you think this work may provide some value to you in your future research?
    8.94 +
     8.95 +The work may provide value for me
    8.96 +
    8.97 +
    8.98 +
    8.99 +3 or more comments/questions:  (pick out the most important things to you from the discussion you gave above, or add things that were not brought out by the above questions.  I am asking for these as things to bring up during class).
   8.100 +
   8.101 +1)
   8.102 +
   8.103 + 
   8.104 \ No newline at end of file
     9.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     9.2 +++ b/0__Papers/PRT/PRT__intro_plus_eco_contrast/helpers/bib_for_papers.bib	Tue Sep 17 06:30:06 2013 -0700
     9.3 @@ -0,0 +1,1257 @@
     9.4 +
     9.5 +
     9.6 +
     9.7 +""
     9.8 +@Article{,
     9.9 +  author =       {},
    9.10 +  title =        {},
    9.11 +  journal =      {},
    9.12 +  volume =       {},
    9.13 +  number =       {},
    9.14 +  year =         {},
    9.15 +  pages =        {}
    9.16 +}
    9.17 +
    9.18 +
    9.19 +
    9.20 +""
    9.21 +@Book{,
    9.22 +  author = 	     {},
    9.23 +  title = 	     {},
    9.24 +  publisher =    {},
    9.25 +  year =         {},
    9.26 +  pages =        {}
    9.27 +}
    9.28 +
    9.29 +
    9.30 +
    9.31 +""
    9.32 +@misc{,
    9.33 +  author =       {},
    9.34 +  title =        {},
    9.35 +  url =          {}
    9.36 +}
    9.37 +
    9.38 +
    9.39 +"Lamport paper with clock sync"
    9.40 +@article{Lamport78,
    9.41 + author = {Lamport, Leslie},
    9.42 + title = {Time, clocks, and the ordering of events in a distributed system},
    9.43 + journal = {Commun. ACM},
    9.44 + volume = {21},
    9.45 + issue = {7},
    9.46 + year = {1978},
    9.47 + pages = {558--565},
    9.48 + }
    9.49 +
    9.50 +"Lamport paper with mutex lock algorithm"
    9.51 +@article{Lamport87,
    9.52 + author = {Lamport, Leslie},
    9.53 + title = {A fast mutual exclusion algorithm},
    9.54 + journal = {ACM Trans. Comput. Syst.},
    9.55 + volume = {5},
    9.56 + issue = {1},
    9.57 + year = {1987},
    9.58 + pages = {1--11}
    9.59 +}
    9.60 +
    9.61 +"Dijkstra semaphore definition paper"
    9.62 +@inproceedings{Dijkstra67,
    9.63 + author = {Dijkstra, Edsger W.},
    9.64 + title = {The structure of the "{THE}"-multiprogramming system},
    9.65 + booktitle = {Proceedings of the first ACM symposium on Operating System Principles},
    9.66 + series = {SOSP '67},
    9.67 + year = {1967},
    9.68 + pages = {10.1--10.6}
    9.69 + }
    9.70 +
    9.71 +"Original coroutine paper"
    9.72 +@article{Conway63,
    9.73 + author = {Conway, Melvin E.},
    9.74 + title = {Design of a separable transition-diagram compiler},
    9.75 + journal = {Commun. ACM},
    9.76 + volume = {6},
    9.77 + issue = {7},
    9.78 + year = {1963},
    9.79 + pages = {396--408}
    9.80 +}
    9.81 +
    9.82 +"Component model book Leavens G, Sitaraman M(eds.). Foundations of Component-Based Systems. Cambridge University Press: Cambridge, 2000"
    9.83 +@Book{ComponentModel00,
    9.84 +  author = 	     {G Leavens and M Sitaraman (eds)},
    9.85 +  title = 	     {Foundations of Component-Based Systems},
    9.86 +  publisher =    {Cambridge University Press},
    9.87 +  year =         {2000}
    9.88 +}
    9.89 +
    9.90 +
    9.91 +"Hewitt Actors Ref on ArXiv"
    9.92 +@misc{Hewitt10,
    9.93 +  author =       {Carl Hewitt},
    9.94 +  title =        {Actor Model of Computation},
    9.95 +  year =         {2010},
    9.96 +  note =          {http://arxiv.org/abs/1008.1459}
    9.97 +}
    9.98 +
    9.99 +"Actors paper -- AGHA has a 1985 tech report looks like it introduces Actors as an execution model..?"
   9.100 +@article{Actors97,
   9.101 +author = {Agha,G. and Mason,I. and Smith,S. and Talcott,C.},
   9.102 +title = {A foundation for actor computation},
   9.103 +journal = {Journal of Functional Programming},
   9.104 +volume = {7},
   9.105 +number = {01},
   9.106 +pages = {1-72},
   9.107 +year = {1997},
   9.108 +}
   9.109 +
   9.110 +"Scheduler Activations: M onto N thread technique"
   9.111 +@article{SchedActivations,
   9.112 + author = {Anderson, Thomas E. and Bershad, Brian N. and Lazowska, Edward D. and Levy, Henry M.},
   9.113 + title = {Scheduler activations: effective kernel support for the user-level management of parallelism},
   9.114 + journal = {ACM Trans. Comput. Syst.},
   9.115 + volume = {10},
   9.116 + issue = {1},
   9.117 + month = {February},
   9.118 + year = {1992},
   9.119 + pages = {53--79}
   9.120 +} 
   9.121 +
   9.122 +"BOM in Manticore project: functional language for scheduling and concurrency"
   9.123 +@inproceedings{BOMinManticore,
   9.124 + author = {Fluet, Matthew and Rainey, Mike and Reppy, John and Shaw, Adam and Xiao, Yingqi},
   9.125 + title = {Manticore: a heterogeneous parallel language},
   9.126 + booktitle = {Proceedings of the 2007 workshop on Declarative aspects of multicore programming},
   9.127 + series = {DAMP '07},
   9.128 + year = {2007},
   9.129 + pages = {37--44},
   9.130 + numpages = {8}
   9.131 +} 
   9.132 +
   9.133 +
   9.134 +//=====================================
   9.135 +"Gain from Chaos tech report"
   9.136 +@techreport
   9.137 + {Halle92,
   9.138 +    Author = {Halle, K.S. and Chua, Leon O. and Anishchenko, V.S. and Safonova, M.A.},
   9.139 +    Title = {Signal Amplification via Chaos: Experimental Evidence},
   9.140 +    Institution = {EECS Department, University of California, Berkeley},
   9.141 +    Year = {1992},
   9.142 +    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/1992/2223.html},
   9.143 +    Number = {UCB/ERL M92/130}
   9.144 +}
   9.145 +
   9.146 +
   9.147 +Reprinted in:
   9.148 +Madan, R. N. (1993) Chua’s Circuit : A Paradigm for Chaos, World Scientific, Singapore.
   9.149 +"Signal Amplification via Chaos: Experimental Evidence"
   9.150 +K.S. Halle, Leon O. Chua, V.S. Anishchenko and M.A. Safonova
   9.151 +pgs 290-308
   9.152 +
   9.153 +
   9.154 +"Spread Spectrum Communication Through Modulation of Chaos"
   9.155 +Halle K.S., Wu C.W., Itoh M., Chua L.O. Spread Spectrum Communication Through Modulation of Chaos. Int. J. of Bifur. and Chaos, (3):469–477. 1993.
   9.157 +
   9.158 +
   9.159 +"Experimental Demonstration of Secure Communications Via Chaotic Synchronization"
   9.160 +Kocarev V, Halle K.S., Eckert K., Chua L.O., Parlitz V. Experimental Demonstration of Secure Communications Via Chaotic Synchronization. Int. J. Bifur. and Chaos, (2):709 713. 1992.
   9.161 +
   9.162 +
   9.163 +//==========================================
   9.164 +
   9.165 +"BLIS 2010 HotPar: Leveraging Semantics Attached to Function Calls to Isolate Applications from Hardware"
   9.166 +@inproceedings
   9.167 + {BLISInHotPar,
   9.168 +    author =    {Sean Halle and Albert Cohen},
   9.169 +    booktitle = {HOTPAR '10: USENIX Workshop on Hot Topics in Parallelism},
   9.170 +    month =     {June},
   9.171 +    title =     {Leveraging Semantics Attached to Function Calls to Isolate Applications from Hardware},
   9.172 +    year =      {2010}
   9.173 + }
   9.174 +
   9.175 +"2011 HotPar: "
   9.176 +@inproceedings
   9.177 + {HotPar11,
   9.178 +    author =    {Sean Halle and Albert Cohen},
   9.179 +    booktitle = {HOTPAR '11: USENIX Workshop on Hot Topics in Parallelism},
   9.180 +    month =     {May},
   9.181 +    title =     {},
   9.182 +    year =      {2011}
   9.183 + }
   9.184 +
   9.185 +"VMS in LCPC 2011"
   9.186 +@article{VMSLCPC,
   9.187 +  author = {Sean Halle and Albert Cohen},
   9.188 +  title = {A Mutable Hardware Abstraction to Replace Threads},
   9.189 +  journal = {24th International Workshop on Languages and Compilers for Parallel Languages (LCPC11)},
   9.190 +  year = {2011} 
   9.191 +}
   9.192 +
   9.193 +
   9.194 +"A Framework to Support Research on Portable High Performance Parallelism"
   9.195 +@misc{FrameworkTechRep,
   9.196 +  Author =       {Halle, Sean and Nadezhkin, Dmitry and Cohen, Albert},
   9.197 +  Note =         {http://www.soe.ucsc.edu/share/technical-reports/2010/ucsc-soe-10-02.pdf},
   9.198 +  Title =        {A Framework to Support Research on Portable High Performance Parallelism},
   9.199 +  Year = 2010
   9.200 +}
   9.201 +
   9.202 +"DKU Pattern for Performance Portable Parallel Software"
   9.203 +@misc{DKUTechRep,
   9.204 +  Author =       {Halle, Sean and Cohen, Albert},
   9.205 +  Note =         {http://www.soe.ucsc.edu/share/technical-reports/2009/ucsc-soe-09-06.pdf},
   9.206 +  Title =        {DKU Pattern for Performance Portable Parallel Software},
   9.207 +  Year = 2009
   9.208 +}
   9.209 +
   9.210 +"An Extensible Parallel Language"
   9.211 +@misc{EQNLangTechRep,
   9.212 +  Author =       {Halle, Sean},
   9.213 +  Note =         {http://www.soe.ucsc.edu/share/technical-reports/2009/ucsc-soe-09-16.pdf},
   9.214 +  Title =        {An Extensible Parallel Language},
   9.215 +  Year = 2009
   9.216 +}
   9.217 +
   9.218 +"A Hardware-Independent Parallel Operating System Abstraction Layer"
   9.219 +@misc{CTOSTechRep,
   9.220 +  Author =       {Halle, Sean},
   9.221 +  Note =         {http://www.soe.ucsc.edu/share/technical-reports/2009/ucsc-soe-09-15.pdf},
    9.222 +  Title =        {A Hardware-Independent Parallel Operating System Abstraction Layer},
   9.223 +  Year = 2009
   9.224 +}
   9.225 +
   9.226 +"Parallel Language Extensions for Side Effects"
   9.227 +@misc{SideEffectsTechRep,
   9.228 +  Author =       {Halle, Sean and Cohen, Albert},
   9.229 +  Note =         {http://www.soe.ucsc.edu/share/technical-reports/2009/ucsc-soe-09-14.pdf},
   9.230 +  Title =        {Parallel Language Extensions for Side Effects},
   9.231 +  Year = 2009
   9.232 +}
   9.233 +
   9.234 +
   9.235 +"BaCTiL: Base CodeTime Language"
   9.236 +@misc{BaCTiLTechRep,
   9.237 +  Author =       {Halle, Sean},
   9.238 +  Note =         {http://www.soe.ucsc.edu/share/technical-reports/2006/ucsc-crl-06-08.pdf},
   9.239 +  Title =        {BaCTiL: Base CodeTime Language},
   9.240 +  Year = 2006
   9.241 +}
   9.242 +
   9.243 +
   9.244 +"The Elements of the CodeTime Software Platform"
   9.245 +@misc{CTPlatformTechRep,
   9.246 +  Author =       {Halle, Sean},
   9.247 +  Note =         {http://www.soe.ucsc.edu/share/technical-reports/2006/ucsc-crl-06-09.pdf},
   9.248 +  Title =        {The Elements of the CodeTime Software Platform},
   9.249 +  Year = 2006
   9.250 +}
   9.251 +
   9.252 +
   9.253 +"A Scalable and Efficient Peer-to-Peer Run-Time System for a Hardware Independent Software Platform"
   9.254 +@misc{CTRTTechRep,
   9.255 +  Author =       {Halle, Sean},
   9.256 +  Note =         {http://www.soe.ucsc.edu/share/technical-reports/2006/ucsc-crl-06-10.pdf},
   9.257 +  Title =        {A Scalable and Efficient Peer-to-Peer Run-Time System for a Hardware Independent Software Platform},
   9.258 +  Year = 2006
   9.259 +}
   9.260 +
   9.261 +
   9.262 +"The Big-Step Operational Semantics of CodeTime Circuits"
    9.263 +@misc{BigStepTechRep,
   9.264 +  Author =       {Halle, Sean},
   9.265 +  Note =         {http://www.soe.ucsc.edu/share/technical-reports/2006/ucsc-crl-06-11.pdf},
   9.266 +  Title =        {The Big-Step Operational Semantics of CodeTime Circuits},
   9.267 +  Year = 2006
   9.268 +}
   9.269 +
   9.270 +
   9.271 +"A Mental Framework for use in Creating Hardware Independent Parallel Languages"
    9.272 +@misc{MentalFrameworkTechRep,
   9.273 +  Author =       {Halle, Sean},
   9.274 +  Note =         {http://www.soe.ucsc.edu/share/technical-reports/2006/ucsc-crl-06-12.pdf},
   9.275 +  Title =        {A Mental Framework for use in Creating Hardware Independent Parallel Languages},
   9.276 +  Year = 2006
   9.277 +}
   9.278 +
   9.279 +
   9.280 +"The Case for an Integrated Software Platform for HEC Illustrated Using the CodeTime Platform"
   9.281 +@misc{CIPTechRep,
   9.282 +  Author =       {Halle, Sean},
   9.283 +  Note =         {http://www.soe.ucsc.edu/share/technical-reports/2005/ucsc-crl-05-05.pdf},
   9.284 +  Title =        {The Case for an Integrated Software Platform for HEC Illustrated Using the CodeTime Platform},
   9.285 +  Year = 2005
   9.286 +}
   9.287 +
   9.288 +//==========================================
   9.289 +
   9.290 +
    9.291 +"OMP Home page"
   9.292 +@misc{OMPHome,
   9.293 +  Note =         {http://www.openmediaplatform.eu/},
   9.294 +  Title =        {{Open Media Platform} homepage},
   9.295 +}
   9.296 +
   9.297 +"The OMP infrastructure site"
   9.298 +@misc{Halle2008,
   9.299 +  Author =       {Sean Halle and Albert Cohen},
   9.300 +  Note =         {http://omp.musictwodotoh.com},
   9.301 +  Title =        {{DKU} infrastructure server}
   9.302 +}
   9.303 +
   9.304 +
   9.305 +
   9.306 +"The DKU sourceforge site"
   9.307 +@misc{DKUSourceForge,
   9.308 +  Author =       {Sean Halle and Albert Cohen},
   9.309 +  Month =        {November},
   9.310 +  Note =         {http://dku.sourceforge.net},
   9.311 +  Title =        {{DKU} website},
   9.312 +  Year =         {2008}
   9.313 +}
   9.314 +
   9.315 +
   9.316 +"The BLIS sourceforge site"
   9.317 +@misc{BLISHome,
   9.318 +  Author =       {Sean Halle and Albert Cohen},
   9.319 +  Month =        {November},
   9.320 +  Note =         {http://blisplatform.sourceforge.net},
   9.321 +  Title =        {{BLIS} website},
   9.322 +  Year =         {2008}
   9.323 +}
   9.324 +
   9.325 +
   9.326 +"The VMS Home page"
   9.327 +@misc{VMSHome,
   9.328 +  Author =       {Sean Halle and Merten Sach and Ben Juurlink and Albert Cohen},
   9.329 +  Note =         {http://virtualizedmasterslave.org},
   9.330 +  Title =        {{VMS} Home Page},
   9.331 +  Year =         {2010}
   9.332 +}
   9.333 +
   9.334 +
   9.335 +"The PStack Home page"
   9.336 +@misc{PStackHome,
   9.337 +  Author =       {Sean Halle},
   9.338 +  Note =         {http://pstack.sourceforge.net},
   9.339 +  Title =        {{PStack} Home Page},
   9.340 +  Year =         {2012}
   9.341 +}
   9.342 +
   9.343 +
   9.344 +"Deblocking code in SVN"
   9.345 +@misc{DeblockingCode,
   9.346 +  Note = {http://dku.svn.sourceforge.net/viewvc/dku/branches/DKU\_C\_\_Deblocking\_\_orig/},
   9.347 +  Title ={{DKU-ized Deblocking Filter} code}
   9.348 +}
   9.349 +
   9.350 +
   9.351 +
   9.352 +"Sample code on BLIS site"
   9.353 +@misc{SampleBLISCode,
   9.354 +  Note = {http://dku.sourceforge.net/SampleCode.htm},
   9.355 +  Title ={{Sample BLIS Code}}
   9.356 +}
   9.357 +
   9.358 +"Framework Technical Report"
   9.359 +@misc{FrameworkTechRep,
   9.360 +  Author =       {Halle, Sean and Nadezhkin, Dmitry and Cohen, Albert},
   9.361 +  Note =         {http://www.soe.ucsc.edu/share/technical-reports/2010/ucsc-soe-10-02.pdf},
   9.362 +  Title =        {A Framework to Support Research on Portable High Performance Parallelism}
   9.363 +}
   9.364 +
   9.365 +"Map reduce"
   9.366 +@misc{MapReduceHome,
    9.367 +  Author =       {{Google Corp.}},
   9.368 +  Note =         {http://labs.google.com/papers/mapreduce.html},
   9.369 +  Title =        {{MapReduce} Home page},
   9.370 +}
   9.371 +
   9.372 +
   9.373 +"TBB  Thread Building Blocks"
   9.374 +@misc{TBBHome,
    9.375 +  Author =       {{Intel Corp.}},
   9.376 +  Note =         {http://www.threadingbuildingblocks.org},
   9.377 +  Title =        {{TBB} Home page},
   9.378 +}
   9.379 +
   9.380 +
   9.381 +"HPF Wikipedia entry"
   9.382 +@misc{HPFWikipedia,
    9.383 +  Author =       {{Wikipedia}},
    9.384 +  Note =         {http://en.wikipedia.org/wiki/High\_Performance\_Fortran},
   9.385 +  Title =        {{HPF} wikipedia page},
   9.386 +}
   9.387 +
   9.388 +
   9.389 +"OpenMP Home page"
   9.390 +@misc{OpenMPHome,
    9.391 +  Author =       {{OpenMP organization}},
   9.392 +  Note =         {http://www.openmp.org},
   9.393 +  Title =        {{OpenMP} Home page}
   9.394 +}
   9.395 +
   9.396 +
   9.397 +
   9.398 +"Open MPI Home page"
   9.399 +@misc{MPIHome,
    9.400 +  Author =       {{Open MPI organization}},
   9.401 +  Note =         {http://www.open-mpi.org},
   9.402 +  Title =        {{Open MPI} Home page}
   9.403 +}
   9.404 +
   9.405 +"OpenCL Home page"
   9.406 +@misc{OpenCLHome,
    9.407 +  Author =       {{Khronos Group}},
   9.408 +  Note =         {http://www.khronos.org/opencl},
   9.409 +  Title =        {{OpenCL} Home page}
   9.410 +}
   9.411 +
   9.412 +
    9.413 +"CILK Home page"
   9.414 +@misc{CILKHome,
    9.415 +  Author =       {{Cilk group at MIT}},
   9.416 +  Note =         {http://supertech.csail.mit.edu/cilk/},
   9.417 +  Title =        {{CILK} homepage},
   9.418 +}
   9.419 +
   9.420 +@InProceedings{Fri98,
   9.421 +  author = 	 {M. Frigo and C. E. Leiserson and K. H. Randall},
   9.422 +  title = 	 {The Implementation of the Cilk-5 Multithreaded Language},
   9.423 +  booktitle = 	 {PLDI '98: Proceedings of the 1998 ACM SIGPLAN conference on Programming language design and implementation},
   9.424 +  pages =	 {212--223},
   9.425 +  year =	 1998,
   9.426 +  address =	 {Montreal, Quebec},
   9.427 +  month =	 jun
   9.428 +}
   9.429 +
   9.430 +
    9.431 +"Titanium Home page"
   9.432 +@misc{TitaniumHome,
   9.433 +  Note =         {http://titanium.cs.berkeley.edu},
   9.434 +  Title =        {{Titanium} homepage}
   9.435 +}
   9.436 +
   9.437 +
   9.438 +"CnC in HotPar"
   9.439 +@inproceedings{CnCInHotPar,
   9.440 +    author = {Knobe, Kathleen},
   9.441 +    booktitle = {HOTPAR '09: USENIX Workshop on Hot Topics in Parallelism},
   9.442 +    month = {March},
   9.443 +    title = {Ease of Use with Concurrent Collections {(CnC)}},
   9.444 +    year = {2009}
   9.445 +}
   9.446 +
   9.447 +
    9.448 +"CnC Home page"
   9.449 +@misc{CnCHome,
    9.450 +  Author =       {{Intel Corp.}},
   9.451 +  Note =         {http://software.intel.com/en-us/articles/intel-concurrent-collections-for-cc/},
   9.452 +  Title =        {{CnC} homepage},
   9.453 +}
   9.454 +
   9.455 +"Spiral Home page"
   9.456 +@misc{SpiralHome,
    9.457 +  Author =       {{Spiral Group at CMU}},
   9.458 +  Note =         {http://www.spiral.net},
   9.459 +  Title =        {{Spiral} homepage},
   9.460 +}
   9.461 +
   9.462 +
    9.463 +"Scala Home page"
   9.464 +@misc{ScalaHome,
    9.465 +  Author =       {{Scala organization}},
   9.466 +  Note =         {http://www.scala-lang.org/},
   9.467 +  Title =        {{Scala} homepage},
   9.468 +}
   9.469 +
   9.470 +
   9.471 +
   9.472 +
    9.473 +"UPC Home page"
   9.474 +@misc{UPCHome,
    9.475 +  Author =       {{UPC group at UC Berkeley}},
   9.476 +  Note =         {http://upc.lbl.gov/},
   9.477 +  Title =        {{Unified Parallel C} homepage},
   9.478 +}
   9.479 +
   9.480 +
    9.481 +"Suif Home page"
   9.482 +@misc{SuifHome,
   9.483 +  Note =         {http://suif.stanford.edu},
   9.484 +  Title =        {{Suif} Parallelizing compiler homepage},
   9.485 +}
   9.486 +
   9.487 +
   9.488 +
   9.489 +"SEJITS"
   9.490 +@article{SEJITS,
   9.491 +  author = {B. Catanzaro and S. Kamil and Y. Lee and K. Asanovic and J. Demmel and K. Keutzer and J. Shalf and K. Yelick and A. Fox},
   9.492 +  title = {SEJITS: Getting Productivity AND Performance With Selective Embedded JIT Specialization},
   9.493 +  journal = {First Workshop on Programmable Models for Emerging Architecture at the 18th International Conference on Parallel Architectures and Compilation Techniques },
   9.494 +  year = {2009} 
   9.495 +}
   9.496 +
   9.497 +
   9.498 +"Arnaldo 3D parallel on NXP chip"
   9.499 +@inproceedings{Arnaldo3D,
   9.500 +  author = {Azevedo, Arnaldo and Meenderinck, Cor and Juurlink, Ben and Terechko, Andrei and Hoogerbrugge, Jan and Alvarez, Mauricio and Ramirez, Alex},
   9.501 +  title = {Parallel H.264 Decoding on an Embedded Multicore Processor},
   9.502 +  booktitle = {HiPEAC '09: Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers},
   9.503 +  year = {2009},
   9.504 + pages = {404--418}
   9.505 + }
   9.506 +
   9.507 +
   9.508 +"Narayanan's GPU scheduling tool"
   9.509 +@article{NarayananGPUSched,
   9.510 +  author = {Narayanan Sundaram and Anand Raghunathan and Srimat T. Chakradhar},
   9.511 +  title = {A framework for efficient and scalable execution of domain-specific templates on GPUs},
   9.512 +  journal ={International Parallel and Distributed Processing Symposium {(IPDPS)}},
   9.513 +  year = {2009},
   9.514 +  pages = {1-12},
   9.515 +}
   9.516 +
   9.517 +"Polyhedral for GPU from Ohio State"
   9.518 +@inproceedings{PolyForGPU,
   9.519 +   author = {Baskaran, Muthu Manikandan and Bondhugula, Uday and Krishnamoorthy, Sriram and Ramanujam, J. and Rountev, Atanas and Sadayappan, P.},
    9.520 +   title = {A compiler framework for optimization of affine loop nests for {GPGPUs}},
   9.521 +   booktitle = {ICS '08: Proceedings of the 22nd annual international conference on Supercomputing},
   9.522 +   year = {2008},
   9.523 +   pages = {225--234},
   9.524 + }
   9.525 +
   9.526 +"Loulou's Polyhedral loop-nest optimization paper in PLDI 08"
   9.527 +@inproceedings{Loulou08,
   9.528 +   author = {Pouchet, Louis-No\"{e}l and Bastoul, C\'{e}dric and Cohen, Albert and Cavazos, John},
    9.529 +   title = {Iterative optimization in the polyhedral model: part {II}, multidimensional time},
   9.530 +   booktitle = {ACM SIGPLAN conference on Programming language design and implementation {(PLDI)} },
   9.531 +   year = {2008},
   9.532 +   pages = {90--100},
   9.533 + }
   9.534 + 
   9.535 +
   9.536 +"Merge in HotPar"
   9.537 +@inproceedings{MergeInHotPar,
   9.538 +    author = {Michael D. Linderman and James Balfour and Teresa H. Meng and William J. Dally},
   9.539 +    booktitle = {HOTPAR '09: USENIX Workshop on Hot Topics in Parallelism},
   9.540 +    month = {March},
    9.541 +    title = {Embracing Heterogeneity - Parallel Programming for Changing Hardware},
   9.542 +    year = {2009}
   9.543 +}
   9.544 +
   9.545 +
   9.546 +"Galois system for irregular problems"
   9.547 +@inproceedings{GaloisRef,
   9.548 +  author = {Kulkarni, Milind and Pingali, Keshav and Walter, Bruce and Ramanarayanan, Ganesh and Bala, Kavita and Chew, L. Paul},
   9.549 +  title = {Optimistic parallelism requires abstractions},
   9.550 +  booktitle = {PLDI '07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation},
   9.551 +  year = {2007},
   9.552 +  pages = {211--222}
   9.553 +}
   9.554 +
   9.555 +"Cool compiler book that talks about balancing task size with machine characteristics..  the one Amit had"
   9.556 +@book{Allen2002,
   9.557 +  author = {Kennedy, Ken and Allen, John R.},
   9.558 +  title = {Optimizing compilers for modern architectures: a dependence-based approach},
   9.559 +  year = {2002},
   9.560 +  publisher = {Morgan Kaufmann Publishers Inc.}
   9.561 + }
   9.562 +
   9.563 +
    9.564 +"Streaming languages and tools survey paper"
   9.565 +@MISC{Stephens95,
   9.566 +    author = {R. Stephens},
   9.567 +    title = {A Survey Of Stream Processing},
   9.568 +    year = {1995}
   9.569 +}
   9.570 +
   9.571 +
   9.572 +"Capsule"
   9.573 +@INPROCEEDINGS{Palatin06,
   9.574 +    author = {P Palatin and Y Lhuillier and O Temam},
    9.575 +    title = {CAPSULE: Hardware-assisted parallel execution of component-based programs},
    9.576 +    booktitle = {Proceedings of the 39th Annual International Symposium on Microarchitecture},
   9.577 +    year = {2006},
   9.578 +    pages = {247--258}
   9.579 +}
   9.580 +
    9.581 +"Sequoia"
   9.582 +@inproceedings{Sequioa06,
    9.583 + author = {Fatahalian, Kayvon and Horn, Daniel Reiter and Knight, Timothy J. and Leem, Larkhoon and Houston, Mike and Park, Ji Young and Erez, Mattan and Ren, Manman and Aiken, Alex and Dally, William J. and Hanrahan, Pat},
   9.584 + title = {Sequoia: programming the memory hierarchy},
   9.585 + booktitle = {SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing},
   9.586 + year = {2006},
   9.587 + pages = {83}
   9.588 + }
   9.589 +
   9.590 + 
   9.591 + 
   9.592 + 
   9.593 +"Cole meta skeletons book"
   9.594 +@Book{Cole89,
   9.595 +  author = 	     {M Cole},
   9.596 +  title = 	     {Algorithmic skeletons: Structured management of parallel computation},
   9.597 +  publisher =    {Pitman},
   9.598 +  year =         {1989}
   9.599 +}
   9.600 +
   9.601 +
   9.602 +"Meta programming skeletons example"
   9.603 +@INPROCEEDINGS{Ginhac98,
   9.604 +    author = {Dominique Ginhac and Jocelyn Serot and Jean Pierre Derutin},
   9.605 +    title = {Fast prototyping of image processing applications using functional skeletons on a MIMD-DM architecture},
    9.606 +    booktitle = {IAPR Workshop on Machine Vision and Applications},
   9.607 +    year = {1998},
   9.608 +    pages = {468--471}
   9.609 +}
   9.610 +
   9.611 +
   9.612 +"Parallel Skeletons meta programming"
   9.613 +@inproceedings{Serot08MetaParallel,
   9.614 + author = {Serot, Jocelyn and Falcou, Joel},
   9.615 + title = {Functional Meta-programming for Parallel Skeletons},
   9.616 + booktitle = {ICCS '08: Proceedings of the 8th international conference on Computational Science, Part I},
   9.617 + year = {2008},
   9.618 + pages = {154--163}
   9.619 + }
   9.620 + 
   9.621 + 
   9.622 +"Random skeletons for parallel programming article with lots of citations"
   9.623 +@INPROCEEDINGS{Darlington93,
   9.624 +    author = {J. Darlington and A. J. Field and P. G. Harrison and P. H. J. Kelly and D. W. N. Sharp and Q. Wu},
   9.625 +    title = {Parallel programming using skeleton functions},
    9.626 +    booktitle = {PARLE '93: Parallel Architectures and Languages Europe},
   9.627 +    year = {1993},
   9.628 +    pages = {146--160},
   9.629 +    publisher = {Springer-Verlag}
   9.630 +}
   9.631 +
   9.632 +
   9.633 +"View from Berkeley paper"
   9.634 +@article{Asanovic06BerkeleyView,
    9.635 +  title={{The Landscape of Parallel Computing Research: A View from Berkeley}},
   9.636 +  author={Asanovic, K. and Bodik, R. and Catanzaro, B.C. and Gebis, J.J. and Husbands, P. and Keutzer, K. and Patterson, D.A. and Plishker, W.L. and Shalf, J. and Williams, S.W. and others},
    9.637 +  journal={EECS Department, University of California, Berkeley, Technical Report No. UCB/EECS-2006-183},
    9.638 +  number={UCB/EECS-2006-183},
    9.639 +  month={December},
   9.641 +  year={2006},
   9.642 +}
   9.643 +
   9.644 +
   9.645 +
   9.646 +
   9.647 +"Berkeley Pattern Language"
   9.648 +@misc{BerkeleyPattLang,
   9.649 +  Note =         {http://parlab.eecs.berkeley.edu/wiki/patterns},
   9.650 +  Title =        {{Berkeley Pattern Language}}
   9.651 +}
   9.652 +
   9.653 +
    9.654 +"Keutzer recommended Parallel Prog Patterns book"
   9.655 +@book{Mattson04Patterns,
   9.656 +  title={{Patterns for parallel programming}},
   9.657 +  author={Mattson, T. and Sanders, B. and Massingill, B.},
   9.658 +  year={2004},
   9.659 +  publisher={Addison-Wesley Professional}
   9.660 +}
   9.661 +
   9.662 +
    9.663 +"Skillicorn Parallel Languages Survey"
   9.664 +@article{Skillicorn98,
   9.665 +  title={{Models and languages for parallel computation}},
   9.666 +  author={Skillicorn, D.B. and Talia, D.},
   9.667 +  journal={ACM Computing Surveys (CSUR)},
   9.668 +  volume={30},
   9.669 +  number={2},
   9.670 +  pages={123--169},
   9.671 +  year={1998}
   9.672 +}
   9.673 +
   9.674 +
   9.675 +
   9.676 +"NESL language"
   9.677 +@conference{Blelloch93NESL,
   9.678 +  title={{Implementation of a portable nested data-parallel language}},
   9.679 +  author={Blelloch, G.E. and Hardwick, J.C. and Chatterjee, S. and Sipelstein, J. and Zagha, M.},
   9.680 +  booktitle={Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming},
   9.681 +  pages={102--111},
   9.682 +  year={1993},
   9.683 +  organization={ACM New York, NY, USA}
   9.684 +}
   9.685 +
   9.686 +
   9.687 +"Sisal"
   9.688 +@article{McgrawSisal,
   9.689 +  title={{SISAL: Streams and iteration in a single assignment language: Reference manual version 1.2}},
   9.690 +  author={McGraw, J. and Skedzielewski, SK and Allan, SJ and Oldehoeft, RR and Glauert, J. and Kirkham, C. and Noyce, B. and Thomas, R.},
    9.691 +  journal={Manual M-146, Rev. 1},
    9.692 +  year={1985}
   9.693 +}
   9.694 +
   9.695 +
   9.696 +"Linda"
   9.697 +@article{Gelernter85Linda,
   9.698 +  title={{Generative communication in Linda}},
   9.699 +  author={Gelernter, D.},
   9.700 +  journal={ACM Transactions on Programming Languages and Systems (TOPLAS)},
   9.701 +  volume={7},
   9.702 +  number={1},
   9.703 +  pages={80--112},
   9.704 +  year={1985}
   9.705 +}
   9.706 +
   9.707 +
   9.708 +"ZPL"
   9.709 +@article{Lin94ZPL,
   9.710 +  title={{ZPL: An array sublanguage}},
   9.711 +  author={Lin, C. and Snyder, L.},
   9.712 +  journal={Lecture Notes in Computer Science},
   9.713 +  volume={768},
   9.714 +  pages={96--114},
   9.715 +  year={1994}
   9.716 +}
   9.717 +
   9.718 +
   9.719 +
   9.720 +
   9.721 +// Visual programming
   9.722 +@article
   9.723 + { baecker97,
   9.724 +   author = 	{Ron Baecker and Chris DiGiano and Aaron Marcus},
   9.725 +   title = 		{Software visualization for debugging},
   9.726 +   journal = 	{Communications of the ACM},
   9.727 +   volume = 	{40},
   9.728 +   number = 	{4},
   9.729 +   year = 		{1997}, 
   9.730 +   issn = 		{0001-0782},
   9.731 +   pages = 		{44--54},
   9.732 +   publisher = 	{ACM Press}
   9.733 + }
   9.734 +
   9.735 +
   9.736 +// Visual programming
   9.737 +@article
   9.738 + { ball96,
   9.739 +   author =	{T. A. Ball and S. G. Eick},
   9.740 +   title =	{Software Visualization in the Large},
   9.741 +   journal ={IEEE Computer},
   9.742 +   volume =	{29},
   9.743 +   number =	{4},
   9.744 +   year =	{1996},
   9.745 +   month =	{apr},
   9.746 +   pages =	{33--43}
   9.747 + }
   9.748 +
   9.749 +
   9.750 +// Milner references this, Chemical Abstract Machine
   9.751 +@book
   9.752 + {berry89,
   9.753 +  title={{The chemical abstract machine}},
   9.754 +  author={Berry, G. and Boudol, G.},
   9.755 +  year={1989},
   9.756 +  publisher={ACM Press}
   9.757 +}
   9.758 +
   9.759 +
   9.760 +// Cilk reference
   9.761 +@article
   9.762 + {blumofe95,
   9.763 + author = {Robert D. Blumofe and Christopher F. Joerg and Bradley C. Kuszmaul and Charles E. Leiserson and Keith H. Randall and Yuli Zhou},
   9.764 + title = {Cilk: an efficient multithreaded runtime system},
   9.765 + journal = {SIGPLAN Not.},
   9.766 + volume = {30},
   9.767 + number = {8},
   9.768 + year = {1995},
   9.769 + pages = {207--216}
   9.770 + }
   9.771 +
   9.772 +
   9.773 +// this has 1440 citations, so throwing it in..
   9.774 +// The complexity of symbolic checking of program correctness
   9.775 +@article
   9.776 + {burch90,
    9.777 +  title={{Symbolic model checking: $10^{20}$ states and beyond}},
   9.778 +  author={Burch, JR and Clarke, EM and McMillan, KL and Dill, DL and Hwang, LJ},
   9.779 +  journal={Logic in Computer Science, 1990. LICS'90, Proceedings},
   9.780 +  pages={428--439},
   9.781 +  year={1990}
   9.782 +}
   9.783 +
   9.784 +@article
   9.785 + {chamberlain98,
   9.786 +author = {B. Chamberlain and S. Choi and E. Lewis and C. Lin and L. Snyder and W. Weathersby},
   9.787 +title = {ZPL's WYSIWYG Performance Model},
    9.788 +journal = {HIPS '98: Proceedings of the Third International Workshop on High-Level Parallel Programming Models and Supportive Environments},
   9.789 +volume = {00},
   9.790 +year = {1998},
   9.791 +isbn = {0-8186-8412-7},
   9.792 +pages = {50}
   9.793 +}
   9.794 +
   9.795 +
   9.796 +
   9.797 +// from http://libweb.princeton.edu/libraries/firestone/rbsc/aids/church/church1.html#1
   9.798 +@article{church41,
   9.799 +   author={A. Church},
   9.800 +   title={The Calculi of Lambda-Conversion},
   9.801 +   journal={Annals of Mathematics Studies},
   9.802 +   number={6},
   9.803 +   year={1941},
   9.804 +   publisher={Princeton University}
   9.805 +}
   9.806 +
   9.807 +
   9.808 +@misc
   9.809 + { CodeTimeSite,
   9.810 +   author =	{Sean Halle},
   9.811 +   key =	{CodeTime},
   9.812 +   title = 	{Homepage for The CodeTime Parallel Software Platform},
   9.813 +   note = 	{{\ttfamily http://codetime.sourceforge.net}}
   9.814 + }
   9.815 +
   9.816 +
   9.817 +
   9.818 +@misc
   9.819 + { CodeTimePlatform,
   9.820 +   author =	{Sean Halle},
   9.821 +   key =	{CodeTime},
   9.822 +   title = 	{The CodeTime Parallel Software Platform},
   9.823 +   note = 	{{\ttfamily http://codetime.sourceforge.net/content/CodeTime\_Platform.pdf}}
   9.824 + }
   9.825 +
   9.826 +
   9.827 +@misc
   9.828 + { CodeTimeVS,
   9.829 +   author =	{Sean Halle},
   9.830 +   key =	{CodeTime},
   9.831 +   title = 	{The Specification of the CodeTime Platform's Virtual Server},
   9.832 +   note = 	{{\ttfamily http://codetime.sourceforge.net/content/CodeTime\_Virtual\_Server.pdf}}
   9.833 + }
   9.834 +
   9.835 +
   9.836 +@misc
   9.837 + { CodeTimeOS,
   9.838 +   author =	{Sean Halle},
   9.839 +   key =	{CodeTime},
   9.840 +   title = 	{A Hardware Independent OS},
   9.841 +   note = 	{{\ttfamily http://codetime.sourceforge.net/content/CodeTime\_OS.pdf}}
   9.842 + }
   9.843 +
   9.844 +
   9.845 +@misc
   9.846 + { CodeTimeSem,
   9.847 +   author =	{Sean Halle},
   9.848 +   key =	{CodeTime},
   9.849 +   title = 	{The Big-Step Operational Semantics of the CodeTime Computational Model},
   9.850 +   note = 	{{\ttfamily http://codetime.sourceforge.net/content/CodeTime\_Semantics.pdf}}
   9.851 + }
   9.852 +
   9.853 +
   9.854 +@misc
   9.855 + { CodeTimeTh,
   9.856 +   author =	{Sean Halle},
   9.857 +   key =	{CodeTime},
   9.858 +   title = 	{A Mental Framework for Use in Creating Hardware-Independent Parallel Languages},
   9.859 +   note = 	{{\ttfamily http://codetime.sourceforge.net/content/CodeTiime\_Theoretical\_Framework.pdf}}
   9.860 + }
   9.861 +
   9.862 +
   9.863 +@misc
   9.864 + { CodeTimeTh1,
   9.865 +   author =	{Sean Halle},
   9.866 +   key =	{CodeTime},
   9.867 +   title = 	{The CodeTime Parallel Software Platform},
   9.868 +   note = 	{{\ttfamily http://codetime.sourceforge.net}}
   9.869 + }
   9.870 +
   9.871 +
   9.872 +@misc
   9.873 + { CodeTimeTh2,
   9.874 +   author =	{Sean Halle},
   9.875 +   key =	{CodeTime},
   9.876 +   title = 	{The CodeTime Parallel Software Platform},
   9.877 +   note = 	{{\ttfamily http://codetime.sourceforge.net}}
   9.878 + }
   9.879 +
   9.880 +
   9.881 +@misc
   9.882 + { CodeTimeRT,
   9.883 +   author =	{Sean Halle},
   9.884 +   key =	{CodeTime},
   9.885 +   title = 	{The CodeTime Parallel Software Platform},
   9.886 +   note = 	{{\ttfamily http://codetime.sourceforge.net}}
   9.887 + }
   9.888 +
   9.889 +
   9.890 +@misc
    9.891 + { CodeTimeWebSite,
   9.892 +   author =	{Sean Halle},
   9.893 +   key =	{CodeTime},
   9.894 +   title = 	{The CodeTime Parallel Software Platform},
   9.895 +   note = 	{{\ttfamily http://codetime.sourceforge.net}}
   9.896 + }
   9.897 +
   9.898 +
   9.899 +@misc
   9.900 + { CodeTimeBaCTiL,
   9.901 +   author =	{Sean Halle},
   9.902 +   key =	{CodeTime},
   9.903 +   title = 	{The Base CodeTime Language},
   9.904 +   note = 	{{\ttfamily http://codetime.sourceforge.net/content/CodeTime\_BaCTiL.pdf}}
   9.905 + }
   9.906 +
   9.907 +@misc
   9.908 + { CodeTimeCert,
   9.909 +   author =	{Sean Halle},
   9.910 +   key =	{CodeTime},
   9.911 +   title = 	{The CodeTime Certification Strategy},
   9.912 +   note = 	{{\ttfamily http://codetime.sourceforge.net/content/CodeTime\_Certification.pdf}}
   9.913 + }
   9.914 +
   9.915 +
   9.916 +// Multiple inheritance: explains issues well and references LOOPS and CLOS
   9.917 +@inproceedings{ducournau94,
   9.918 +  author = {R. Ducournau and M. Habib and M. Huchard and M. L. Mugnier},
   9.919 +  title = {Proposal for a monotonic multiple inheritance linearization},
   9.920 +  booktitle = {OOPSLA '94: Proceedings of the ninth annual conference on Object-oriented programming systems, language, and applications},
   9.921 +  year = {1994},
   9.922 +  pages = {164--175},
   9.923 +  publisher = {ACM Press}
   9.924 +}
   9.925 +
   9.926 +
   9.927 +// 252 Citations, shows equivalence of mu-calculus and (nondeterministic) tree automata,
   9.928 +// so cited as foundation a lot
   9.929 +@article{emerson91,
   9.930 +  title={{Tree automata, mu-calculus and determinacy}},
   9.931 +  author={Emerson, EA and Jutla, CS},
   9.932 +  journal={Proceedings of the 32nd Symposium on Foundations of Computer Science},
   9.933 +  pages={368--377},
   9.934 +  year={1991}
   9.935 +}
   9.936 +
   9.937 +
    9.938 +// Introduces the PRAM model, at the same time and in the same conference as goldschlager78
   9.939 +@article{fortune78,
   9.940 +  title={{Parallelism in random access machines}},
   9.941 +  author={Fortune, S. and Wyllie, J.},
   9.942 +  journal={STOC '78: Proceedings of the tenth annual ACM symposium on Theory of computing},
   9.943 +  pages={114--118},
   9.944 +  year={1978},
   9.945 +  publisher={ACM Press New York, NY, USA}
   9.946 +}
   9.947 +
   9.948 +
   9.949 +
   9.950 +// Smalltalk reference
   9.951 +@book{goldberg83,
   9.952 +  title={{Smalltalk-80: the language and its implementation}},
   9.953 +  author={Goldberg, A. and Robson, D.},
   9.954 +  year={1983},
   9.955 +  publisher={Addison-Wesley}
   9.956 +}
   9.957 +
   9.958 +
   9.959 +// also introduces PRAM model, apparently independently
   9.960 +@inproceedings{goldschlager78,
   9.961 + author = {Leslie M. Goldschlager},
   9.962 + title = {A unified approach to models of synchronous parallel machines},
   9.963 + booktitle = {STOC '78: Proceedings of the tenth annual ACM symposium on Theory of computing},
   9.964 + year = {1978},
   9.965 + pages = {89--94},
   9.966 + location = {San Diego, California, United States},
   9.967 + doi = {http://doi.acm.org/10.1145/800133.804336},
   9.968 + publisher = {ACM Press},
   9.969 +}
   9.970 +
   9.971 +
   9.972 +// Java spec
   9.973 +@book
   9.974 + { gosling96,
   9.975 +   author = 	{J. Gosling and B. Joy and G. Steele and G. Bracha},
   9.976 +   title = 		{The Java Language Specification},
   9.977 +   publisher = 	{Addison-Wesley},
   9.978 +   year = 	{1996}
   9.979 + }
   9.980 +
   9.981 +
   9.982 +//  Survey of prototyping parallel apps
   9.983 +@article{hasselbring00,
   9.984 + author = {Wilhelm Hasselbring},
   9.985 + title = {Programming languages and systems for prototyping concurrent applications},
   9.986 + journal = {ACM Comput. Surv.},
   9.987 + volume = {32},
   9.988 + number = {1},
   9.989 + year = {2000},
   9.990 + issn = {0360-0300},
   9.991 + pages = {43--79},
   9.992 + doi = {http://doi.acm.org/10.1145/349194.349199},
   9.993 + publisher = {ACM Press},
   9.994 + address = {New York, NY, USA},
   9.995 + }
   9.996 +
   9.997 +
   9.998 +// Original CSP paper
   9.999 +@article{hoare78,
  9.1000 +   author={C. A. R. Hoare},
  9.1001 +   title={Communicating Sequential Processes},
  9.1002 +   journal={Communications of the ACM},
  9.1003 +   year={1978},
  9.1004 +   volume={21},
  9.1005 +   number={8},
  9.1006 +   pages={666-677}
  9.1007 +}
  9.1008 +
  9.1009 +
  9.1010 +// 8 citations.. probably from self..  want a paper that ties areas together..  
  9.1011 +// This paper does a beautiful job..
  9.1012 +@article{huth,
  9.1013 +  title={{A Unifying Framework for Model Checking Labeled Kripke Structures, Modal Transition Systems, and Interval Transition Systems}},
  9.1014 +  author={Huth, M.},
  9.1015 +  journal={Proceedings of the 19th International Conference on the Foundations of Software Technology \& Theoretical Computer Science, Lecture Notes in Computer Science},
  9.1016 +  pages={369--380},
  9.1017 +  publisher={Springer-Verlag}
  9.1018 +}
  9.1019 +
  9.1020 +
  9.1021 +//  Dataflow advances survey, includes large grain dataflow
  9.1022 +@article
  9.1023 + { johnston04,
  9.1024 +   author = 	{Wesley M. Johnston and J. R. Paul Hanna and Richard J. Millar},
  9.1025 +   title = 		{Advances in dataflow programming languages},
  9.1026 +   journal = 	{ACM Comput. Surv.},
  9.1027 +   volume = 	{36},
  9.1028 +   number = 	{1},
  9.1029 +   year = 		{2004},
  9.1030 +   issn = 		{0360-0300},
  9.1031 +   pages = 		{1--34},
  9.1032 +   doi = 		{http://doi.acm.org/10.1145/1013208.1013209},
  9.1033 +   publisher = 	{ACM Press},
  9.1034 +   address = 	{New York, NY, USA}
  9.1035 + }
  9.1036 +
  9.1037 +
  9.1038 +@book
  9.1039 + { koelbel93,
  9.1040 +   author =	{C. H. Koelbel and D. Loveman and R. Schreiber and G. Steele Jr},
  9.1041 +   title = 		{High Performance Fortran Handbook},
  9.1042 +   year = 	{1993},
  9.1043 +   publisher =	{MIT Press}
  9.1044 + }
  9.1045 +
  9.1046 +
  9.1047 +// mu calculus paper with 430 citations
  9.1048 +@article{kozen83,
  9.1049 +  title={{Results on the Propositional mu-Calculus}},
  9.1050 +  author={Kozen, D.},
  9.1051 +  journal={TCS},
  9.1052 +  volume={27},
  9.1053 +  pages={333--354},
  9.1054 +  year={1983}
  9.1055 +}
  9.1056 +
  9.1057 +
  9.1058 +// original kripke structure paper
  9.1059 +@article{kripke63,
  9.1060 +  title={{Semantical analysis of modal logic}},
  9.1061 +  author={Kripke, S.},
  9.1062 +  journal={Zeitschrift fur Mathematische Logik und Grundlagen der Mathematik},
  9.1063 +  volume={9},
  9.1064 +  pages={67--96},
  9.1065 +  year={1963}
  9.1066 +}
  9.1067 +
  9.1068 +
  9.1069 +@book
  9.1070 + { mcGraw85,
  9.1071 +   author = 	{J McGraw and S. Skedzielewski and S. Allan and R Odefoeft},
  9.1072 +   title = 		{SISAL: Streams and Iteration in a Single-Assignment Language: Reference Manual Version 1.2},
  9.1073 +   note = 	{Manual M-146 Rev. 1},
  9.1074 +   publisher = 	{Lawrence Livermore National Laboratory},
  9.1075 +   year = 	{1985}
  9.1076 + }
  9.1077 +
  9.1078 +
  9.1079 +// Milner's own citation to development of CCS
  9.1080 +@book{milner80,
  9.1081 +  title={{A Calculus of Communicating Systems, volume 92 of Lecture Notes in Computer Science}},
  9.1082 +  author={Milner, R.},
  9.1083 +  year={1980},
  9.1084 +  publisher={Springer-Verlag}
  9.1085 +}
  9.1086 +
  9.1087 +
  9.1088 +// Milner's own pi-calculus reference
  9.1089 +@article{milner92,
  9.1090 +  title={{A calculus of mobile processes, parts I and II}},
  9.1091 +  author={Milner, R. and Parrow, J. and Walker, D.},
  9.1092 +  journal={Information and Computation},
  9.1093 +  volume={100},
  9.1094 +  number={1},
  9.1095 +  pages={1--40 and 41--77},
  9.1096 +  year={1992},
  9.1097 +  publisher={Academic Press}
  9.1098 +}
  9.1099 +
  9.1100 +
  9.1101 +// more recent Pi calculus reference
  9.1102 +@book
  9.1103 + { milner99,
  9.1104 +   author = 	{Robin Milner},
  9.1105 +   title = 		{Communicating and Mobile Systems: The pi-Calculus},
  9.1106 +   publisher = 	{Cambridge University Press},
  9.1107 +   year = 	{1999}
  9.1108 + }
  9.1109 +
  9.1110 +
  9.1111 +// MPI reference
  9.1112 +@book
  9.1113 + { MPIForum94,
   9.1114 +   author = 	{{Message Passing Interface Forum}},
  9.1115 +   title = 		{MPI: A Message-Passing Interface Standard},
  9.1116 +   year = 	{1994}
  9.1117 + }
  9.1118 +
  9.1119 +
  9.1120 +// Petri nets original citation
  9.1121 +@article{petri62,
  9.1122 +  title={{Fundamentals of a theory of asynchronous information flow}},
  9.1123 +  author={Petri, C.A.},
  9.1124 +  journal={Proc. IFIP Congress},
  9.1125 +  volume={62},
  9.1126 +  pages={386--390},
  9.1127 +  year={1962}
  9.1128 +}
  9.1129 +
  9.1130 +
  9.1131 +// Pierce Type system book
  9.1132 +@book{pierce02,
  9.1133 +   title={Types and Programming Languages},
  9.1134 +   author={Pierce, B. C.},
  9.1135 +   year={2002},
  9.1136 +   publisher={MIT Press}
  9.1137 +}
  9.1138 +
  9.1139 +
  9.1140 +// Survey of Visual programming
  9.1141 +@Article
  9.1142 + { price,
  9.1143 +   author =	{B. A. Price and R. M. Baecker and L. S. Small},
  9.1144 +   title =	{A Principled Taxonomy of Software Visualization},
  9.1145 +   journal ={Journal of Visual Languages and Computing},
  9.1146 +   volume =	{4},
  9.1147 +   number =	{3},
   9.1148 +   pages =	{211--266}, year = {1993}
  9.1149 + }
  9.1150 +
  9.1151 +
  9.1152 +
  9.1153 +@misc
  9.1154 + { pythonWebSite,
  9.1155 +   key = 	{Python},
  9.1156 +   title = 		{The Python Software Foundation Mission Statement},
  9.1157 +   note = 	{{\ttfamily http://www.python.org/psf/mission.html}}
  9.1158 + }
  9.1159 +
  9.1160 +
  9.1161 +// Roadmap for Revitalization of High End Computing
  9.1162 +@unpublished
  9.1163 + { reed03,
  9.1164 +   editor = 	{Daniel A. Reed},
  9.1165 +   title = 		{Workshop on The Roadmap for the Revitalization of High-End Computing},
  9.1166 +   day = 	{16--18},
  9.1167 +   month = 	{jun},
  9.1168 +   year = 	{2003},
  9.1169 +   note = 	{Available at {\ttfamily http://www.cra.org/reports/supercomputing.web.pdf}}
  9.1170 + }
  9.1171 +
  9.1172 +
  9.1173 +// Parallel Pascal
  9.1174 +@Article
  9.1175 + { reeves84,
  9.1176 +   author =	{A. P. Reeves},
  9.1177 +   title =		{Parallel Pascal -- An Extended Pascal for Parallel Computers},
  9.1178 +   journal =	{Journal of Parallel and Distributed Computing},
  9.1179 +   volume =	{1},
  9.1180 +   number =	{},
  9.1181 +   year =	{1984},
  9.1182 +   month =	{aug},
  9.1183 +   pages =	{64--80}
  9.1184 + }
  9.1185 +
  9.1186 +
  9.1187 +// Survey of parallel langs and models
  9.1188 +@article{skillicorn98,
  9.1189 + author = {David B. Skillicorn and Domenico Talia},
  9.1190 + title = {Models and languages for parallel computation},
  9.1191 + journal = {ACM Comput. Surv.},
  9.1192 + volume = {30},
  9.1193 + number = {2},
  9.1194 + year = {1998},
  9.1195 + issn = {0360-0300},
  9.1196 + pages = {123--169},
  9.1197 + doi = {http://doi.acm.org/10.1145/280277.280278},
  9.1198 + publisher = {ACM Press},
  9.1199 + address = {New York, NY, USA},
  9.1200 + }
  9.1201 +
  9.1202 +
  9.1203 +// LOOPS ref for multiple inheritance issues
  9.1204 +@article{stefik86,
  9.1205 +  title={Object Oriented Programming: Themes and Variations},
  9.1206 +  author={Stefik, M. and Bobrow, D. G.},
  9.1207 +  journal={The AI Magazine},
  9.1208 +  volume={6},
  9.1209 +  number={4},
  9.1210 +  year={1986}
  9.1211 +}
  9.1212 +
  9.1213 +
  9.1214 +// 240 citations to this book, so it seems safe..  covers modal logics, which is a superset
  9.1215 +//  of temporal logics
  9.1216 +@book{stirling92,
  9.1217 +  title={{Modal and Temporal Logics}},
  9.1218 +  author={Stirling, C.},
  9.1219 +  year={1992},
  9.1220 +  publisher={University of Edinburgh, Department of Computer Science}
  9.1221 +}
  9.1222 +
  9.1223 +
  9.1224 +//  Titanium website
  9.1225 +@misc
  9.1226 + { TitaniumWebSite,
   9.1227 +   author =	{Paul Hilfinger and others},
  9.1228 +   title = 	{The Titanium Project Home Page},
  9.1229 +   note = 	{{\ttfamily http://www.cs.berkeley.edu/projects/titanium}}
  9.1230 + }
  9.1231 +
  9.1232 +
  9.1233 +// website with scans of original work by Turing
  9.1234 +@misc{turing38,
  9.1235 +   author={A. Turing},
  9.1236 +   note={http://www.turingarchive.org/intro/, and
  9.1237 +http://www.turing.org.uk/sources/biblio4.html, and
  9.1238 +http://web.comlab.ox.ac.uk/oucl/research/areas/ieg/e-library/sources/tp2-ie.pdf},
  9.1239 +   year={1938}
  9.1240 +}
  9.1241 +
  9.1242 +
  9.1243 +// First mention of von Neumann's architecture ideas
  9.1244 +@book{vonNeumann45,
  9.1245 +   title={First Draft of a Report on the EDVAC},
  9.1246 +   author={J. von Neumann},
  9.1247 +   year={1945},
  9.1248 +   publisher={United States Army Ordnance Department}
  9.1249 +}
  9.1250 +
  9.1251 +
  9.1252 +// The 1993 Glynn Winskel book for Formal Semantics
  9.1253 +@book{winskel93,
  9.1254 +  title={{The Formal Semantics of Programming Languages}},
  9.1255 +  author={Winskel, G.},
  9.1256 +  year={1993},
  9.1257 +  publisher={MIT Press}
  9.1258 +}
  9.1259 +
  9.1260 +
    10.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
    10.2 +++ b/0__Papers/PRT/PRT__intro_plus_eco_contrast/latex/PRT__intro_plus_eco_syst_and_contrast.tex	Tue Sep 17 06:30:06 2013 -0700
    10.3 @@ -0,0 +1,2644 @@
    10.4 +%-----------------------------------------------------------------------------
    10.5 +%
    10.6 +%               Template for sigplanconf LaTeX Class
    10.7 +%
    10.8 +% Name:         sigplanconf-template.tex
    10.9 +%
   10.10 +% Purpose:      A template for sigplanconf.cls, which is a LaTeX 2e class
   10.11 +%               file for SIGPLAN conference proceedings.
   10.12 +%
   10.13 +% Guide:        Refer to "Author's Guide to the ACM SIGPLAN Class,"
   10.14 +%               sigplanconf-guide.pdf
   10.15 +%
   10.16 +% Author:       Paul C. Anagnostopoulos
   10.17 +%               Windfall Software
   10.18 +%               978 371-2316
   10.19 +%               paul@windfall.com
   10.20 +%
   10.21 +% Created:      15 February 2005
   10.22 +%
   10.23 +%-----------------------------------------------------------------------------
   10.24 +
   10.25 +
   10.26 +\documentclass[preprint]{sigplanconf}
   10.27 +
   10.28 +% The following \documentclass options may be useful:
   10.29 +%
   10.30 +% 10pt          To set in 10-point type instead of 9-point.
   10.31 +% 11pt          To set in 11-point type instead of 9-point.
   10.32 +% authoryear    To obtain author/year citation style instead of numeric.
   10.33 +\usepackage{amssymb,graphicx,calc,ifthen,subfig,dblfloatfix,fixltx2e}
   10.34 +
   10.35 +
   10.36 +% correct bad hyphenation here
   10.37 +\hyphenation{op-tical net-works semi-conduc-tor}
   10.38 +
   10.39 +\usepackage{wasysym}
   10.40 +\usepackage{amstext}
   10.41 +
   10.42 +\begin{document}
   10.43 +
   10.44 +\bibliographystyle{plain}
   10.45 +%
   10.46 +
   10.47 +\conferenceinfo{WXYZ '05}{date, City.} 
   10.48 +\copyrightyear{2005} 
   10.49 +\copyrightdata{[to be supplied]} 
   10.50 +
   10.51 +\titlebanner{banner above paper title}        % These are ignored unless
   10.52 +\preprintfooter{short description of paper}   % 'preprint' option specified.
   10.53 +
   10.54 +
   10.55 +\title{ The Proto-Runtime Infrastructure for Fast, Modular
   10.56 +Implementation of High Performance Parallel Runtime
   10.57 +Systems}
   10.58 +
   10.59 +
   10.60 +\authorinfo{Sean Halle}
   10.61 +           {Open Source Research Institute, INRIA,
   10.62 +           and TU Berlin}
   10.63 +           {seanhalle@opensourceresearchinstitute.org}
   10.64 +\authorinfo{Merten Sach}
   10.65 +           {TU Berlin}
   10.66 +           {msach@mailbox.tu-berlin.de}
   10.67 +\authorinfo{Albert Cohen}
    10.68 +           {\'Ecole Normale Sup\'erieure, and INRIA}
   10.69 +           {albert.cohen@inria.fr}
   10.70 +
   10.71 +\maketitle
   10.72 +
   10.73 +
   10.74 +\begin{abstract}
   10.75 + 
   10.76 +
   10.77 +
   10.78 +The proto-runtime approach has been used to implement
   10.79 +the runtime behavior of several parallel languages, including Reo[], PRDSL[], and HWSim[]. As detailed
   10.80 +in other papers, each language's
   10.81 +runtime system is high performance on multiple hardware platforms, including multi-core, NUMA, Adapteva, and
   10.82 +Kalray. The proto-runtime infrastructure made the implementations
   10.83 +fast, and the porting nearly effortless, while adding debugging
   10.84 +and performance monitoring features to the languages. In general, the proto-runtime approach speeds
   10.85 +implementation of the runtime system, makes the runtime
   10.86 +code portable across hardware, and provides debugging facilities that are otherwise difficult to obtain.
   10.87 +Despite the successes, no publications covering the approach
   10.88 +have yet been accepted to a conference or journal.  Here we address this shortcoming by describing the
   10.89 +theory of the approach and the core architecture of its implementation, which is roughly the same on all
   10.90 +hardware platforms. 
   10.91 +
   10.92 +
   10.93 +?
   10.94 +
   10.95 +Why no pthreads -- those are portable, so is RPC
   10.96 +
   10.97 +Why not CAS custom -- that's high performance
   10.98 +
   10.99 +Why not MPI -- that's high performance and portable
  10.100 +
  10.101 +What extra does it buy, using PRT?
  10.102 +
  10.103 +Who is going to use it?
  10.104 +
  10.105 +?
  10.106 +
  10.107 +The proto-runtime abstraction has the potential to replace the Thread
  10.108 +abstraction, along with its primitives such as
  10.109 +semaphores, locks, critical sections, atomic
  10.110 +instructions like CAS and similar low-level building blocks,
  10.111 +as the basis upon which the runtime systems for parallel
  10.112 +languages and
  10.113 +operating systems  are built.  The proto-runtime abstraction
  10.114 + better balances many competing
  10.115 +factors, to provide value in  the big picture. It has
  10.116 +better direct hardware implementations, while its extensible
  10.117 +approach 
  10.118 + places complex parallel language constructs on the
  10.119 +same intimate hardware level as the current OS kernel's implementation of Thread constructs. It simultaneously makes those
  10.120 +complex language constructs easier to implement than
  10.121 +they are when using Thread constructs or atomic hardware instructions.    
  10.122 +It additionally improves the portability of parallel
  10.123 +application code and the portability of the parallel
  10.124 +language runtime system implementations. Further, the
  10.125 +proto-runtime abstraction makes key services for debugging,
   10.126 +verification, and similar language features conveniently
  10.127 +available to language implementers. This balance and
  10.128 +its portability
   10.129 +benefits make it suitable as the basis for an ecosystem
   10.130 +that addresses the goal of writing code once and running it
   10.131 +with high performance anywhere [Hotpar paper].
  10.132 +
  10.133 +?
  10.134 +
  10.135 +
  10.136 +
  10.137 +Thinking purely locally, in any given case, the number
  10.138 +of factors of interest can be reduced to the point
  10.139 +that any one competing approach can look superior.
  10.140 + However, in the larger picture, with all the 
  10.141 +factors included, proto-runtime is the only approach
  10.142 +that is strong in every
  10.143 +aspect. It is the only approach that balances all aspects
  10.144 +critical to an industry wide infrastructure that  future-proofs
  10.145 +existing
  10.146 +software, making it high performance on future architectures,
  10.147 +while  making the introduction of new architectures
  10.148 +quick and low effort, providing a ready base of applications.   ?
  10.149 +
  10.150 +?
  10.151 +
  10.152 +Domain Specific Languages that are embedded into a base language have promise to provide productivity, performant-portability and wide adoption for parallel programming. However such languages have too few users to support the large effort required to create them and port them across hardware platforms, resulting in low adoption of the method.
  10.153 +As one step to ameliorate this, we apply the proto-runtime approach, which reduces the effort to create and port the runtime systems of parallel languages. It modularizes the creation of runtime systems and the parallelism constructs they implement, by providing an interface
  10.154 +that separates the language-construct  and scheduling logic away from the low-level runtime details, including concurrency, memory consistency, and runtime-performance aspects.
  10.155 +As a result, new parallel constructs are written using sequential reasoning,  multiple languages can be mixed within
  10.156 +the same program, and reusable services such as performance
  10.157 +tuning and debugging
  10.158 +support are available. In addition, scheduling of work onto hardware is under language and application control, without interference from an underlying thread package scheduler. This enables higher quality scheduling decisions for higher application performance.
  10.159 +We present measurements of the time taken to develop runtimes for  new languages, as well as time to re-implement for existing ones,  which average  a few days each.  In addition, we measure performance of implementations
   10.160 +based on proto-runtime, going head-to-head with the standard distributions of Cilk, StarSs (OMPSs), and POSIX threads, showing that the proto-runtime matches or outperforms them on large servers in all cases.
  10.161 +
  10.162 +?
  10.163 +
  10.164 +
   10.165 +replace lang-specific with interface, centralize services, minimize effort to create, give language control over hardware assignment..  side benefits: multi-lang, perf-tuning, debugging
\end{abstract}
  10.166 +
  10.167 +
  10.168 +
  10.169 +
  10.170 +
  10.171 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  10.172 +\section{Background and Motivation}
  10.173 +\label{sec:intro}
  10.174 +
   10.175 +[Note to reviewers: this paper's style and structure follow the official PPoPP guide to writing style, which is linked to the PPoPP website. We are taking on faith that the approach has been communicated effectively to reviewers and that we won't be penalized for following its recommended structure and approach.]
  10.176 +
  10.177 +As  hardware  becomes increasingly parallel, programming must also
  10.178 +become parallel.  However,  the transition from sequential to parallel programming has been slow due to  the difficulty of the traditional parallel programming methods. 
  10.179 +
   10.180 +The main difficulties with parallel programming are: 1) a difficult mental model, which reduces productivity, 2) the additional effort of rewriting the code for each hardware target to get acceptable performance, and 3) disruption to existing practices, including a steep learning curve, changes to the tools used, and changes in work practices. 
  10.181 +
  10.182 +Many believe that these can be overcome with the use of embedded style parallel Domain-Specific Languages (epDSLs) []. epDSL language
  10.183 +constructs match the mental model of the domain, while
  10.184 +they internally imply parallelism. For example, a simulation
  10.185 +epDSL called HWSim[] has only 10 constructs, which match
  10.186 +the actions taken during a simulation
  10.187 +of interacting objects.  They are mixed into sequential C code and take
  10.188 +only a couple of hours to learn.  Yet they encapsulate subtle
  10.189 +and complex dependencies that relate simulated time
  10.190 +to the physical time in the machine. They encapsulate the parallelism
  10.191 +present, while simultaneously making the implementation
  10.192 +simpler to think about than a purely sequential implementation.
  10.193 +
  10.194 +
  10.195 +
   10.196 + Despite this, the adoption of such languages has been slow, which we believe is due to the cost of creating them and of porting them across hardware targets. Because each language is specific to a narrow domain, its small number of users makes this cost impractical.
  10.197 +
  10.198 +We propose that a method that makes epDSLs lower cost to produce as well as to port across hardware targets will allow them to fulfill their promise. We  show
  10.199 +how to apply the proto-runtime approach to help towards this goal.  
  10.200 +
  10.201 +In this approach, a language's runtime system is built
  10.202 +as a plugin that is connected to a pre-existing proto-runtime  instance installed on given hardware. Together, the plugin
  10.203 +plus proto-runtime instance form the runtime system
  10.204 +of the language. The proto-runtime instance itself acts as the infrastructure of a runtime system, and
  10.205 +encapsulates most of the hardware-specific details,
  10.206 +while providing a number of services for use by the
  10.207 +plugged in language module. 
  10.208 +
  10.209 +A proto-runtime instance is essentially a full runtime, but with two key pieces replaced by an interface. One  piece replaced is the logic of language constructs, and the other is logic for choosing which core to assign work onto. The proto-runtime instance then supplies
  10.210 +the rest of the runtime system. 
  10.211 +
   10.212 +The decomposition into a proto-runtime plus plugged-in language behaviors modularizes the construction of runtimes.  The proto-runtime is one module, embodying runtime internals that are hardware oriented and independent of language. The plugged-in portions form the two other modules, which are language specific. The interface between them occurs at a natural boundary, which separates the hardware oriented portion of a runtime from the language oriented portion. 
  10.213 +
  10.214 +We claim the following benefits of the proto-runtime approach, each of which is  supported in the indicated section of  the paper:
  10.215 +
  10.216 +\begin{itemize}
  10.217 +
  10.218 +\item The proto-runtime approach modularizes the runtime (\S\ref{sec:Proposal}).
  10.219 +
  10.220 +%\item The modularization  is consistent with patterns that appear to be fundamental to parallel computation and runtimes (\S\ ). 
  10.221 +
  10.222 +\item The modularization  cleanly separates hardware
  10.223 +related runtime internals from the language-specific logic (\S\ref{sec:Proposal},
  10.224 +\S\ref{subsec:Example}). 
  10.225 +
  10.226 +\item The modularization gives the language control
  10.227 +over timing and placement of executing work (\S\ref{sec:Proposal}).
  10.228 +
  10.229 +
  10.230 +\item
  10.231 +
  10.232 +The modularization  selectively exposes hardware aspects relevant to placement of work. If the language takes advantage of this, it  can result in reduced communication between cores and increased application performance  (\S\ ).
  10.233 +
  10.234 +\begin{itemize}
  10.235 +
  10.236 +\item Similar control over hardware is not possible when the language is   built on top of a package like Posix threads or TBB, which has its own work-to-hardware assignment   (\S\ref{sec:Related}).
  10.237 +
  10.238 +\end{itemize}
  10.239 +
  10.240 +
  10.241 +\item The modularization results in reduced time to implement a new language's behavior, and in reduced time to port a language to new hardware (\S\ref{sec:Proposal},
  10.242 +\S\ref{subsec:ImplTimeMeas}).
  10.243 +
  10.244 +\begin{itemize}
  10.245 +
  10.246 +
  10.247 +\item  Part of the time reduction is due to the proto-runtime providing common services for all languages to (re)use.  Such services include debugging facilities, automated verification, concurrency handling, dynamic performance measurements for use in assignment and auto-tuning, and so on  (\S\ ).
  10.248 +
  10.249 +\item Part  is due to hiding the low
  10.250 +level hardware aspects inside the proto-runtime module,
  10.251 +independent from language (\S \ref{sec:intro}).
  10.252 +
  10.253 +\item Part  is due to  reuse of the effort of performance-tuning  the runtime internals (\S ).  
  10.254 +
  10.255 +\item  Part is due to using sequential thinking when implementing the language logic, enabled by  the proto-runtime protecting shared internal runtime state and exporting an interface that presents a sequential model  (\S\ref{subsec:Example}). 
  10.256 +
  10.257 +
  10.258 +\end{itemize}
  10.259 +
  10.260 +\item Modularization with similar benefits does not appear possible when using a package such as Posix threads or TBB,  unless the package itself is modified and then used  according to the proto-runtime pattern  (\S\ref{sec:Related}).
  10.261 +
  10.262 +
  10.263 +\item The proto-runtime approach appears to future-proof language
  10.264 +runtime
  10.265 +construction,  because the patterns underlying proto-runtime appear to be fundamental (\S\ref{subsec:TiePoints},
  10.266 +\S\ref{subsec:Example}), and so  should hold for future  architectures. Plugins are reused on those, although performance related updates to the
  10.267 +plugins may be desired.
  10.268 +
  10.269 +\end{itemize}
  10.270 +
  10.271 +The paper is organized as follows: We first expand on the value of embedded style parallel DSLs (epDSLs), and where the effort goes when creating one (\S\ref{subsec:eDSLEffort}). We focus on the role that  runtime implementation effort plays in the adoption of epDSLs, which motivates the value of the  savings provided by the proto-runtime approach. We then move on to the details of the proto-runtime approach (\S\ref{sec:Proposal}), and tie them to how a runtime is modularized (\S\ref{subsec:Modules}), covering how each claimed benefit is provided. 
  10.272 +We then show overhead measurements (\S\ref{subsec:OverheadMeas}) and implementation time measurements (\S\ref{subsec:ImplTimeMeas} ), which indicate that the proto-runtime approach is performance competitive while significantly reducing implementation and porting effort.
  10.273 +With that  understanding in hand, we then discuss  how the approach compares to related work (\S\ref{sec:Related}), and finally, we highlight the main conclusions drawn from the research (\S\ref{sec:Conclusion}).
  10.274 +
  10.275 +
  10.276 +
  10.277 +
  10.278 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  10.279 +%
  10.280 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  10.281 +\section{Background: The epDSL Hypothesis}
  10.282 +
  10.283 +%[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
  10.284 +
  10.285 +%[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
  10.286 +
  10.287 +%[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
  10.288 +
  10.289 +%[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
  10.290 +
  10.291 +Domain Specific Languages have been around for a while [], and recently have been suggested as a good approach for parallel programming[][stanford PPL].
  10.292 +
  10.293 +In essence, a DSL, or just Domain Language, captures patterns that are common in a particular domain of expertise, such as user interfaces, simulations of physical systems, bio-informatics,  and so on.  Each domain has a particular set of mental models, common types of computation, and common kinds of data structures. A  DSL captures these common elements in custom syntax.
  10.294 + 
  10.295 +
  10.296 +The custom syntax can capture parallelism information while simultaneously being natural to think about. In practice, multiple aspects of domains provide opportunities for parallelism. For example, the custom data structures seen by the coder can be internally implemented with distributed algorithms; common operations in the domain can be internally implemented with parallel algorithms; and, the domain constructs often imply dependencies. All of these are gained without the programmer being aware of this implied parallelism; they just follow simple language usage rules. 
  10.297 +
  10.298 +
  10.299 +
  10.300 +\subsection{Embedding a DSL into a base language}
  10.301 +
  10.302 +A style of domain language, which we feel has good adoption potential, is the so-called \textit{embedded} style of DSL (eDSL) [] [metaborg][stanford ppl]. In this variation, a program is written in a mix of a base sequential language plus domain language constructs. The syntax of the two is intermixed. A preprocessing step then translates the domain syntax into the base syntax, and includes calls to the domain language's runtime.
  10.303 +
  10.304 +
  10.305 +For example, use C (or Java) as the base language for an application, then mix-in custom syntax  from a user-interface eDSL.  To test the code, the developer modifies the build process to first perform the translation step, then pass the resulting source through the normal  compiler. The resulting executable contains calls to a runtime library that becomes linked, at run time, to an implementation that has been tuned to the hardware.
  10.306 +
   10.307 +As with HWSim, such embedded
   10.308 +constructs tend to be few in number, easy to learn, and to significantly
   10.309 +reduce the complexity of the code written, all while
   10.310 +implicitly specifying parallelism. 
  10.311 +
   10.312 +Additionally, parallel versions, or epDSLs, have more than just a syntactic advantage over libraries.  The language has a toolchain that provides build-time optimization and can take advantage of relationships among distinct constructs within the code.  The relationship information allows derivation of communication patterns that inform the choice of placement of work, which is critical to performance on parallel hardware.
  10.313 +\subsection{Low learning curve, high productivity, and portability}
  10.314 + eDSLs tend to have low learning curve because domain experts are  already familiar with the concepts behind the language constructs, and there are relatively few constructs
  10.315 +for an embedded DSL. This is especially valuable for  those who are \textit{not} expert programmers. Embedded style DSLs further reduce learning curve because they  require no new development tools nor development procedures. Together, these address the goal of  a low learning curve for switching to parallel software development.
  10.316 +
  10.317 +Productivity has been shown to be enhanced by a well designed DSL, with studies  measuring
  10.318 +10x reduction in development time [][][].  Factors
  10.319 +behind this include simplifying the application code, modularizing it, and encapsulating  performance aspects inside the language.  Simplifying reduces the amount of code and the amount of mental effort. Modularizing separates concerns within the code and isolates aspects, which improves productivity. Encapsulating performance inside the DSL constructs removes them from the application programmer's concerns, which also improves productivity.
  10.320 +
  10.321 +Perhaps the most important productivity enhancement comes from hiding parallelism aspects inside the  DSL constructs. The language takes advantage of the domain patterns to present a familiar mental model, and then attaches synchronization, work-division, and communication implications to those constructs, without the programmer having to be aware of them.    Combining the simplicity, modularization, performance encapsulation, and parallelism hiding,  with congruence with the mental model of the domain,  together work towards the goal of high productivity.
  10.322 + 
  10.323 +Portability is aided by the encapsulation of performance aspects inside the DSL constructs. The aspects   that require large amounts of computation are often pulled into the language, so only the language implementation must adapt to new hardware. Although fully achieving such isolation isn't always possible, epDSLs hold promise for making significant strides towards it.
  10.324 +
  10.325 +\subsection{Low disruption and easy adoption} 
  10.326 +
  10.327 +Using an epDSL tends to have low disruption because the base language remains the same, along with most of the development tools and practices.
  10.328 + Constructs from the epDSL can be mixed into existing sequential code, incrementally replacing the high computation sections, while continuing with the same development  practices.
  10.329 + 
  10.330 + \subsection{ Few users means the effort of eDSLs must be low} \label{subsec:eDSLEffort}
  10.331 +
  10.332 +What appears to be holding epDSLs back from widespread
  10.333 +adoption is mainly the time, expertise, and cost to develop an epDSL.  The effort to create a usable epDSL needs to be reduced to the point that it is viable for a user base of only a few hundred.  
  10.334 +
  10.335 +The effort  falls into three categories:
  10.336 +
  10.337 +\begin{enumerate}
  10.338 +\item effort to explore  language design and create the epDSL syntax
  10.339 +\item effort to create the runtime that produces the epDSL behavior
  10.340 +\item effort to performance tune the epDSL on particular hardware
   10.341 +\end{enumerate}    
  10.342 +
  10.343 +
  10.344 +\subsection{The big picture}
  10.345 +
   10.346 +Across the industry as a whole, when epDSLs become successful, there may be thousands of epDSLs, each of
   10.347 +which must be mapped onto hundreds of different hardware platforms.  That multiplicative effect must be reduced in order to make the epDSL approach economically viable.
  10.348 +
  10.349 +The first category of eDSL effort is creating the front-end translation of custom syntax into the base language. This is a one-time effort that does not repeat when new hardware is added. 
  10.350 +
  10.351 +The effort that has to be expended on each platform is the runtime implementation and toolchain optimizations.
  10.352 +Runtime implementation includes hardware-specific low-level tuning and modification of mapping of work onto cores.
  10.353 +
  10.354 +This is where leveraging the proto-runtime approach
  10.355 +pays off. Hardware platforms cluster into groups with similar performance-related features.  Proto-runtime
  10.356 +presents a common abstraction for all hardware
  10.357 +platforms, but a portion of the interface supplies performance related
  10.358 +information specific to the hardware. This portion is  specialized for each
  10.359 +cluster. Examples of clusters include:
  10.360 +
  10.361 +\begin{itemize}
  10.362 +\item single chip shared coherent memory
  10.363 +\item multi-chip shared coherent memory (NUMA)
  10.364 +\item coprocessor with independent address space (GPGPU)
   10.365 +\item a network among nodes of the above categories (Distributed)
   10.366 +\item a hierarchy of sub-networks
  10.367 +\end{itemize}
  10.368 +
  10.369 +
  10.370 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  10.371 +%
  10.372 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  10.373 +\section{Our Proposal} \label{sec:Proposal}
  10.374 +
  10.375 +We propose addressing the runtime effort portion of creating
  10.376 +an epDSL by defining a modularization of runtimes, as seen in Fig. \ref{fig:PR_three_pieces}.  The low-level hardware details are collected into one module, which presents a common interface, called the \textit{proto-runtime
  10.377 +instance}. The language supplies
  10.378 +the top two modules, which plug in via the interface. The hardware specific module  (proto-runtime instance) presents the same interface
  10.379 +for all platforms, with a specialization for each category
   10.380 +of platform sharing similar performance related features.  The proto-runtime module only has to be implemented once for a given platform, and is then reused by all the languages.  
  10.381 +
  10.382 +\begin{figure}[ht]
  10.383 +  \centering
  10.384 +  \includegraphics[width = 1.5in, height = 1.1in]{../figures/proto-runtime__modules.pdf}
  10.385 +  \caption{Shows how the proto-runtime approach modularizes the implementation of a runtime. The three pieces are the proto-runtime implementation, an implementation of the language construct behaviors, and an implementation of the portion of  scheduling that chooses which work is assigned to which processor. }
  10.386 +  \label{fig:PR_three_pieces}
  10.387 +\end{figure}
  10.388 +
  10.389 +
   10.390 +Because of the modularization, a language requires much less implementation effort: it is implemented just once for each category of platform.
  10.391 +
  10.392 +The higher level of abstraction simplifies the task for the language implementer.
   10.393 +The language implementation doesn't deal with the low-level details of making the runtime itself run fast; it only has to consider the hardware features exposed by the interface. 
  10.394 +
  10.395 +One additional benefit is that the assignment module
  10.396 +gives control to the language, to choose when and where it wishes work to execute.
  10.397 +This  simplifies implementation of language  features related to scheduling behavior.
  10.398 +It also enables the language implementor to use sophisticated
  10.399 +methods for choosing placement of work, which can significantly impact
  10.400 +application performance.  
  10.401 +
   10.402 +In this paper, we present work that applies to coherent
   10.403 +shared-memory machines, both single chip and multiple chip. Extensions beyond this, addressing multiple-address-space machines and hierarchical
   10.404 +heterogeneous collections of processors, are currently in progress and will appear in future papers.
  10.405 +
  10.406 +\subsection{Breakdown of the modules} \label{subsec:Modules}
  10.407 +
   10.408 +The language implementation is broken into two parts, as seen in Fig.
   10.409 +\ref{fig:langBreakdown}. One is a thin wrapper library that
   10.410 +invokes the runtime; the other is a set of modules that become part of that invoked runtime, called
   10.411 +the \textit{language plugin}, or just plugin. 
  10.412 +
  10.413 +
  10.414 +\begin{figure}[ht]
  10.415 +  \centering
  10.416 +  \includegraphics[width = 2.8in, height = 1.1in]{../figures/proto-runtime__modules_lang_breakdown.pdf}
   10.417 +  \caption{Shows how the code of the language implementation
   10.418 +  is broken into two pieces: a thin wrapper
   10.419 +  that invokes the runtime, and a dynamic
   10.420 +  library that plugs into the runtime.}
  10.421 +  \label{fig:langBreakdown}
  10.422 +\end{figure}
  10.423 +  
  10.424 +
  10.425 +
   10.426 +Thus, an unchanging application executable is able to invoke hardware-specific plugin code, which changes between machines. The plugin packages the two language modules into a dynamic library. The library is implemented, compiled, distributed, and installed separately from applications.  The application executable contains only the symbols of plugin functions, and during a run those are dynamically linked to machine-specific implementations.
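As an illustration only (the names here are our invention, not the actual proto-runtime API), the effect of this late binding can be sketched with a registry that resolves a construct symbol to a machine-specific implementation at run time:

```python
# Sketch of run-time binding of plugin functions (hypothetical names, not
# the actual proto-runtime API). The "application executable" names symbols
# only; the implementation behind each symbol is chosen at run time.

# Each "dynamic library" registers machine-specific implementations.
PLUGIN_REGISTRY = {
    "multicore_x86": {"mutex_acquire": lambda m: f"x86 acquire of {m}"},
    "multicore_arm": {"mutex_acquire": lambda m: f"arm acquire of {m}"},
}

def link(platform_category):
    """Stand-in for the dynamic linker: resolve symbols for this machine."""
    return PLUGIN_REGISTRY[platform_category]

def application(plugin):
    # The unchanging application: it refers to the symbol name, nothing more.
    return plugin["mutex_acquire"]("M")

result_x86 = application(link("multicore_x86"))
result_arm = application(link("multicore_arm"))
```

The same application function runs unmodified against either "library"; only the linking step differs per machine, mirroring how the plugin isolates the executable from machine specifics.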
  10.427 +
  10.428 +
   10.429 +In order to provide such modularization, we rely upon a model for specifying synchronization constructs that we call the tie-point model. The low-level nature of a tie-point places it below the level of language constructs,
   10.430 +even one as simple as a mutex. Instead, a mutex is specified in terms
   10.431 +of the primitives in the tie-point model. In turn,
   10.432 +the tie-point primitives are implemented
   10.433 +by the proto-runtime.
   10.434 +
   10.435 + This places all parallel constructs on the same level of the software stack, whether complex, like the AND-OR parallelism of Prolog or the wild-card-matching
   10.436 +channels of coordination languages, or ultra-simple, like acquire and release on a mutex. All are implemented in terms of the same tie-point primitives provided by the proto-runtime instance.
  10.437 +
   10.438 +We have now reached a point in the paper where the order of explanation can take one of two paths: either
   10.439 +start with the abstract model of tie-points and explain how it shapes the modularization of the runtime, or start with implementation details and work upward toward the abstract model.  We have chosen to start with the abstract tie-point model, but the reader is invited to skip to the section after it, which starts with code examples and ties the code details back to the abstract tie-point model.
  10.440 +
  10.441 +
  10.442 +
   10.443 +\section{The Tie-Point Model}\label{subsec:TiePoints}
  10.444 +
  10.445 +
   10.446 +\subsection{Timelines}
   10.447 +A tie-point relates timelines, so we first say a bit about timelines. The timeline is the common element of parallelism: any parallel language involves a number of independent timelines, and it controls which timelines are actively progressing relative to the others.
  10.448 +
   10.449 +For example, take a thread library, which we consider
   10.450 +a parallel language.  It provides a command to create a thread, and that thread represents an independent timeline. The library also provides the mutex acquire and release commands, which control how those timelines advance relative to each other. When an acquire executes, it can cause the thread to block, which means the associated timeline suspends: it stops
   10.451 +making forward progress. A release in a different thread clears the block, which resumes the timeline. That linkage between suspend and resume of different timelines is the control the language exerts over which timelines are actively progressing.
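As a concrete illustration, the suspend-and-resume linkage can be reproduced with any thread library; here is a minimal sketch using Python's threading module purely as a stand-in:

```python
import threading

events = []               # records the interleaving of the two timelines
m = threading.Lock()
m.acquire()               # mutex M starts out held, so A's acquire will block
a_requested = threading.Event()

def timeline_a():
    events.append("A: executing acquire")
    a_requested.set()
    m.acquire()           # timeline A suspends here until B's release
    events.append("A: resumed")
    m.release()

a = threading.Thread(target=timeline_a)
a.start()
a_requested.wait()        # ensure A has at least requested the mutex
events.append("B: executing release")
m.release()               # the release in timeline B clears A's block
a.join()
```

The recorded order is always acquire-requested, release, resumed: timeline A's resume is caused by, and so ordered after, timeline B's release.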
  10.452 +
   10.453 +To build up to tie-points, we examine the nature of points on
   10.454 +a single timeline, by reviewing mutex behavior in detail. See the timeline shown in Fig. \ref{fig:singleTimeline}.  Thread A, which is timeline A, tries to acquire the mutex M
   10.455 +by executing the acquire command. Timeline A stops at point 1.S; then something external to it happens, and the timeline starts again at point 1.R.  The gap between the two is not seen by the code executing within the thread.  Rather, from the code-execution viewpoint, the acquire command is a single command, and hence the gap between 1.S and 1.R collapses to a single point on the timeline.
  10.456 +
  10.457 +
  10.458 +\begin{figure}[ht]
  10.459 +  \centering
  10.460 +  \includegraphics[width = 2.8in, height = 0.8in]
  10.461 +  {../figures/PR__timeline_single.pdf}
  10.462 +  \caption{The timeline suspends at 1.S and resumes
  10.463 +  at 1.R. From the viewpoint of the timeline, the gap collapses into a single point.}
  10.464 +  \label{fig:singleTimeline}
  10.465 +\end{figure}
  10.466 +
  10.467 +
   10.468 + Fig. \ref{fig:dualTimeline}  shows two timelines: timeline A executing acquire and timeline B executing release. The release also suspends its timeline, but
   10.469 +that timeline quickly resumes because it is not blocked.
   10.470 +The release causes timeline A to resume as well. The fact
   10.471 +of the release on one timeline has caused the end of the acquire on the other. This makes
   10.472 +the two collapsed points become what we term \textit{tied together} into a \textit{tie-point}.
  10.473 +
  10.474 +\begin{figure}[ht]
  10.475 +  \centering
  10.476 +  \includegraphics[width = 2.8in, height = 1.2in]
  10.477 +  {../figures/PR__timeline_dual.pdf}
  10.478 +  \caption{Two  timelines with tied together ``collapsed''
  10.479 +points.
  10.480 +Point 1 on timeline A forms a tie-point with point
  10.481 +2 on timeline B.
  10.482 +It is hidden activity that takes place inside the gaps that
  10.483 +establishes a causal relationship that ties them together.}
  10.484 +  \label{fig:dualTimeline}
  10.485 +\end{figure}
  10.486 +
   10.487 +Fig. \ref{fig:dualTimelineWHidden} adds detail about
   10.488 +how the release goes about causing the end of the block
   10.489 +on the acquire. It reveals
   10.490 +a hidden timeline, which is what performs the behavior of the
   10.491 +acquire and release constructs.  As seen, acquire starts
   10.492 +with a suspend, which is accompanied by a communication
   10.493 +sent to the hidden timeline.  The hidden timeline then
   10.494 +checks whether the mutex is free, sees that it is not,
   10.495 +and leaves timeline A suspended. Later, timeline
   10.496 +B performs release, which suspends it and sends a communication
   10.497 +to the same hidden timeline. The hidden timeline sees that timeline
   10.498 +A is waiting for the release and performs a special
   10.499 +control action that resumes timeline A, followed by
   10.500 +the same control action to resume timeline B.
   10.501 + It is inside the hidden timeline that the acquire
   10.502 +gets linked to the release, tying the constructs together.   
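This division of labor can be mimicked in miniature with coroutines standing in for timelines (an illustrative sketch only; the names and scheduling policy are our invention, not the proto-runtime's). Application timelines suspend by yielding a request; a hidden runtime loop runs the mutex behavior and decides which timelines to resume:

```python
from collections import deque

def timeline(name, log):
    # An application-visible timeline: it suspends at each yield, sending
    # its request to the hidden timeline, and runs again when resumed below.
    log.append(f"{name}: before acquire")
    yield ("acquire", "M")
    log.append(f"{name}: in critical section")
    yield ("release", "M")
    log.append(f"{name}: after release")

def hidden_timeline(timelines, log):
    # The hidden timeline: receives suspend requests, performs the construct
    # behavior (here, mutex semantics), and resumes application timelines.
    ready = deque(timelines)
    owner = None             # current holder of mutex M
    waiting = deque()        # timelines left suspended on M
    while ready:
        t = ready.popleft()
        try:
            op, _m = next(t)          # run t until its next suspend
        except StopIteration:
            continue                  # t's timeline has ended
        if op == "acquire":
            if owner is None:
                owner = t
                ready.append(t)       # mutex free: resume t at once
            else:
                waiting.append(t)     # mutex held: leave t suspended
        elif op == "release":
            owner = None
            ready.append(t)           # the releaser quickly resumes
            if waiting:
                owner = waiting.popleft()
                ready.append(owner)   # the release resumes the blocked acquirer

log = []
hidden_timeline([timeline("A", log), timeline("B", log)], log)
```

B's acquire arrives while A holds the mutex, so B stays suspended; A's release, arriving on the same hidden timeline, is what resumes B, linking the two constructs exactly as in the figure.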
  10.503 +
  10.504 +
  10.505 +\begin{figure}[ht]
  10.506 +  \centering
  10.507 +  \includegraphics[width = 2.8in, height = 1.9in]
  10.508 +  {../figures/PR__timeline_dual_w_hidden.pdf}
  10.509 +  \caption{Two  timelines with tied together ``collapsed''
  10.510 +points  showing the detail of a hidden timeline that
  10.511 +performs the behavior that ties the points together.
  10.512 +Vertical dashed lines represent communication sent
  10.513 +as part of the suspend action, and the curvy arrows
  10.514 +represent special control that causes resume of the
  10.515 +target timelines. During the gaps in timelines A and
  10.516 +B, activity takes place in the hidden timeline, which
  10.517 +calculates that the timelines should be resumed, then
  10.518 +exercises control to make resume happen.}
  10.519 +  \label{fig:dualTimelineWHidden}
  10.520 +\end{figure}
  10.521 +
  10.522 +
  10.523 +
   10.524 +We show in \S\ref{sec:FormalTiePoint} that the pattern
   10.525 +of communications to and from the hidden timeline establishes
   10.526 +an ordering relationship between events before and
   10.527 +after the tied points. That ordering, in turn, implies a relation on
   10.528 +the visibility of events. 
  10.529 +
   10.530 +Fig. \ref{fig:tie-pointGuarantees} shows the ordering relationship and the implied visibility of operations between
   10.531 +the timelines. Operations that execute in
   10.532 +the first timeline before the tie-point are visible
   10.533 +in the second after the tie-point, and vice versa. Likewise, operations that execute in one timeline after the tie-point are not visible in the other timeline before the tie-point. Such an ordering satisfies
   10.534 +the requirements
   10.535 +of a synchronization construct. 
  10.536 +
  10.537 +
  10.538 +
  10.539 +\begin{figure}[ht]
  10.540 +  \centering
  10.541 +  \includegraphics[width = 2.8in, height = 1.25in]
  10.542 +  {../figures/PR__timeline_tie_point_ordering.pdf}
  10.543 +  \caption{The
  10.544 +visibility guarantees that result from a tie-point. Shows which
  10.545 + operations, such as writes,  performed on one timeline can be seen by the other
  10.546 +timeline. These visibilities are equivalent to establishing
  10.547 +an order between events before the tied points versus those after the tied
  10.548 +points.  Both timelines agree on what events are before
  10.549 +versus after the tied point.  }
  10.550 +  \label{fig:tie-pointGuarantees}
  10.551 +\end{figure}
  10.552 +
  10.553 +
  10.554 +\subsection{Formal definition of tie-point} \label{sec:FormalTiePoint}
   10.555 +In a moment we will show how any synchronization construct
   10.556 +can be defined in terms of tie-points. Before getting
   10.557 +there, we must choose a definition of synchronization
   10.558 +construct, a choice that is unavoidably arguable. We then provide a formal definition of tie-point
   10.559 +and use it to show that a tie-point
   10.560 +satisfies the conditions of any
   10.561 +such synchronization
   10.562 +construct.
  10.563 + 
   10.564 +Our formalism defines timelines, communication between
   10.565 +timelines, and suspend and resume of a timeline. It then presents the characteristic pattern that defines a tie-point, and shows that when that pattern exists, relations with certain properties exist between the timelines.
   10.566 +We conclude by giving a few classical definitions
   10.567 +of synchronization and showing that those definitions
   10.568 +are upheld when the tie-point pattern is present. Hence, those classical definitions can be satisfied via creation of a tie-point. 
  10.569 +
   10.570 +\subsubsection{Definitions}
  10.571 +
  10.572 +\begin{description}
  10.573 +\item[timeline:]
   10.574 +\(T = (E, <)\), with \(E = \{e_\alpha \mid \alpha\in\mathbb{N}\}\).  A timeline is an ordered
   10.575 +sequence of events. Given two events $e_\alpha, e_\beta \in E$ from a timeline, the events are ordered by their
   10.576 +subscripts, so: $e_\alpha < e_\beta$ iff $\alpha < \beta$,
   10.577 +and vice versa. 
   10.578 + Any and all memory locations in a system are part
   10.579 + of, or local to, exactly one timeline.  Only that
   10.580 +timeline can modify those locations (hence, supporting side-effects requires that shared memory have its own timeline,
   10.581 +separate
   10.582 +from any timeline in which code executes).  
  10.583 +
  10.584 +\item[event:] 
  10.585 +\(E =\{c_{0,t},c_{1,t}, ..\} \cup \{s_{n,\alpha ,t}\} \cup \{r_{n,\beta , t}\}
  10.586 +\cup \{z_{\gamma ,t} \} \). There are four kinds of event
  10.587 +that can happen on a timeline, namely $c$, a step of computation,
  10.588 +which modifies the memory local to the timeline; $s$, a
  10.589 +send of a communication which pushes out contents from
  10.590 +the timeline's local memory; $r$, a receive of a communication
  10.591 +which modifies the timeline's local memory; and $z$,
  10.592 +a synchronization
  10.593 +construct which suspends then resumes the timeline in such a way
  10.594 +as to establish a relation between events on this timeline
  10.595 +versus events on a remote timeline. Suspend is denoted
  10.596 +$z\_s_{\gamma ,t}$ while resume is denoted $z\_r_{\gamma
  10.597 +,t}$ where $s$
  10.598 +and $r$ are literal while $\gamma$ denotes the position
  10.599 +on the timeline and $t$ is the timeline that executes
  10.600 +the synchronization construct. 
  10.601 +\item[communication:]
  10.602 +\(C = \{s,r\}, s < r\).  A communication is a set of
  10.603 +one send event from one timeline plus one or more receive events
  10.604 +from different timelines, with the send
  10.605 +event ordered before the receive event(s), denoted $s_{n,\alpha, t}\mapsto
  10.606 +r_{n,\beta,t}$ where $n$ distinguishes the communication
  10.607 +set, $\alpha$ and $\beta$ are the ordering upon the
  10.608 +timeline and $t$ denotes the timeline the event is on.  A communication
  10.609 +orders events on one timeline relative to events on another.
  10.610 +However, the ordering is only between two points. In
  10.611 +particular for two sends from timeline 1 to timeline
  10.612 +2, if \(s_{1,\_,1} < s_{2,\_,1}\) on timeline 1, then on
  10.613 +timeline 2, both \(r_{1,\_,2} < r_{2,\_,2}\) and \(r_{2,\_,2} < r_{1,\_,2}\) are valid, where ``$\_$'' in the position
  10.614 +of the ordering integer represents a wild
  10.615 +card. However, $s_{1,\_,1} \mapsto r_{1,\_,2}$
  10.616 +followed by $s_{2,\_,2} \mapsto r_{2,\_,1}$ where $r_{1,\_,2}
  10.617 +< s_{2,\_,2}$
  10.618 +  implies that $s_{1,\_,1} < r_{2,\_,1}$ always.  
  10.619 +
   10.620 +\item[hidden timeline:] We define a special kind of ``hidden'' timeline that is not
   10.621 +seen by application code. It has an additional
   10.622 +kind of event available, which ends a synchronization
   10.623 +event on a different timeline.
   10.624 + We denote this $fro_{\delta,h}$ where $fro$ is literal,
   10.625 + standing for ``force resume other (timeline)'', $\delta$ is the position
   10.626 + on the timeline, and $h$ is the (hidden) timeline the
   10.627 +event is on. Additionally, a suspend event on an application-visible
   10.628 +timeline implies a send from that timeline
   10.629 +to a hidden timeline. Hence $z\_s_{\gamma,t} \Rightarrow
   10.630 +s_{n,\gamma,t} \mapsto r_{n,\_,h}$  
  10.631 +
   10.632 +\item[tie-point:] Finally, we define a tie-point as a set of two or more
   10.633 +synchronization points from different timelines which
   10.634 +are related by a particular pattern of communications,
   10.635 +and which, as a result of the pattern, satisfies particular ordering criteria. The pattern is that the communications from the suspend synchronization events must converge on a common hidden timeline, and that timeline must then emit a subsequent resume event for each of the suspended timelines,
   10.636 +as shown back in Fig. \ref{fig:dualTimelineWHidden}. 
  10.637 +
  10.638 +\end{description}
  10.639 +
   10.640 +We now show that the following holds as a consequence of these definitions:
   10.641 +[math here] which says that any event that comes after a tie-point on one timeline is ordered after any event on a different timeline that precedes the tie-point on that timeline (note that the same tie-point is common to both timelines).  The dual also holds.
  10.642 +
   10.643 +We take the event immediately preceding and the event
   10.644 +immediately following two synchronization events on
   10.645 +two timelines.  Each synchronization event begins with
   10.646 +a suspend half-event and ends with a resume half-event.
   10.647 +The suspend half-event is accompanied by a send to
   10.648 +a hidden timeline.  That hidden timeline has a receive for it,
   10.649 +and later in its sequence it has a receive for the
   10.650 +synchronization event from the second timeline. The
   10.651 +hidden timeline then performs the resume of both timelines.
  10.652 +
  10.653 +From that, we get the following relations:
  10.654 +
   10.655 +These relations show that the event following the synchronization on timeline 1 comes after the event preceding it on timeline 2, and vice versa.
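In the notation of the definitions above, the chain of orderings can be sketched as follows (our reconstruction of the argument, assuming timeline 1's suspend reaches the hidden timeline \(h\) first):

```latex
\begin{align*}
&c_{\gamma-1,1} < z\_s_{\gamma,1}, \qquad
  z\_s_{\gamma,1} \Rightarrow s_{1,\gamma,1} \mapsto r_{1,\delta_1,h}\\
&c_{\gamma'-1,2} < z\_s_{\gamma',2}, \qquad
  z\_s_{\gamma',2} \Rightarrow s_{2,\gamma',2} \mapsto r_{2,\delta_2,h}\\
&\text{on } h:\quad r_{1,\delta_1,h} < r_{2,\delta_2,h}
  < fro_{\delta_3,h} < fro_{\delta_4,h}\\
&fro_{\delta_3,h} \mapsto z\_r_{\gamma,1}, \qquad
  fro_{\delta_4,h} \mapsto z\_r_{\gamma',2}\\
&\therefore\quad c_{\gamma-1,1} < s_{1,\gamma,1} < r_{1,\delta_1,h}
  < fro_{\delta_4,h} < z\_r_{\gamma',2} < c_{\gamma'+1,2}
\end{align*}
```

The symmetric chain, starting from \(c_{\gamma'-1,2}\), gives the dual ordering.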
  10.656 +
  10.657 +This property of ordering events on two timelines in this way is the key requirement for several classical definitions of synchronization.  Hence, any implementation that exhibits this pattern of synchronization communications converging on a common hidden timeline, which subsequently resumes the synchronizations, in turn satisfies the conditions for a synchronization.
  10.658 +
  10.659 +\subsubsection{What is different about tie-point?}
   10.660 +Many readers will be wondering: ``how is implementing
   10.661 +a synchronization construct this way any different
   10.662 +from how such constructs are currently implemented?''  The answer
   10.663 +is that currently, synchronization constructs are
   10.664 +implemented on top of other synchronization constructs,
   10.665 +where we consider an atomic Compare-and-Swap instruction
   10.666 +to be a synchronization construct.  It is only in the
   10.667 +hardware that a synchronization construct is assembled
   10.668 +from pieces.  We further claim that the hardware implements it
   10.669 +according to the tie-point pattern described in our formal definition.
  10.670 +
   10.671 +What we consider to be a tie-point is any point that
   10.672 +exhibits this pattern, independent of the semantics added.
   10.673 +For example, for the Compare-and-Swap (CAS) instruction,
   10.674 +the comparison and the swap are the semantics of what the
   10.675 +instruction does, while the atomicity, or exclusive
   10.676 +access, is the part that provides the ordering relations.
   10.677 +So the presence of the ordering relations is the tie-point
   10.678 +portion, while the comparison and swap are the plugged-in
   10.679 +semantics portion associated with the tie-point.
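The separation can be illustrated with a toy decomposition (illustrative Python, not how hardware CAS is built): a generic ``tie-point portion'' that provides only exclusive access and ordering, into which the comparison-and-swap semantics are plugged:

```python
import threading

_tie_point = threading.Lock()   # stands in for the ordering/atomicity portion

def with_tie_point(semantics, *args):
    # Tie-point portion: provides only exclusivity and ordering;
    # knows nothing about what the semantics function does.
    with _tie_point:
        return semantics(*args)

def cas_semantics(cell, expected, new):
    # Plugged-in portion: the comparison and the swap themselves.
    old = cell[0]
    if old == expected:
        cell[0] = new
    return old

cell = [0]
first = with_tie_point(cas_semantics, cell, 0, 1)    # succeeds: 0 -> 1
second = with_tie_point(cas_semantics, cell, 0, 2)   # fails: cell holds 1, not 0
```

Swapping in a different semantics function (say, fetch-and-add) changes the construct while reusing the same tie-point portion unchanged.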
  10.680 +
   10.681 +In that way, a tie-point can be considered to simply
   10.682 +mean ``has the ordering relation of a synchronization
   10.683 +construct''. Viewed that way, a tie-point is nothing new.  However, a tie-point is not a given; rather, it
   10.684 +has to be constructed.  To get a tie-point, one must
   10.685 +create a construction from which the givens of a synchronization
   10.686 +can be derived.  Further, tie-points can be constructed
   10.687 +for things that most would not readily consider a synchronization
   10.688 +construct.  For example, any asynchronous communication
   10.689 +establishes a half tie-point, because an ordering can
   10.690 +be derived from it.  This is useful, for example, in defining
   10.691 +memory consistency models.
  10.692 +
   10.693 +The key here is the set of elements of the model within which
   10.694 +tie-point is defined.  In particular: memory does not
   10.695 +exist outside a timeline; the points on a timeline
   10.696 +have no ordering relative to points on another timeline;
   10.697 +ordering between timelines is established only by communication; and timelines can suspend themselves
   10.698 +(or be suspended by a different timeline)
   10.699 +and be resumed by a different timeline.
  10.700 +
   10.701 +Within this model, the characteristics of a synchronization
   10.702 +can be derived.  That is the key difference: usually
   10.703 +one states as a \textit{given} that a construct exists that has the synchronization properties.  A tie-point
   10.704 +is derived, whereas a synchronization is given.
  10.705 +
   10.706 +Granted, the two are equally powerful from a theory standpoint: the proto-runtime can implement synchronization constructs, and synchronization constructs can implement other synchronization constructs.
   10.707 +
   10.708 +The difference is that a tie-point is more low level, with less layered on top, giving more efficiency and
   10.709 +more control.  With a given synchronization mechanism such as threads, the mechanism has its own
   10.710 +scheduler, and the language has no control over where and when work
   10.711 +happens.
   10.712 +
   10.713 +It is also different in that it directly provides only half
   10.714 +the behavior: the time half.
  10.720 +
   10.721 +But synchronization constructs CANNOT implement all of the proto-runtime!  They cannot do the communications, nor the hidden timeline, nor create VPs,
   10.722 +nor the scheduling.  Also, the proto-runtime can do distributed-memory
   10.723 +things that synchronization constructs cannot.
   10.724 +
   10.725 +Synchronization constructs can be used together with shared-memory
   10.726 +communication to make more complex
   10.727 +synchronization constructs, but they cannot be used in a distributed-memory
   10.728 +system to make distributed-memory things.
  10.729 +
   10.730 +Unless, that is, one uses communication to implement shared memory
   10.731 +on top of distributed memory, and things like that.  It is
   10.732 +a question of what is fair game in the comparison.
   10.733 +With the proto-runtime, the behavior is in the hidden timeline,
   10.734 +which is ``inside'' the construct, in a sense.  But when synchronization constructs are used to implement others, that
   10.735 +``inside'' notion is lost: the implementation just becomes application
   10.736 +code that uses synchronization constructs, with the application code
   10.737 +running in an application timeline.  So we need to
   10.738 +get at the notion of the animator, which has the ``hidden''
   10.739 +timeline, versus a function call.
  10.740 +
   10.741 +What about this: it is a matter of constructing from
   10.742 +equally powerful pieces versus from less powerful ones.  We still want
   10.743 +the notion of the animator in there, and want to get
   10.744 +at when an arrangement qualifies as having ``switched
   10.745 +over to the animator''.  Does implementing a mutex from
   10.746 +just memory operations qualify as switching over to the animator
   10.747 +merely by entering the code that implements the mutex?
   10.748 +Say that code is placed in-line in the application code
   10.749 +everywhere it is used.
  10.750 +
   10.751 +Alternatively, one could use the relation model to show that the
   10.752 +pure memory-based implementation contains a tie-point,
   10.753 +which is how the more primitive operations are able
   10.754 +to construct the more powerful mutex.  That might
   10.755 +be a more fruitful approach, and one easier to gain acceptance for:
   10.756 +show that things that have no time-related semantics,
   10.757 +only simple one-way communication, are able to construct
   10.758 +the time-related semantics, and that it is the presence
   10.759 +of the tie-point convergence pattern that does it.
  10.760 +
   10.761 +In fact, one might take Dijkstra's original memory-only mutex
   10.762 +and show the tie-point pattern
   10.763 +within it, and then also show the tie-point pattern within lock-free implementations.  The point is that all
   10.764 +one has to show is the presence of the tie-point pattern
   10.765 +in order to prove synchronization properties, where
   10.766 +``synchronization properties'' means the existence of the ordering relation, which is equivalent to agreement on before versus after, which is equivalent to the visibility
   10.767 +relation.  The visibility is what a programmer cares about;
   10.768 +the visibility is what a programmer requires of ``mutual
   10.769 +exclusion''.
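As a sketch of that program of work, Peterson's classic memory-only mutex (a later algorithm in the same family as Dijkstra's, used here purely as an illustration) builds mutual exclusion from nothing but reads and writes of shared flags; the busy-wait loop is where the timeline's suspended ``gap'' lives. Python threads stand in for timelines; CPython's interpreter supplies the sequentially consistent memory the algorithm assumes:

```python
import threading

flag = [False, False]   # flag[i]: timeline i wants the critical section
turn = [0]              # tie-breaker when both want it
counter = [0]

def lock(i):
    j = 1 - i
    flag[i] = True      # announce intent: a one-way communication via memory
    turn[0] = j         # yield the tie-break to the other timeline
    while flag[j] and turn[0] == j:
        pass            # busy-wait: the suspended "gap" of this timeline

def unlock(i):
    flag[i] = False

def worker(i):
    for _ in range(1000):
        lock(i)
        counter[0] += 1   # critical section: increments must not interleave
        unlock(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

No primitive here has synchronization semantics by itself, yet the final count is exact, which is the claimed emergence of the ordering relation from simpler pieces.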
  10.770 +
   10.771 +This visibility guarantee is how it can be guaranteed that
   10.772 +those that are still ``before'' the mutex cannot influence
   10.773 +the one ``after'' the mutex, which is inside the critical section.  The converse is also required:
   10.774 +the one ``after'' the mutex, inside the critical
   10.775 +section, cannot take actions
   10.776 +that influence any ``before'' it.  Similarly, at the
   10.777 +end of the critical section, the same isolation is needed.
  10.778 +  
  10.779 +
   10.780 +To recap, the relation model says that something
   10.781 +with synchronization constraints can be created from
   10.782 +just communication plus a hidden timeline, as long
   10.783 +as the communications converge on that hidden timeline.
  10.784 +
   10.785 +What Henning was saying is that synchronization is defined by
   10.786 +its end-constraints.  The end-constraints ARE what
   10.787 +a synchronization construct is.  It does not matter
   10.788 +how one is implemented; only the end-constraints matter.
   10.789 +
   10.790 +So, what the relation model showed was how to construct
   10.791 +a synchronization.  What needs to be shown is that the relation
   10.792 +model can also construct things that cannot be constructed
   10.793 +with a synchronization construct.
  10.794 +
   10.795 +The question, then: if one starts with a
   10.796 +synchronization construct existing within a distributed
   10.797 +system, then one can construct other synchronization constructs
   10.798 +from that one.
   10.799 +
   10.800 +For them, the question of ``more primitive'' is: can the more primitive
   10.801 +thing do things the ``full'' one cannot?
  10.802 +
   10.803 +For me, the question of ``more primitive'' is: can one
   10.804 +of them be constructed from the other, which ONLY
   10.805 +has simpler pieces?  Constructing one from itself says nothing,
   10.806 +but being able to construct one from something that
   10.807 +is NOT one, whose individual components are all less
   10.808 +powerful, means
   10.809 +it is a particular combination that brings the extra
   10.810 +time-related behavior of a synchronization construct into existence.
   10.811 +It is recognizing the particular pattern that brings
   10.812 +that extra into existence that is of value.
   10.813 +
   10.814 +It is that pattern that tells you how to get one from
   10.815 +simpler pieces.
  10.816 +
   10.817 +So, the story is: using only pieces that lack the ``special''
   10.818 +synchronization-construct property, construct something
   10.819 +that does have the synchronization property.  That
   10.820 +is building something more powerful from pieces that
   10.821 +are less powerful.
  10.822 +
   10.823 +The other part of the story is that the proto-runtime cannot
   10.824 +be used by itself.  It requires additions before it
   10.825 +can be used.  That is, one has to add the $M\mapsto M$ to arrive
   10.826 +at the $T\times M\mapsto M$; then the $T\times
   10.827 +M\mapsto M$ can be used, but the
   10.828 +$T\times$ cannot be used by itself -- that is nonsensical.
   10.829 +So the proto-runtime provides an $(M\mapsto M, f)$ that is used to get the $T\times M\mapsto M$,
   10.830 +but the $f$ cannot be used inside an application: it does
   10.831 +nothing other than add the $T\times$, so it
   10.832 +accomplishes no steps of computation, nor does it provide
   10.833 +$T\times$ to any application code.  The $(M\mapsto M, f)$ is outside
   10.834 +of any language -- it is what CREATES a language.
   10.835 +
   10.836 +*****One cannot define $(M\mapsto M, f)$ as part of its own language,
   10.837 +because it does not do anything: no computation is
   10.838 +performed by it.****  (So, what is the definition of
   10.839 +computation, then?)
  10.840 +
   10.841 +Another part of the story is the HWSim time behavior
   10.842 +-- those are not synchronization constructs; rather, they are a
   10.843 +particular set of constraints on time, constructed
   10.844 +out of primitives none of which has synchronization or time
   10.845 +behavior by itself, beyond the ``comes after'' of communication.
  10.846 +
   10.847 +Another part of the story is the singleton, constructed
   10.848 +directly.  Question: can that be built from synchronization constructs
   10.849 +in a distributed system?  Does using synchronization constructs
   10.850 +do something that using the primitives does not?  Does it
   10.851 +add something, fundamentally?  Well, it is in terms
   10.852 +of something that already has the property being constructed;
   10.853 +that is the issue.  In one case, one takes something that
   10.854 +has the property and builds something else that has
   10.855 +it; in the other case, one takes something that does not and
   10.856 +builds something that does.
  10.857 +
   10.858 +So, in the consistency model, we use just the comes-after
   10.859 +property of communication to derive compound communication,
   10.860 +from a particular write to a particular read, via memory
   10.861 +locations.
   10.862 +
   10.863 +What, then, is a tie-point in that consistency model?  It is the pattern that allows deriving an ordering between different computation timelines.  There, the
   10.864 +tie-point tied a write on one timeline to a read on the
   10.865 +other, thereby establishing a half-ordering between
   10.866 +the two timelines.
  10.867 +
   10.868 +So that should be it: a chain of communications results in an ordering between its end-points, and a synchronization is nothing more than two communication chains that are tied together, where the tie equals the chains SHARING one link on some intermediate timeline.
  10.869 +
   10.870 +Now consider mutex acquire and release:
   10.871 +the release is asynchronous -- the sending timeline resumes before
   10.872 +the hidden timeline receives notice.  But that just
   10.873 +establishes a half tie-point, no?
  10.874 +
   10.875 +In the asynchronous case, operations after the construct on one timeline can be seen BEFORE the construct by the other timeline.  So that is a half tie-point.  A full tie-point means that nothing after the tied point in either timeline can be seen before it by the other.
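A minimal illustration of the guaranteed half (Python used as a stand-in, with queue.Queue providing the one-way communication):

```python
import threading
import queue

shared = {}
chan = queue.Queue()
seen = []

def sender():
    shared["x"] = 1        # operation before the send
    chan.put("token")      # asynchronous send: the sender does not suspend
    shared["y"] = 2        # operation after the send: no guarantee about
                           # when the receiver observes this one

def receiver():
    chan.get()             # receive: the half tie-point
    seen.append(shared["x"])   # guaranteed visible: written before the send

s = threading.Thread(target=sender)
r = threading.Thread(target=receiver)
r.start(); s.start()
s.join(); r.join()
```

Writes before the send are always visible after the receive, but nothing prevents `shared["y"]` from also being visible early, which is exactly the missing half of the tie-point.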
  10.876 +
   10.877 +So one distinction is this: a half tie-point
   10.878 +cannot be created using synchronization constructs ``directly''.
   10.879 + A synchronization construct is a full tie-point.
  10.880 +
  10.881 +
  10.882 +================================================
  10.883 +
  10.884 +
  10.885 +\subsubsection{Lifeline, Timeline, and Projection}
  10.886 +We define a formal entity that we call a lifeline,
  10.887 +where a timeline is a type of lifeline.
  10.888 +We define event-types and specific occurrences of event-types, and show how multiple lifelines can observe the same occurrence. A projection between
  10.889 +lifelines is defined as an event initiated upon one lifeline being observed on a different lifeline.  The projection is from initiator to observer.
  10.890 +
  10.891 +\begin{description}
  10.892 +\item[event:] 
  10.893 +\(E   \) represents an event, which is something that
  10.894 +can be initiated or observed. 
  10.895 +\item[occurrence:]
   10.896 +\(O \subseteq E \times\mathbb{N}\)  is the set of occurrences, where each occurrence associates a specific event with a unique identifier. A particular occurrence is denoted by subscripting with the value of the associated
   10.897 +integer, for example: \(O_{7}\) 
  10.898 +
  10.899 +\item[clock:]
   10.900 +\(t:I\rightarrow\mathbb{R}^{+}\), with \(I\subseteq\mathbb{N}\), maps each integer
   10.901 +onto a real number, such that \(i_{1}<i_{2}\Rightarrow
   10.902 +t(i_{1})<t(i_{2})\). In general, different clocks have no relation to each other, but elements associated with a clock have a sequence defined by the integer
   10.903 +sequence of the clock. 
  10.904 +\item[lifeline:]
   10.905 +\(l = <\alpha ,  t> \) 
   10.906 + is a lifeline, where \(\alpha\)
   10.907 +is a sequence over \(Dom(t)\) and each element of \(\alpha\) is either an initiation of an occurrence or an observation
   10.908 +of one. A \textit{beat} of the lifeline is one tuple, denoted \(l(i)\), while the occurrence associated
   10.909 +with the beat is denoted \(O(l(i))\), or equivalently \(O(\alpha(i))\). The real value
   10.910 +associated with the beat is denoted \(t(l(i))\).  For a given lifeline, not every element of \(t\) must have an associated
   10.911 +element of \(\alpha\), but every element of \(\alpha\) must have a unique associated
   10.912 +element of \(I\) from the clock \(t\).  Note that \(\forall i ,\ t(l(i)) < t(l(i+1))\).  At most one beat, from one
   10.913 +lifeline, can initiate a given occurrence.  However, multiple
   10.914 +beats
   10.915 +of a given lifeline can observe the same occurrence,
   10.916 +including one initiated earlier in the sequence of
   10.917 +the lifeline,
   10.918 +and multiple lifelines may observe the same occurrence,
   10.919 +each multiple times.  
  10.920 +
  10.921 +\item[projection:]
  10.922 +Given \(l_{1} = <\alpha ,  t_{1}> \), \(l_{2} = <\beta ,  t_{2}> \) then a projection from \(l_{1}\) to \(l_{2}\)
  10.923 + is denoted  \(l_{1}(i) \uparrow l_{2}(j) \), where  \(l_{1}(i) \uparrow l_{2}(j)
  10.924 +\equiv O(l_{1}(i)) = O(l_{2}(j))\).
  10.925 + This says that the occurrence initiated by the ith beat of the first lifeline is observed by the jth beat
  10.926 +of the second lifeline. 
  10.927 +
\item[ordering tuple:] \(OT\) is a tuple consisting
of a set of two beats from two different lifelines, neither of which
participates in a projection, plus a set of projections
that cross the two beats in the forward direction.
Given \(OT =<[l_{1}(x) , l_{2}(y)], [projections]> \), then \(OT\) is an
ordering tuple iff \( [projections] \neq \emptyset\ \wedge\ \nexists\, p(i,j) \in [projections]
\mid i<x \wedge j>y \).
\item[program run:] \(\mathcal{R} \) is a particular set of lifelines.
The program run begins with the creation of its first lifeline, and
ends with the end of all of its lifelines.
  10.938 +
\item[equivalent positions in different sequences:] a partial ordering is defined.
Given two positions within different sequences, if
one or both can be
validly rearranged, using the partial ordering to
define valid rearrangements, such that they occupy
the same position in their rearranged sequences, then
they are equivalent positions.
  10.946 +
\item[equivalent occurrences:] two occurrences are
equivalent if their event instances cannot be distinguished, given the observation
measurements of interest. If the observation measurement
involves sequences, then the two occurrences must lie at
equivalent positions within their respective sequences.
  10.952 +
  10.953 +\item[equivalent lifelines:] two lifelines whose beats
  10.954 +can be paired, such that every beat in one lifeline
  10.955 +has an equivalent beat in the other.  The beats do
  10.956 +not have to occur in the same order in both lifelines.
  10.957 +Beats associated to occurrences that are not of interest can be dropped.
  10.958 +
  10.959 +\item[equivalent program runs:] two runs such that
  10.960 +their lifelines can be paired one-to-one, with every lifeline in one paired to an equivalent
  10.961 +lifeline in the other. The projections between lifelines
  10.962 +in one run can be different from the projections in
  10.963 +the other run.
  10.964 +
  10.965 +\item[tie-point:] a set of beats, one from each of two lifelines, such that this set of beats forms a separation set in all equivalent program runs. 
  10.966 +\end{description}
  10.967 +
  10.968 +
Some things to note: a particular occurrence
can be associated with at most one beat from a given
lifeline, but that same occurrence can also be associated
with beats from multiple other lifelines.  Also, an occurrence may
be initiated by a lifeline but never observed by any.
Every occurrence has a set of projections associated with it.
  10.975 +
For example, the event could
be writing a value into a variable.  Two separate
write occurrences are considered equivalent if
they both write the same particular value into whatever memory location
is associated with the same particular
variable, and happen within valid partial orderings
relative to the other occurrences.  This is normally
compared across re-creations of the ``universe'' that
provides the context for the orderings of event instances.
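To make the notation concrete, here is a small, purely hypothetical instance of the definitions above. Let \(O_{7}\) be an occurrence of a write event. Suppose the third beat of lifeline \(l_{1}\) initiates it and the fifth beat of lifeline \(l_{2}\) observes it. Then
\[
O(l_{1}(3)) = O_{7} = O(l_{2}(5)) \;\Rightarrow\; l_{1}(3) \uparrow l_{2}(5),
\]
a projection from \(l_{1}\) to \(l_{2}\), running from initiator to observer.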
  10.985 +
  10.986 +=========
  10.987 +
Okay, talked it over with Sung -- what about making distinguished beats
-- as Sung poked around for, make the PR ``suspend'' be the
distinguished beat.  Then, as we worked out talking it
through, make the code that happens on the hidden timeline be the linkage between the beats -- so a tie-point is any number of distinguished beats such that the hidden calculation on one of the beats executed the resume for all of the other beats in the tie-point.  That establishes how a tie-point gets created.  Separately, we need a universal statement of what is guaranteed by a tie-point.
  10.992 +
So, one thing is that the hidden calc is normally chosen such that every equivalent program run reproduces equivalent tie-points -- but defining equivalence relies upon defining the ``meaning'' of the constructs.  Maybe the statement above, about equivalence in terms of partial order, can be used, by saying all constructs
are associated with a partial ordering -- but a run can still have truly non-deterministic behavior be the correct behavior.  Hmm, but that should still have a partial ordering!
  10.995 +
What I really want to do is define tie-point in terms of write-to-read.  A half tie-point says that what is before the pre is visible after, in the post timeline.  A full tie-point says that this goes both ways.  So, acquire-release is only a half tie-point, because what is after the release in its timeline can be seen before the acquire in its timeline.  That makes it a half tie-point.  Also, what is before the acquire in its timeline does not necessarily have to be seen after the release in its timeline; that also makes it a half tie-point.
  10.997 +
So, use the projection definition, and the crossing definition, to say which crossing projections are allowed by a half tie-point, and which of those must be eliminated to make it a full tie-point.  Then THAT defines the behavior of a tie-point, independently from how it is created.
  10.999 +
The full definition of tie-point, in terms of proto-runtime value, has both of those -- the hidden-timeline ``math'' along with the causality gives the ``creation'' aspect of tie-point, and the allowed projections give the ``behavior'' aspect of tie-point.
 10.1001 +
From the projection ``behavior'' I can simply state ``this
defines what all synchronization constructs do'' --
the projection behavior is the whole purpose of a sync construct: to ensure a particular communication pattern when communication is via side-effect.
 10.1005 +
 10.1006 +=======
 10.1007 +
From the first model, we have the real-value constraints for the slide of suspend and resume relative to each other.
 10.1009 +
The behavior of a full tie-point is: no back-crossing projections, plus a set of forward-crossing projections, which may be empty, and any of the tied timelines may
be the initiating timeline.  For a half tie-point, there is an origin lifeline: there is a set of forward-crossing projections whose initiation is on the origin lifeline,
and backward crossings are allowed whose initiation
is on a non-origin lifeline.
 10.1014 +  
But a tie-point is more than just the behavior it defines.
In order for a pair of special beats to form a tie-point,
they must be causally linked on their internal lifelines.  This means that a sequence of changes of the internal
state links the internal activity of one of the special beats to the internal activity of another special beat,
which executes the resume that ends the second special beat.  All special beats that are resumed inside the
same internal activity will have the behavior of a
full tie-point.  Half tie-points can have the two halves
resumed in different internal activities.
 10.1023 +   
A special beat has a variable-length span, as measured in the real-number values of the clock. A special beat is associated with an isolated atomic span on a hidden lifeline. The only way to end the span of a special beat is via a ``resume'' beat on the hidden lifeline, which names the special beat to be ended.
 10.1025 +
The internal activity on the hidden lifeline enforces a behavioral description of the construct.
 10.1027 +
For example, the send-receive descriptions are: send: if the paired
receiver is in the shared context, then resume both; else place self into the shared context. receive: if the paired send is in the shared context, then resume both; else place self into the shared context.
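As an illustrative sketch only (not actual proto-runtime code), the send-receive description can be written in C as follows, assuming the handlers run one at a time on the single hidden timeline, so no protection is needed inside them; all names here are invented.

```c
/* Sketch of the send/receive rendezvous description, assuming the handlers
 * execute sequentially on the hidden timeline.  Names are illustrative. */
#include <stddef.h>
#include <stdbool.h>

typedef struct {
    int  id;
    bool suspended;   /* set when the timeline's special beat begins */
} Timeline;

/* Shared context: at most one partner waiting on this channel. */
typedef struct {
    Timeline *waiting_sender;
    Timeline *waiting_receiver;
} Channel;

static void resume(Timeline *t) { t->suspended = false; }

/* send: if the paired receiver is in the shared context, resume both,
 * else place self into the shared context. */
void handle_send(Channel *ch, Timeline *sender) {
    sender->suspended = true;        /* sender suspended before handler ran */
    if (ch->waiting_receiver != NULL) {
        resume(ch->waiting_receiver);
        resume(sender);
        ch->waiting_receiver = NULL;
    } else {
        ch->waiting_sender = sender;
    }
}

/* receive: the mirror image of send. */
void handle_receive(Channel *ch, Timeline *receiver) {
    receiver->suspended = true;
    if (ch->waiting_sender != NULL) {
        resume(ch->waiting_sender);
        resume(receiver);
        ch->waiting_sender = NULL;
    } else {
        ch->waiting_receiver = receiver;
    }
}
```

Because the handlers are sequential on the hidden timeline, the pairing logic is plain sequential code; the resume of both partners is what creates the tie-point between the two application timelines.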
 10.1031 +  
For acquire-release: acquire: if lock-owner inside the shared
context is empty, then place self-name into lock-owner
and resume self; else place self onto the end of the sequence
of waiting special beats. release: remove self from lock-owner
and place the next in the sequence of waiting special beats into
lock-owner; if non-empty, then resume the new lock-owner.
In every case, resume self. Note that acquire-release can
form either a half tie-point or a full tie-point.
 10.1041 +
 10.1042 +====
 10.1043 +
Note to the reader: this is a first pass at a formal description of tie-point. It likely contains more constraints than necessary. It should not be taken as the final formalism, nor is it claimed to be elegant in any way; it is simply an existence proof for a formal description
of a useful subset of what the intuition of tie-point refers to.
 10.1046 +
 10.1047 +
 10.1048 +   
 10.1049 +
 10.1050 +
 10.1051 +\subsection{How a synchronization construct relates
 10.1052 +to tie-points}
 10.1053 +
 10.1054 +To prepare for stating how the tie-point model can be used to
 10.1055 +specify a synchronization construct, we first state
 10.1056 +clearly what we mean by a ``synchronization construct''.
 10.1057 +
 10.1058 +The top of Fig \ref{fig:PRSyncConstrDef} shows two
 10.1059 +independent timelines, both performing reads and writes
 10.1060 +within a machine that has coherent shared memory. The
 10.1061 +timelines have no relative ordering defined, so any
 10.1062 +write on Timeline A can be received by any read of
 10.1063 +the same address on
 10.1064 +Timeline B, and vice versa.  This means that, in general,
 10.1065 +the use of a variable that is read and written by both will result in non-deterministic behavior.
 10.1066 +
 10.1067 +
 10.1068 +\begin{figure}[ht]
 10.1069 +  \centering
 10.1070 +  \includegraphics[width = 2.0in, height = 2.8in]
 10.1071 +  {../figures/PR__timeline_sync_def.pdf}
  \caption{Depicts the meaning we adopt for `synchronization construct'. Such a construct controls communication between timelines
by controlling the slide of the timelines relative to each
other, implying certain visibility between writes and reads on different timelines.}
 10.1075 +  \label{fig:PRSyncConstrDef}
 10.1076 +\end{figure}
 10.1077 +  
 10.1078 +
 10.1079 +
 10.1080 +To control the behavior of writes and reads to the
 10.1081 +same addresses, a common point must be established, which
 10.1082 +limits the ``sliding'' of the timelines relative to
 10.1083 +each other. A synchronization construct is used for
 10.1084 +this.
 10.1085 +The net effect of such a construct is to establish
 10.1086 +a common point that both timelines agree on.  This
 10.1087 +point separates reads and writes before it from reads
 10.1088 +and writes after it.
 10.1089 +
 10.1090 +For example, consider a simple lock used to protect a critical section.  The lock is acquired by one timeline
 10.1091 +before entering the critical section. Any writes performed
 10.1092 +on other timelines before the lock was granted must be complete before the critical section starts, so that reads performed inside the critical section see them. This is illustrated in the middle of Fig \ref{fig:PRSyncConstrDef}.
 10.1093 +
 10.1094 +The critical section ends by releasing the lock, which allows a different timeline to acquire and enter the critical section.  As seen in the bottom of Fig \ref{fig:PRSyncConstrDef},
 10.1095 +any writes performed by that new
 10.1096 +timeline after it acquires the lock must not be visible
 10.1097 +to reads performed by the old timeline before it released
 10.1098 +the lock. 
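The visibility guarantee just described can be illustrated with standard POSIX threads. In this sketch, the join is used only to sequence the two critical sections deterministically for the demonstration; the acquire/release pair is what guarantees that the write made before the release is visible to the read made after the acquire.

```c
/* Writes made under the lock before release are visible to a reader
 * that subsequently acquires the same lock. */
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_data = 0;

static void *writer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    shared_data = 42;             /* write inside the critical section */
    pthread_mutex_unlock(&lock);  /* release: write complete and visible
                                     to the next acquirer */
    return NULL;
}

int read_after_acquire(void) {
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    pthread_join(t, NULL);        /* sequencing only, for the demo */
    pthread_mutex_lock(&lock);    /* acquire: prior release's writes seen */
    int v = shared_data;
    pthread_mutex_unlock(&lock);
    return v;
}
```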
 10.1099 +
With this intuition, we define a synchronization construct
as an operation performed on a timeline which, together
with an operation performed on a different timeline,
creates a tie-point.  Any pair of operations that establishes a tie-point
fits our definition of a synchronization construct.
 10.1106 +
 10.1107 +
 10.1108 +\subsection{More on tie-points}
 10.1109 +
 10.1110 +Fig \ref{fig:dualTimeline} showed how a tie-point can be generated. The establishment was accomplished by
 10.1111 +a combination of primitive mechanisms. These include: 1) suspend; 2) an `invisible' timeline that executes
 10.1112 +behavior in the gaps; 3) resume
 10.1113 +called from that invisible timeline; and 4) enforcement
 10.1114 +of instruction completion relative to resume.  
 10.1115 + 
What an established tie-point provides is the notion that the tied points are the same ``instant'' for both tied timelines.  What that means is that both timelines see events ordered relative to that point in the same way.
 10.1117 +
 10.1118 +
 10.1119 +Notice that the primitives that establish a tie-point
 10.1120 +do not involve any notion of  dependency or constraint
 10.1121 +on order of execution.  It is the behavior code that runs on the invisible
 10.1122 + timeline that embodies notions such as dependency
 10.1123 + between units of work, mutual exclusion,
 10.1124 + partial ordering of work, and so on.  However, the
 10.1125 + primitives do provide the notion of causality,  the ordering implied by causality, and enforcing completion
 10.1126 +of reads/writes.
 10.1127 +
 10.1128 +It is up to the language to supply the behavior that happens inside
 10.1129 +the gaps, which executes on the invisible timeline. This behavior is what decides which timelines end up
 10.1130 +sharing a tie point. It is that decision making, of which timelines to tie together, that implements the
 10.1131 +semantics of a synchronization construct.
 10.1132 +
 10.1133 +A workshop paper also discusses tie points
 10.1134 +[]. A formal treatment of tie-points is beyond the scope of this paper. However, a formal framework has been substantially completed and
 10.1135 +will be published in a future paper.  
 10.1136 +
 10.1137 +
 10.1138 +
 10.1139 +\subsection{Tie-points within a proto-runtime}
 10.1140 +
 Fig \ref{fig:dualTimeline} didn't say what entity owns the hidden timeline that executes the behavior that takes place in the gaps. This is the role of the proto-runtime: an instance of the
proto-runtime executes the language-plugin behavior,
and so acts as the hidden timeline.
 10.1144 +
 10.1145 + The proto-runtime code module also supplies implementations
 10.1146 +of the primitives that are used to establish a tie-point, including these:
 10.1147 +
 10.1148 + %It provides the primitive that suspends a timeline and then causes language plugin behavior to execute in the gap. 
 10.1149 +  
 10.1150 +%The plugin behavior that runs in the proto-runtime when one timeline suspends is what chooses another timeline to resume as a consequence. That choice establishes causality between the suspensions of the two timelines,  and in the process ensures that a valid tie will exist between the two collapsed timeline points. The code of the primitives  is provided as part of the proto-runtime  code module, while the plugin behavior is executed by an   instance of a running proto-runtime.
 10.1151 +
 10.1152 +%The running proto-runtime instance is also known as the Master, while the application timelines are known as Slaves.  The behavior of the language constructs executes within the Master's timeline, while the behavior of application code executes within Slave timelines. 
 10.1153 +
 10.1154 +%\subsection{More about the proto-runtime}
 10.1155 +
\begin{itemize}
\item create a virtual processor (which has a suspendible timeline)
\item create a task (which has an atomic timeline that runs to completion)

\item suspend a timeline, then invoke a function to handle the suspension -- the handler is supplied with
parameters from the application
\item resume a timeline, which makes it ready for execution
\item end a timeline
\item trigger the choice of which virtual processor or task begins execution on an offered
core

\end{itemize}
 10.1168 +
Virtual processors and tasks both have associated timelines. The reason for having both is a practical one: tasks are simpler, with less overhead,
and many languages have the semantics of short, atomic units of work that
are not intended to suspend. Thus, tasks are treated differently inside the
proto-runtime, and incur less overhead to create and run.
 10.1173 +
 10.1174 +A special feature of the proto-runtime is that if a task happens to execute
 10.1175 +a language command that causes suspension, then the proto-runtime automatically
 10.1176 +converts that task to a suspendible virtual processor. This helps support the mixing of different
 10.1177 +languages within the same program.
 10.1178 +
 10.1179 +
 10.1180 +The proto-runtime provides a mechanism for communicating information from the application code to the plugin function that was invoked to handle suspension. For example, the identity of a particular mutex a thread wishes to acquire
 10.1181 +can be communicated from the wrapper library to the plugin. 
 10.1182 +
 10.1183 +
Because the proto-runtime tracks all the timelines, the end of a timeline has to be explicitly stated in the application code, by calling a wrapper-library function. That function invokes the proto-runtime primitive,
which informs the proto-runtime instance. The proto-runtime performs internal bookkeeping related to the ending of the timeline, notes that the core is now free, and offers it to the plugin's Assigner function.
 10.1186 +
The proto-runtime involves the language in the process of choosing which core a given task
or virtual processor executes on. The proto-runtime maintains control, but offers free cores to the Assigner
portion of the plugin, which responds by assigning a task or virtual processor to the core. The proto-runtime just offers; it is up to the language to decide what work that core should receive at that point in time.
 10.1190 +
 10.1191 +
 10.1192 +
 10.1193 +\subsection{Concrete Example}\label{subsec:Example}
 10.1194 +
 10.1195 +To make this concrete, consider the example of implementing
 10.1196 +acquire mutex and release mutex. The semantics are:
 10.1197 +
 10.1198 +\begin{itemize}
 10.1199 +\item Acquire Mutex: A thread  calls the construct,
 10.1200 +and
 10.1201 +provides the name of the mutex. If no thread owns the
 10.1202 +mutex, the calling thread is given ownership and it
 10.1203 +continues to make progress. However, if a different thread
 10.1204 +already owns the mutex, the calling thread is put into a queue
 10.1205 +of waiting threads, and stops making progress. 
\item Release Mutex: A thread calls the construct and
provides the name of the mutex. If the mutex has waiting threads in its queue, then the next thread is taken out and given ownership of the mutex. That thread is resumed, to once again make progress, as is the thread
that called the release construct.
 10.1209 +\end{itemize} 
 10.1210 +
 10.1211 +This calls for a data structure that has two fields:
 10.1212 +one holds the thread that currently owns the mutex,
 10.1213 +the other holds a queue of threads waiting to acquire
 10.1214 +the mutex. The semantics of a construct involve multiple
 10.1215 +reads
 10.1216 +and writes of the data structure. Hence, the
 10.1217 + structure must  be protected
 10.1218 +from races between different threads. 
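As an illustrative sketch only (not the actual proto-runtime code), the two-field record and the semantic side of acquire and release might look as follows in C, assuming the handlers run one at a time on the single hidden timeline so that, inside them, no protection is needed; all names are invented.

```c
/* Sketch of the mutex record (owner + queue of waiters) and the semantic
 * side of acquire/release.  The handlers are assumed to run sequentially
 * on the hidden timeline, already protected from interference. */
#include <stddef.h>
#include <stdbool.h>

#define MAX_WAIT 16

typedef struct {
    int  id;
    bool suspended;
} Thread;

typedef struct {
    Thread *owner;              /* thread that currently owns the mutex */
    Thread *queue[MAX_WAIT];    /* threads waiting to acquire */
    int     head, tail;         /* simple ring-buffer queue */
} MutexRec;

static void resume(Thread *t) { t->suspended = false; }

/* Acquire: grant ownership if the mutex is free, else enqueue the caller,
 * which then stops making progress until a release hands it ownership. */
void handle_acquire(MutexRec *m, Thread *caller) {
    caller->suspended = true;   /* caller suspended before handler ran */
    if (m->owner == NULL) {
        m->owner = caller;
        resume(caller);         /* ownership granted: progress resumes */
    } else {
        m->queue[m->tail % MAX_WAIT] = caller;
        m->tail++;
    }
}

/* Release: hand ownership to the next waiter, if any, and resume it.
 * The releasing thread is resumed in every case. */
void handle_release(MutexRec *m, Thread *caller) {
    caller->suspended = true;
    m->owner = NULL;
    if (m->head != m->tail) {
        Thread *next = m->queue[m->head % MAX_WAIT];
        m->head++;
        m->owner = next;
        resume(next);
    }
    resume(caller);
}
```

Note that the checking and updating of the fields is plain sequential code; everything about protecting it from races lives on the other side of the interface.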
 10.1219 +
 10.1220 +The protection
 10.1221 +is where the difficulty comes into the implementation,
 10.1222 +and where performance issues come into the picture.
 10.1223 +It could be accomplished with a single global lock
 10.1224 +  that uses hardware primitives, or accomplished
 10.1225 +with wait-free data structures that only rely upon the coherence
 10.1226 +mechanism of the memory system, or even by message passing plus
 10.1227 +quorum techniques.
 10.1228 +
 10.1229 +However, the implementation of the semantics  is independent
 10.1230 +of the implementation of the protection. They are orthogonal,
 10.1231 +and an interface can be placed between them. One side
 10.1232 +of the interface implements checking and updating the fields of
 10.1233 +the data structure, while the other side implements
 10.1234 +protecting the first side from interference.
 10.1235 + 
 10.1236 +The side that provides protection requires fields,
 10.1237 +for its use, to be placed into the data structure used
 10.1238 +to represent a thread. To hide those details,
 10.1239 +the protection side should also provide
 10.1240 +primitives to create and destroy threads, as well as suspend
 10.1241 +and resume them.
 10.1242 +
 10.1243 +This interface that separates the semantic side from
 10.1244 +the protection
 10.1245 +side is the proto-runtime interface. It is what enables
 10.1246 +the modularization of runtime system implementations.
 10.1247 +
 10.1248 +The tie-point concept provides a model for thinking
 10.1249 +about how the semantic side controls ordering among multiple threads, without exposing any details of the protection side. The tie-point model involves thinking only about actions taken during suspension of timelines (threads). It assumes that those actions are protected from interference, and that suspend and resume of timelines are primitive operations made available. The model remains constant regardless of  implementation details.
 10.1250 + That provides a cross-hardware way of specifying synchronization
 10.1251 +behavior using just sequential thinking. The proto-runtime primitives implement the elements of the tie-point model.    
 10.1252 +
 10.1253 + %Currently, these constructs are either implemented directly in terms of hardware level synchronization constructs such as the atomic Compare And Swap (CAS) instruction, or else are a thin wrapper that invokes operating system behavior. However, the behavior of the OS\ kernel's threading primitives are themselves implemented in terms of hardware level synchronization
 10.1254 +%constructs. Either way,  developing the behavior proves
 10.1255 +%time consuming due to the difficulty of debugging hardware level synchronization behavior, and due to the difficulty of performance tuning such low level code across the full spectrum of patterns caused by applications.
 10.1256 +
 10.1257 +
 10.1258 +
 10.1259 +
 10.1260 +
 10.1261 +\section{Concrete Details}
 10.1262 +Now that we have seen the concepts of how to modularize
 10.1263 +a runtime system, using the tie-point model, it is
 10.1264 +time to make the concepts concrete by showing code
 10.1265 +segments that implement each of the concepts, and code
 10.1266 +segments that use the concepts.  We will start with
 10.1267 +the big picture and work down.
 10.1268 +
 10.1269 +The first stop will be the development process, showing
 10.1270 +how it is fractured into three separate and independent
 10.1271 +development activities.  Next, we will show examples
 10.1272 +of how application
 10.1273 +code invokes constructs, and follow the path of calls
 10.1274 +down to the point it switches over to the runtime system. Lastly,
 10.1275 +we will look at the flow of control inside the runtime,
 10.1276 +where we will focus on the interaction between plugin
 10.1277 +code and proto-runtime code.  
 10.1278 +
 10.1279 +In this last portion, we will show how the
 10.1280 +interface supplies the plugin with a consistent ``inside
 10.1281 +the runtime" environment.  Along with that, we will
 10.1282 +show how providing
 10.1283 +a consistent environment
 is an implementation of the ``single hidden timeline'' portion
 10.1285 + of the tie-point model. We will also show how it is
 10.1286 + the existence of a \textit{single} hidden timeline
 10.1287 + that allows the semantic portion of the language constructs
 10.1288 +to be written in a sequential style, without regard to concurrency issues.  
 10.1289 +
 10.1290 +
 10.1291 +\subsection{Three independent development efforts}
 10.1292 +
 10.1293 +To get a handle on the big picture,  we describe the
 10.1294 +three independent paths that development takes:
 10.1295 +one for development of proto-runtime code, one for
 10.1296 +development of language implementation, and one for
 10.1297 +application development. Each of these produces a separate
 10.1298 +installable artifact.
 10.1299 +The proto-runtime development produces a dynamic library, for each machine. The language development produces a dynamic library to plug into whichever proto-runtime library is installed on a given machine. It may also produce development tools that are used during compilation, distribution, and even installation and during the run. The application development produces a single source, which the language tools may then turn into multiple executables.
 10.1300 + 
 10.1301 +The proto-runtime code is developed separately from
 10.1302 +both language and application code, and packaged as a dynamic library. This library has multiple implementations. Each kind of hardware platform has a proto-runtime implemented specifically for it, and is tuned for low overhead on that hardware. The administrator of a particular machine chooses the proto-runtime implementation best suited to that hardware, and installs that.
 10.1303 +
 10.1304 +The language code is likewise developed separately from both proto-runtime and application code. Although multiple versions of a language may be implemented, there are significantly fewer versions than the number of proto-runtime versions. That is because most of the hardware details are abstracted away by the proto-runtime interface. 
 10.1305 +
 10.1306 +However, the interface does expose hardware features related to placement of work onto cores, so some variations may exist for the same interface. Again, the administrator chooses which language implementation best suits their machine and installs the corresponding dynamic library. 
 10.1307 +
 10.1308 +The wrapper library, however, is not
 10.1309 +installed on the machine where code runs. Rather, it
 10.1310 +is only used during development of an application,
 10.1311 +and remains independent of hardware.
 10.1312 + 
 10.1313 +Ideally the application is developed only once. It makes calls to the wrapper library, which in turn invokes the dynamic libraries of the language and proto-runtime.  
 10.1314 +When an application is executed, the loader binds the
 10.1315 +dynamic libraries, connecting them to the application.
 10.1316 + In this way, a single,
 10.1317 +unchanging, executable gains access to machine-specific implementations of language and proto-runtime.  
 10.1318 +
However, the success of the compile-once approach has
limits in practice. Each machine's characteristics determine the size of unit of work that gives the best performance. When the unit is too small, the overhead in the runtime system that is required to create the work, manage constraints, and perform assignment becomes larger than the work
itself. When the unit is too large, not enough units exist to keep all the cores busy. Thankfully, for most applications the range in between is wide enough that neither limit is hit on most machines. As machines evolve, though, this happy circumstance is likely to change, necessitating recompiling and possibly hand-modifying the application code or some meta-form of it.
 10.1322 +
 10.1323 +\subsection{Walk through of activity during execution} 
 10.1324 +
 10.1325 +At this point, we present a picture of the flow of control on each
 10.1326 +of two cores, as the core is switched between application
 10.1327 +code and runtime code.  It is too early to understand
 10.1328 +the details, but this figure can be referred back to
 10.1329 +as each portion is discussed in the coming sub-sections.
 10.1330 +Each portion of the figure is labelled with the sub-section that describes that portion of activity. 
 10.1331 +
 10.1332 +At the top is the main program, which starts the proto-runtime,
 10.1333 +and creates a proto-runtime process.  Below that is
 10.1334 +depicted the creation of proto-runtime virtual processors,
 10.1335 +along with the animation of application code by those virtual
 10.1336 +processors.
 10.1337 +
 10.1339 +
The application passes information to a wrapper-library
call,
such as the ID of the mutex to acquire. The library function packages the
information into a request data structure, then invokes a proto-runtime
primitive. That suspends the virtual processor (timeline) that is executing
that code.  The call to the primitive passes as arguments the request structure and a pointer
to the plugin function that will handle the request.
The handler runs inside the Master and chooses which
other timelines to resume as a consequence of the wrapper-library
call. Those timelines will then resume, returning from
whatever wrapper-library call caused them to suspend.  In this way, the request handler implements the behavior of a
synchronization construct.
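The flow from wrapper library to request handler can be sketched in C as follows. The request structure, primitive, and handler names are invented for illustration; and where the real primitive suspends the calling timeline so the handler runs later on the Master, this stand-in simply invokes the handler directly.

```c
/* Sketch of a wrapper-library call: package the arguments into a request
 * structure, then hand it, plus a pointer to the handler, to a suspend
 * primitive.  All names are illustrative stand-ins. */

typedef struct {
    int req_type;      /* which construct was called */
    int mutex_id;      /* construct-specific payload */
} Request;

typedef void (*Handler)(Request *req);

static int last_handled_mutex = -1;

/* Stand-in for the proto-runtime primitive: the real one suspends the
 * caller's timeline and arranges for the handler to run on the hidden
 * timeline; here it just calls the handler. */
static void suspend_and_handle(Request *req, Handler handler) {
    handler(req);
}

/* Plugin side: the request handler for `acquire'. */
static void handle_acquire_request(Request *req) {
    last_handled_mutex = req->mutex_id;   /* semantic work goes here */
}

/* Wrapper-library side: what the application actually calls. */
void mutex_acquire_wrapper(int mutex_id) {
    Request req = { 1 /* acquire */, mutex_id };
    suspend_and_handle(&req, handle_acquire_request);
}
```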
 10.1352 +
 10.1353 +However, there is one last step between the request
 10.1354 +handler marking a timeline as ready to resume 
 10.1355 +and it becoming re-animated. That step is where the
 10.1356 +assignment half of the language plugin comes into play.
 10.1357 +The request handlers stack up work that is free to
 10.1358 +be executed, but it is the assigner that chooses which
 10.1359 +of those to place onto an offered core.
 10.1360 +
 10.1361 +
 10.1362 +
 10.1363 +
 10.1364 +
 10.1365 +\begin{figure*}[ht]
 10.1366 +  \centering
 10.1367 +  \includegraphics[width = 7.0in, height = 4.5in]
 10.1368 +  {../figures/Proto-Runtime__modules_plus_plugin_plus_code.pdf}
 10.1369 +  \caption{Illustration of the physical time sequence of the timelines of multiple virtual processors executing on multiple
 10.1370 +cores. The timelines run top to bottom, while calls
 10.1371 +between modules and returns run horizontally. The colors of Fn names indicate whether the
 10.1372 +code is part of the application (green), the proto-runtime module (blue), or the language  (red). The top two timelines are animated
 10.1373 +by core 1, while the bottom 2 are animated by core
 10.1374 +2. The boxes
 10.1375 +represent virtual processors, each with its associated
 10.1376 +timeline next to it. The timelines have no relative
 10.1377 +ordering, except at tie-points established by the Request
 10.1378 +Handlers.   Gaps in the timelines are caused by suspension,
 10.1379 +which is effected by primitives within the proto-runtime
 10.1380 +code module.}
 10.1381 +  \label{fig:physTimeSeq}
 10.1382 +\end{figure*}
 10.1383 +
 10.1384 +
 10.1385 +
 10.1386 +\subsection{Using language constructs}
In the simple form of an eDSL, the language constructs
take the form of function calls. The reader familiar
with posix threads will have used function calls to
perform mutex acquire and release commands.
Here, we illustrate invoking language commands in the
same way.
 10.1393 +
We use posix threads for our example because it is
a language the reader already knows well.
This lets us illustrate the concepts new to proto-runtime without introducing potential confusion about the language semantics.
 10.1397 +
 10.1398 +\subsubsection{Main and startup}
Before using a proto-runtime based language, the proto-runtime
system must be started, and a proto-runtime process
must be created.  Fig X shows this. Notice that the
process-creation call is given a pointer
to a function.  This function is the seed of the proto-runtime
based application code.  The seed must start all proto-runtime
based languages that will be used in the application,
and must create the virtual processors and tasks that
perform the work, which may in turn create more VPs and/or tasks.
 10.1408 +
 10.1409 +==main, with PR\_\_start and PR\_\_create\_process == 
 10.1410 +
 10.1411 +\subsubsection{Seed birth function and thread birth
 10.1412 +function}
Fig X shows our example seed function. It first starts
the language that will be used, Vthread, an implementation
of posix threads on top of the proto-runtime.
Next, the seed uses Vthread commands to create two
threads, and then uses Vthread join to wait for both
threads to die.  Lastly it "dissipates", which is the
command that kills the virtual processor that is animating
the function.
 10.1421 +
 10.1422 +==seed\_birth\_Fn, with Vthread\_\_start(), Vthread\_\_create\_thread,
 10.1423 +Vthread\_\_join, Vthread\_\_stop, and dissipate==
 10.1424 +
 10.1425 +Notice the signature
 10.1426 +of the seed birth function. It returns void, and takes a pointer
 10.1427 +to void plus a pointer to a SlaveVP struct. This is
 10.1428 +the standard signature that must be used for all birth functions for
 10.1429 +proto-runtime created virtual processors or tasks.  
 10.1430 +
 10.1431 +
 10.1432 +Also, notice that the standard signature includes a
 10.1433 +pointer to a SlaveVP struct. This is a proto-runtime
 10.1434 +defined structure, which holds the meta-information
 10.1435 +about a virtual processor. The birth function is handed
 10.1436 +the structure of the virtual processor that is animating
 10.1437 +it.
 10.1438 +
 10.1439 +An illuminating aside is that the birth function for
 10.1440 +a posix thread doesn't need
 10.1441 +to be handed the structure representing the animating thread.
 10.1442 +That is because the operating system tracks which thread
 10.1443 +is assigned to which core.  Posix thread constructs work by executing
 10.1444 +an instruction that suspends the code executing on
 10.1445 +the core and switches
 10.1446 +the core over to animating the OS kernel code. The OS kernel
 10.1447 +then looks up the data structure that is assigned to
 10.1448 +the core.  
 10.1449 +
 10.1450 +That lookup is how the OS kernel gains the
 10.1451 +pointer to the thread that was animating the application
 10.1452 +code that called the posix construct.  But the implementation
 10.1453 +of proto-runtime illustrated in this paper doesn't
 10.1454 +have such a hardware based suspend instruction available,
 10.1455 +and so proto-runtime-based application code must explicitly pass around the pointer to the data
 10.1456 +structure of the virtual processor performing the animation. 
 10.1457 +
 10.1458 +Fig X shows the birth function of the threads created
 10.1459 +by the seed birth function.  It uses the Vthread equivalent
 10.1460 +of mutex acquire and release to protect access to
 10.1461 +a critical section. Notice that the signature
 10.1462 +is the same as the signature of the seed birth function.
Also notice that the SlaveVP structure is handed to
each invocation of a Vthread construct.  In the next
several subsections we will track how this SlaveVP structure
is used.
 10.1467 +
 10.1468 +==thread birth function.. uses Vthread acquire and
 10.1469 +release to protect a counter plus print of count value==
 10.1470 +
 10.1471 +
 10.1472 +\subsection{Language Wrapper Library}
 10.1473 +
 10.1474 +Looking at the implementation of the Vthread calls
 10.1475 +reveals code such as in Fig X.
 10.1476 + 
 10.1477 +==wrapper lib code for mutex acquire==
 10.1478 +
There's nothing much to it.  It just creates a data
structure, fills it, then hands it to a proto-runtime
call.  This is the standard form for wrapper library
calls. The data structure carries information
into the proto-runtime (the proto-runtime that was
started by the PR\_\_start command).  The PR call is
the equivalent of the hardware instruction that suspends
application code and switches to the kernel.  For the
implementation of PR illustrated in this paper, this
call is implemented with assembly instructions.
 10.1489 +
 10.1490 +This wrapper library code is placed on the machine
 10.1491 +used during development of the application, and is
 10.1492 +compiled into the application executable.  However,
 10.1493 +the proto-runtime call is a link to a dynamic library,
 10.1494 +and is not part of the application executable.
 10.1495 +
 10.1496 +Notice that the PR\ primitive is given a pointer to
 10.1497 +a function.  This is called the handler function, and
 10.1498 +is part of the language plugin.  The proto-runtime
 10.1499 +will actually perform the call to the handler function, but in a carefully controlled
 10.1500 +way. It will provide the handler function with a carefully controlled environment
 10.1501 +to use while it handles this wrapper-library call.
 10.1502 +We will see in a moment how proto-runtime invokes the
 10.1503 +handler function, and what such a handler function
 10.1504 +looks like.
 10.1505 +
First, here is the assembly that suspends the application code and
switches to the proto-runtime code, as seen in Fig X.
 10.1508 +
 10.1509 +==assembly of suspend and switch==
 10.1510 +
 10.1511 +All it does is save the program counter and stack pointer
 10.1512 +into the SlaveVP structure, then load in the program
 10.1513 +counter and stack pointer of the proto-runtime code,
 10.1514 +which was previously saved in different fields of that same SlaveVP structure.
 10.1515 +
 10.1516 +\subsubsection{proto-runtime code that is switched
 10.1517 +to}
 10.1518 +
The PR assembly code switches the core to executing
the (pseudo)code seen in Fig X.
 10.1521 +
 10.1522 +==animation master code, which calls plugin fns==
 10.1523 +
All this does is invoke the handler function named
in the wrapper library, and hand it an environment.
This is the hidden environment referred to in the tie-point
model.  It must be accessed in an isolated, atomic
fashion.  The proto-runtime code seen here happens
to use a global lock for each language's environment.
 However, other implementations are possible.  In order
 to keep overhead low, it uses the Compare And Swap
 instruction to acquire the lock, and an exponential random
 backoff scheme when contention for the lock arises.
 10.1534 + 
 The handler function is the hidden behavior that executes
 on the hidden timeline mentioned in the tie-point
 model. The suspend primitive is what begins a special
 beat on the lifeline of the virtual processor that
 executed the wrapper library call. It is this handler
 code that then establishes the causal connections
 between such special beats, and so ties them together.
 The causal connection is via the changes made to the
 language environment.
 10.1544 + 
 10.1545 + So, in summary, the proto-runtime is the hidden timeline.
 10.1546 + The suspend primitive is what starts a special beat
 10.1547 + and starts the behavior on the hidden timeline. The
 10.1548 + lock is what isolates and sequentializes
 10.1549 + the behavior on the hidden timeline.  The language
 10.1550 + environment is the hidden state used to establish
 10.1551 + causal connection between special beats.
 10.1552 +
 10.1553 +
 10.1554 +
This is not the plugin code; this is the library that the application executable includes. It is equivalent to the pthread library. When you look at the source of the pthread library, it is just a wrapper that invokes the OS; it doesn't do anything itself. The language libraries are the same thing: just wrappers that invoke the proto-runtime primitives. Those suspend the VP and send a message to the proto-runtime. When the message arrives, the proto-runtime invokes the plugin to handle it.
 10.1556 +
Here is how the wrapper library connects a request to the request handler: via the function pointer, shown in Fig X, given to the proto-runtime "suspend and send" primitive. The pointed-to function is part of the plugin. It runs inside the proto-runtime, and is what handles the message created in the wrapper library.
 10.1558 +
 10.1559 +
If we look at that handler function, Fig X, we see that it has a standard prototype, so it takes a standard set of arguments. One of those, shown in Fig X, is a language environment. This is the special sauce: it is the thing that is shared among all the cores. The language environment is where tasks are placed that are not yet ready to execute, and where suspended virtual processors are placed that are not yet ready to resume.
 10.1561 +
Here, Fig X, you can see there is a hash table. The language environment contains that hash table, and tasks get parked in it. Each time a task completes, the handler looks in the hash table, finds all tasks waiting for that completion, and updates the status of those waiting tasks. If this was the last task being waited for, the waiter is taken out of the hash table and put into the queue of ready-to-execute tasks.
 10.1563 +
This is the semantics of the language: how the language defines what dependencies are, and when a task's dependencies have been satisfied.  The implementation is just a data structure in the shared language environment. It is the proto-runtime that takes care of creating the tasks and virtual processors, and executing, suspending, and resuming them. The proto-runtime handles the mechanics; the language just determines the constraints on making work ready.
 10.1565 +
 10.1566 +?
 10.1567 +
 10.1568 +Separately, the proto-runtime calls the Assigner function, which is also part of the plugin dynamic library. Each time a task completes or a virtual processor suspends, the wrapper library invokes a proto-runtime primitive. Among other things, that primitive informs the proto-runtime about the completion of that work, which tells the proto-runtime that hardware resources have just been freed up.
 10.1569 +
The proto-runtime then invokes the Assigner function, passing it information about the hardware that was just freed. The assigner is implemented by the language and uses some language-specific way to choose which of the ready work-units to execute on that hardware (a work-unit is either a ready-to-execute task or a ready-to-resume virtual processor).  This is how the language is given control over placement of work onto cores.
 10.1571 +
 10.1572 +===================
 10.1573 +
 10.1574 +
 10.1575 +\subsection{not sure}
 10.1576 +A task is an atomic unit of work.  It runs to completion, without suspending. That characteristic allows the proto-runtime to internally treat a task differently than a virtual processor.  The fact that it never suspends means it doesn't need a stack, and needs less bookkeeping, which makes a task faster to create and faster to assign, for lower overhead.
 10.1577 +
However, a task may optionally choose at some point to execute a language command that causes it to suspend. At that point, the proto-runtime internally converts the task to a virtual processor. This allows the task to suspend and later resume, at the cost of the normal virtual processor overhead. The virtual processor the task is converted to comes from a recycle pool and returns to that pool when the task completes.
 10.1579 +
 10.1580 +As an application programmer, you can create processes directly with an OS-like language built on top of the proto-runtime.  But you use a programming language to create tasks or virtual processors. For example, VSs has a way to create tasks.  VSs internally then uses a proto-runtime command to have the proto-runtime create a task for it.  Then VSs decorates the task with its own meta-data. It uses that meta-data to track when a task should be executed. 
 10.1581 +
 10.1582 +?
 10.1583 +
 10.1584 +The only thing you're allowed to do outside a language is create the environment in which you start a language.
 10.1585 +
 10.1586 +?
 10.1587 +
The implementation of the language behavior is the plugin. The plugin has two parts: request handlers, which handle the messages that come when a VP suspends, and an assigner, which picks the core onto which a particular VP resumes or a task runs. With VSs, the plugin provides the behavior of "submit task". 
The request handlers plus assigner together provide the two halves of what people normally call a scheduler.
 10.1590 +
 10.1591 +=================
 10.1592 +
 10.1593 +\subsection{more on tie-points}
Any event visible before the tie-point in one timeline is visible in both timelines after it. The guarantee is between "before" in one and "after" in both.
 10.1595 +
 10.1596 +From the program point of view, that acquire statement is one instant.  That entire gap in physical time is seen as a single instant to the code.
 10.1597 +
However, the tie-point is just one instant in the timelines.  After the point, one of the timelines could perform an event that interferes with an event from before the tie-point, and no guarantees are given about what the other timeline sees.  However, if another tie-point is created between them, then both are guaranteed to see that second, interfering event after the second tie-point.
 10.1599 +
Take the example of a mutex, M.  The purpose of the only-one semantics of a mutex is to isolate read and write operations done by the owning thread from those done by other threads, which own it before or after.
 10.1601 +
The mutex behavior is illustrated in Fig X. Timeline 1 writes to variable A at point 1, then releases M at point 2. Timeline 2 acquires M at the tied point 2 and reads A at point 3.  For M to provide isolation, it must guarantee that the write of A at point 1 is seen by the other timeline's read at point 3.  Likewise, it has to guarantee that nothing that happens in timeline 2 after the acquire of M, at point 2, will be seen by timeline 1 before its release, also at point 2.
 10.1603 +
That ordering guarantee is what we think of when we imagine the behavior of a mutex acquire-release pair.  All writes done by the releasing thread are seen as completed by reads performed in the acquiring thread, and no writes in the acquiring thread are seen by the releasing thread before the release.  That is required for the only-one-owner semantics to have value: the purpose of only-one is to isolate read and write operations done by the owning thread from those done by the threads that own it before or after.
 10.1605 +
 10.1606 +
The behavior is implemented in terms of a data structure that lives inside the controlling entity's environment.  The controlling entity looks up the data structure for the mutex being requested.  This data structure has a field that contains the name of the thread that currently owns the mutex, plus a queue of threads waiting to acquire it.  So, the controlling entity first looks at the field that holds the current owner, sees that it is occupied, and then puts the requesting thread's name into the queue of waiting threads.
 10.1608 +
At some point later, the waiting thread reaches the top of the queue. When the owning thread executes the release operation, it also suspends; the controlling entity sees that suspend and that the thread wants to perform the release behavior. It looks up the release behavior and performs it.  This behavior looks up the mutex data structure in the controlling entity's environment, clears the releasing thread from the owner field, takes the top thread off the waiters, writes its name into the current owner, then marks both threads as ready to resume their timelines.
 10.1610 +
 10.1611 +The proto-runtime is the controlling entity, which looks up the behaviors and performs them.  It also manages the environment that holds the data structures used by the behaviors. 
 10.1612 +
 10.1613 +===========
 10.1614 +
The purpose of M is to guarantee that what gets written to A in this timeline is seen over here, in the other timeline.
 10.1616 +
 10.1617 +So, to turn this simple mechanism into a synchronization construct, you add semantics on top, which determine the end of suspend in the two timelines.  The timelines voluntarily place themselves into suspend, and it is up to the controlling entity to decide at what point to end that suspension.  It is this choice of ending suspension that ties events in one timeline to events in another.  The semantics of deciding that end of suspension is the semantics of the synchronization construct.
 10.1618 +
 10.1619 +For example, take mutual exclusion within Threads. One thread executes a construct that asks to acquire the mutex.  At the point of executing, that thread suspends, so that timeline ceases advancing.  At some point later, the controlling entity sees that suspend, and sees that the timeline is attempting the acquire mutex activity.  It looks up the behavior for acquire mutex, which is then performed inside that controlling entity.
 10.1620 +
 10.1621 +============
 10.1622 +    
 10.1623 +
 10.1624 +\subsection{More on eDSLs}
 10.1625 +%======================================
 10.1626 +
 10.1627 +%We expand on the hypothesis that an embedded style Domain Specfic Language (eDSL) provides high programmer productivity, with a low learning curve. We also show (\S ) that when an application is written in a well designed eDSL, porting it to new hardware becomes simpler, because often only the language needs to be ported.  That is because the elements of the problem being solved that require large amounts of computation are often pulled into the language. Lastly (\S ),  we hypothesize that switching from sequential programming to using an eDSL is low disruption because the base language remains the same, along with most of the development tools and practices.
 10.1628 +
 10.1629 +%In \S \ref{sec:DSLHypothesis} we show that the small number of users of an eDSL means that the eDSL must be very low effort to create, and also low effort to port to new hardware.  At the same time, the eDSL must remain very high performance across hardware targets. 
 10.1630 +
 10.1631 +%In \S we analyze where the effort of creating an eDSL is expended. It turns out that in the traditional approach, it is mainly expended in creating the runtime, and in performance tuning the major domain-specific constructs. We use this to support the case that speeding up runtime creation makes eDSLs more viable. 
 10.1632 +
 10.1633 +%In \S we take a step back and examine what the industry-wide picture would be if the eDSL approach were adopted. A large number of eDSLs will come into existence, each with its own set of runtimes, one runtime for each hardware target.  That causes a multiplicative effect: the number of runtimes will equal the number of eDSLs times the number of hardware targets.  Unless the effort of implementing runtimes reduces, this multiplicative effect could dominate, which would retard the uptake of eDSLs.
 10.1634 +
 10.1635 +
 10.1636 +% ==============
 10.1637 +
 10.1638 +%Further, in \S we show that when an application is written in a well designed eDSL, porting it to new hardware becomes simpler because often only the language needs to be ported.  That is because the elements of the problem being solved that require large amounts of computation are often pulled into the language. Lastly, in \S we hypothesize that switching from sequential programming to using an eDSL is low disruption because the base language remains the same, along with most of the development tools and practices.  Hence, we cover how the three issues currently making parallel programming unattractive are addressed by embedded-style DSLs. 
 10.1639 +
 10.1640 +%We next show what the blocks to eDSLs are, and where the main effort in implementing an eDSL lies. Specifically, in \S \ref{sec:DSLHypothesis} we show that the small number of users of an eDSL means that the eDSL must be very low effort to create, and also low effort to port to new hardware.  At the same time, the eDSL must remain very high performance across hardware targets. 
 10.1641 +
 10.1642 +%In \S we analyze where the effort of creating an eDSL is expended. It turns out that in the traditional approach, it is expended in creating the translator for the custom DSL syntax, in creating the runtime, and in performance tuning the major domain-specific constructs. We propose that the MetaBorg[] or Rose[] translation approaches cover creating translators for custom syntax, and that tuning constructs is inescapable, leaving the question of runtime implementation time. 
 10.1643 +
 10.1644 +%In \S we explore the effects of runtime implementation time by taking a step back and examine what the industry-wide picture would be if the eDSL approach were adopted. A large number of eDSLs will come into existence, each with its own set of runtimes, one runtime for each hardware target.  That causes a multiplicative effect: the number of runtimes will equal the number of eDSLs times the number of hardware targets.  Unless the effort of implementing runtimes reduces, this multiplicative effect could dominate, which would retard the uptake of eDSLs. Thus, showing that an approach that mitigates this multiplicative effect is valuable, and is the role that the proto-runtime plays.    
 10.1645 +
 10.1646 +
 10.1647 +
 10.1648 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 10.1649 +\subsection{Details}
 10.1650 +\label{subsec:Details}
 10.1651 +
 10.1652 +  what responsibilities are encapsulated in which modules, and what the interfaces between them look like. 
 10.1653 +
The modularization and its interfaces are what make the proto-runtime reusable by all languages on given hardware, and the low-level tuning of the proto-runtime for specific hardware automatically benefits all the languages on that hardware.   
 10.1655 +
 10.1656 +?
 10.1657 +
 10.1658 +
 10.1659 +
 10.1660 + overhead measurements 
 10.1661 +
 10.1662 +implementation time measurements
 10.1663 +
 10.1664 + discuss why equivalent user-level M to N thread packages haven't been pursued, leaving no viable user-level libraries to compare against.
 10.1665 +
 10.1666 + give numbers that indicate that the proto-runtime approach is also competitive with Cilk, and OMPSs, on large multi-core servers.
 10.1667 +
 summary of development time of the various embedded languages created so far.  Unfortunately, no control is available to compare against, but we provide estimates based on anecdotal evidence of the time taken to develop the versions compared against for overhead. At the
least, the same effort that we expended on performance
tuning our proto-runtime would have to be expended on
each and every language.
 10.1672 +
  We continue with a bigger-picture discussion of the difference in design methods between traditional approaches and the proto-runtime implementations (\S ). We discuss OpenMP versus the equivalent proto-runtime version called VOMP (\S ).  Then (\S ) we discuss Cilk 5.4 vs the proto-runtime VCilk. Next we discuss pthread vs Vthread (\S ), and OMPSs vs VSs (\S ).  These discussions attempt to convey the two design philosophies and paint a picture of the development process in the two competing approaches.  The goal is to illustrate how the proto-runtime approach maintains many of the features, through its centralized services, while significantly reducing implementation time, through reuse of the services, elimination of concurrency concerns in design and debugging, and the simplifications in design and implementation caused by the clean modularization of the proto-runtime approach and the regularization of implementation from one language to another.
 10.1676 +
 10.1677 +Then, with the full understanding of the proto-runtime approach in hand, we discuss  how it compares to related work (\S ).
 10.1678 +
 10.1679 +Finally, we highlight the main conclusions drawn from the work (\S ).
 10.1680 +
 10.1681 +
 10.1682 +
 10.1683 +?
 10.1684 +
 10.1685 +
 10.1686 +
 10.1687 +
 The behavior module creates work and determines when work is free to execute; it tracks constraints on work imposed by language semantics, and constraints
due to data dependencies.
 10.1690 +
A copy of the proto-runtime with language modules runs separately on each core, and the copies communicate via shared variables in a shared language environment.  The proto-runtime protects access to the shared language environment so that language modules can be written in sequential style.  
 10.1692 +
 10.1693 +?
 10.1694 +
The proto-runtime also implements "centralized" services that it makes available to all languages.  Hardware-specific services include communicating between cores and protecting the internal state used by the language modules.
 10.1696 +
 10.1697 +
 10.1698 +
   This lets the proto-runtime be reused by all languages on given hardware, and the low-level tuning of the proto-runtime for specific hardware automatically benefits all the languages that run on that hardware.   
 10.1700 +
 10.1701 + implementing language logic, 
 10.1702 +
 10.1703 +show how the proto-runtime interface allows it to use sequential thinking. 
 10.1704 +
 10.1705 +give similar detail on the implementation of the assigner,
 10.1706 +we discuss how that has the potential to improve application performance by reducing communication between cores and reducing idle time of cores. 
 10.1707 +
 10.1708 +support  belief that the patterns we followed when modularizing are indeed fundamental and will remain valid for future languages and hardware. 
 10.1709 +
 10.1710 + discuss some of the centralized services provided by the current proto-runtime implementation, as well as planned future ones. 
 10.1711 +
 10.1712 +reusing language logic from one language implementation to another. 
 10.1713 +
 10.1714 +
 10.1715 +%%%%%%%%%%%%%%%%%%%%%%%%
 10.1716 +%%
 10.1717 +%%%%%%%%%%%%%%%%%%%%%%%%
 10.1718 +\section{Measurements}
With the background on eDSLs and description of the proto-runtime approach behind us, we provide overhead measurements in \S\ref{subsec:OverheadMeas} and implementation time measurements in \S\ref{subsec:ImplTimeMeas}.
 10.1720 +
 10.1721 +\subsection{Overhead Measurements} \label{subsec:OverheadMeas}
For the following, we use a 4-core single-socket 2.4GHz laptop, and a 4-socket server with 10 cores per socket.
 10.1723 +
 10.1724 +For runtime performance:
 10.1725 +
 10.1726 +-- Vthread vs pthread: laptop and server on exe vs task (and fibonacci?)
 10.1727 +
 10.1728 +-- VCilk vs Cilk: laptop and server on fibonacci (from Albert)
 10.1729 +
 10.1730 +-- VOMP vs OpenMP: laptop and server on exe vs task and fibonacci
 10.1731 +
 10.1732 +-- VSs vs OMPSs: laptop and server on fibonacci and jpeg
 10.1733 +
\begin{table}[ht]
\centering
\begin{tabular}{|c|c|c|c|c|c|c|}\hline
a & 2 & a & a & a & a & a \\\hline
a & 2 & a & a & a & a & a \\\hline
a & a & a & a & a & a & a \\\hline
a & a & a & a & a & a & a \\\hline
\end{tabular}
\caption{}
\label{tab}
\end{table}
 10.1742 +
 10.1743 +As seen, we didn't include application performance because we have not yet taken advantage of the opportunity to use language information to predict locality.  That research is in progress and will be reported in future papers.
 10.1744 +
 10.1745 +
 10.1746 +\subsubsection{Vthread Versus Highly Tuned Posix Threads}
 10.1747 +\label{sec:VthreadVsPthread}
Measurements indicate that the proto-runtime approach has far lower overhead than even the current highly tuned Linux thread implementation. This section also discusses why equivalent user-level M to N thread packages haven't been pursued, leaving no viable user-level libraries to compare against.  
 10.1749 +\subsubsection{VCilk Versus Cilk 5.4}
In \S we give numbers that indicate that the proto-runtime approach is also competitive with Cilk.
 10.1751 +\subsubsection{VSs Versus StarSs (OMPSs)}
 10.1752 +OMPSs
 10.1753 +\subsubsection{VOMP Versus OpenMP}
 10.1754 +VOMP
 10.1755 +
 10.1756 +
 10.1757 +%%%%%%%%%%%%%%%%%%%%%%%%
 10.1758 +%%
 10.1759 +%%%%%%%%%%%%%%%%%%%%%%%%
 10.1760 +\subsection{Development Time Measurements}\label{subsec:ImplTimeMeas}
Here we summarize the time to develop each of the eDSLs and each copy-cat language created so far. As a control, we estimate, based on anecdotal evidence, the time required to create the equivalent functionality using the traditional approach.
 10.1762 +
Table \ref{tabPersonHoursLang} summarizes measurements
of the time we spent to design, code, and debug an initial working version of each of the languages we created.  The results are shown in the order we created them, with SSR first. As we gained experience, design and coding became more efficient.   These are hours spent at the keyboard or with pen and paper, and don't include think time during other activities in the day.
 10.1765 + 
 10.1766 +
\begin{table}[ht]
\centering
\begin{tabular}{|l|r|r|r|r|r|r|r|}
  \cline{2-8}
  \multicolumn{1}{r|}{} & SSR & Vthread & VCilk & HWSim & VOMP & VSs & Reo\\
  \cline{2-8}
  \noalign{\vskip2pt}
  \hline
  Design & 19 & 6 & 3 & 52 & 18& 6 & 14\\
  Code & 13 & 3 & 3& 32 & 9& 12 & 18\\
  Test & 7 & 2 & 2& 12 & 8& 5 & 10\\
  L.O.C. & 470 & 290 & 310& 3000 & 690 & 780 & 920\\
  \hline
\end{tabular}
\caption{Hours to design, code, and test each embedded language. L.O.C. is lines of (original) C code, excluding libraries and comments.}
\label{tabPersonHoursLang}
\end{table}
 10.1785 +
 10.1786 +%\subsubsection{Comparison of Design Approaches}
 10.1787 +%We give the bigger picture of the difference in  approach for each language, between the proto-runtime implementation and the distributed implementation.  The goal is to illustrate how the proto-runtime  centralized services, while significantly reducing implementation time, through reuse of the services, elimination of concurrency concerns in design and debugging, and in the simplifications in design and implementation caused by the clean modularization of the proto-runtime approach, and the regularization of implementation from one language to another.
 10.1788 +
 10.1789 +
 10.1790 +%%%%%%%%%%%%%%%%%%%%%%%%
 10.1791 +%%
 10.1792 +%%%%%%%%%%%%%%%%%%%%%%%%
 10.1793 +\section{Related Work} \label{sec:Related}
 10.1794 +
We discuss how the proto-runtime approach compares to other approaches to implementing the runtimes of domain specific languages.  The criteria for comparison are: level of effort to implement the runtime, effort to port the runtime, runtime performance, and support for application performance. The main alternative implementation approaches are: POSIX threads, user-level threads, TBB, modifying libGomp, and using hardware primitives to build a custom runtime.
 10.1796 +
 10.1797 +We  summarize the conclusions in Table \ref{tab:CriteriaVsApproach}.
 10.1798 +
 10.1799 +
\begin{table}[h!tb]
\centering
\begin{tabular}{|c|c|c|c|c|}\hline
Runtime Creation  & \textbf{impl.}& \textbf{porting} & \textbf{runtime} & \textbf{application} \\
\textbf{} & \textbf{ease} & \textbf{ease} & \textbf{perf.} & \textbf{perf.}\\\hline
\textbf{OS Threads} & ++ & ++ & + & + \\\hline
%\textbf{User Threads} & ++& ++ & ++ & + \\\hline
\textbf{TBB} & ++ & ++ & ++ & + \\\hline
\textbf{libGomp} & +++ & ++ & +++ & ++++ \\\hline
\textbf{HW primitives} & + & + & +++++ & +++++ \\\hline
\textbf{Proto-runtime} & +++++ & +++++ & ++++ & +++++\\\hline
\end{tabular}
\caption{How well each implementation approach scores on the measures important to implementors of runtimes for DSLs. Rows are the implementation approaches; columns are the measures. One plus is the lowest score, indicating the approach is least desirable on that measure; five plusses indicate the highest desirability. The reasons for the scores are discussed in the text.} \label{tab:CriteriaVsApproach}
\end{table}
 10.1815 +
 10.1816 +
 10.1817 +
The first two approaches, OS threads and TBB, have poor runtime and application
performance. They build the DSL runtime on top of OS threads or TBB, each of which has a runtime in its own right, so the DSL runtime runs on top of the lower-level runtime.  This places control of work placement inside the lower-level runtime, out of reach of the DSL runtime, which hurts application-code performance through inability to exploit data locality. In addition, OS threads carry operating-system overhead and OS-imposed fairness requirements, which keep runtime performance poor, as seen in Section \ref{sec:VthreadVsPthread}.

Both also force the DSL implementation to manage concurrency explicitly, using lower-level constructs such as locks.  TBB may have a slight advantage due to its task-scheduling commands, but only for task-based languages. Hence, implementation effort scores poorly for these approaches.
 10.1822 +
For the same reason, porting is poor for these two
approaches. The DSL's runtime code needs to be rewritten and tuned for each hardware platform, or else some form of hardware abstraction must be placed into the runtime.  But such a hardware abstraction is essentially an alternative way of implementing half of the proto-runtime approach, without the centralization, reuse, and modularization benefits.
 10.1825 +
We move on to libGomp. Some language researchers use libGomp (based on informal discussions) because its very simple structure makes it relatively easy to modify, especially for simple languages. However, it provides no services such as debugging or performance tuning, and offers no modularization or cross-language reuse benefits.  As the price of the simplicity, performance suffers, as seen in the experiments [].  Also, rewrites of the DSL runtime are required for each platform in order to tune it to hardware characteristics. However, because the runtime is directly modified, the language gains control over placement of work, enabling good application performance if the extra
effort is expended to take advantage of it.
 10.1828 +
Lastly, we consider writing a custom runtime from scratch, using hardware primitives such as the Compare-And-Swap (CAS) instruction or similar atomic read-modify-write instructions.  This approach requires the highest implementation effort and has the worst portability across hardware.  However, if sufficient effort is expended on tuning, it can achieve the best runtime performance and equal the best performance of application code. So far, the gap between highly tuned language-specific custom runtimes and our proto-runtime has proven small, but we have only the Cilk implementation as a comparison point. 
 10.1830 + 
Putting this all together, Table \ref{tab:CriteriaVsApproach} shows that the proto-runtime approach is the only one that scores highly on all of the measures. It makes initial language implementation fast and reduces porting effort, while keeping runtime performance high and enabling high application performance. 
 10.1832 +
 10.1833 +
 10.1834 +
 10.1835 +%%%%%%%%%%%%%%%%%%%%%%%%
 10.1836 +%%
 10.1837 +%%%%%%%%%%%%%%%%%%%%%%%%
 10.1838 +\section{Conclusions and Future Work}
The main takeaways from the paper are, first, the potential for embedded-style Domain Specific Languages (eDSLs) to address the issues that are holding back parallel programming, and second, the role that the proto-runtime approach can play in making eDSLs practical, by simplifying the runtime aspect of implementing a large number of eDSLs across the many hardware targets. 
 10.1840 +%The proto-runtime approach does this by modularizing the runtimes, providing reuse of centralized services, and reuse of the hardware-specific performance tuning, which is performed once per hardware, on the proto-runtime, then enjoyed by all the eDSLs.  Hence, the proto-runtime approach provides a significant piece of the puzzle of providing eDSLs, to bring parallel programming into the mainstream.
 10.1841 +
 10.1842 +
 10.1843 +%[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]] 
 10.1844 +
 10.1845 +
 10.1846 +Specifically, we have shown how the approach modularizes runtime code, in a way that appears applicable to any language or execution model. It isolates the hardware-specific portion  from language behavior as well as from the language-driven placement of work onto resources, providing interfaces between them.
 10.1847 +
  The modularization reduces the effort of implementing a new language, especially an embedded-style one, where runtime creation is a significant portion of total effort.  It causes the low-level hardware portion to be reused by each language. And the behavior implementation is simplified by handling shared state inside the proto-runtime and exporting a sequential interface for the behavior module to use. The simplification reduces effort, as does reuse of the hardware-specific portion, reuse of behavior code from one language to another, reuse of assignment code, and implementors' familiarity with the modular structure. Overall effort reduction was supported by measurements of implementation effort. 
 10.1849 +
 10.1850 +The proto-runtime approach makes it practical to maintain high overall runtime performance, with low effort for the language implementor. It is practical because high effort is put into performance-tuning the hardware-specific proto-runtime, which is then reused by each language. In this way the performance derived from the high tuning effort is inherited without extra effort by the language creators, thus amortizing the cost.
 10.1851 +
 10.1852 +Centralized services were implemented inside the proto-runtime portion, such as debugging facilities, automated verification, concurrency handling, hardware performance information gathering, and so on. We showed how they are reused by the languages. 
 10.1853 +
Although we didn't measure it, we indicated how application performance can be increased by giving the language direct control over placement of work, to take advantage of data affinity or application-generated communication patterns. This ability comes from the assignment module, which gives the language implementor control over which core each unit of work is assigned to, and the order in which work units execute.
 10.1855 +
 10.1856 +Work on the proto-runtime approach is in its infancy, and much remains to be done, including:
 10.1857 +\begin{itemize} 
 10.1858 +\item  Creating related interfaces for use with distributed memory hardware, and interfaces for hierarchical runtimes, to improve performance on many-level hardware such as high-performance computers, and to tie together runtimes for different types of architecture, to cover heterogeneous architectures and machines.
 10.1859 +\item Extending the proto-runtime interface to present hardware information that a work-assigner will need, but in a generic way that remains constant across many hardware configurations yet exposes all relevant information.
 10.1860 +\item Exploring work assignment implementations that take advantage of language and application knowledge to improve placement of work to gain higher application performance.
 10.1861 +\item Applying the proto-runtime approach to support a portability software stack, and supply OS services to applications via the proto-runtime, to further increase application-code portability.
 10.1862 +\end{itemize}
 10.1863 +
 10.1864 +
 10.1865 +\end{document}
 10.1866 +=============================================
 10.1867 +==
 10.1868 +==
 10.1869 +==
 10.1870 +==
 10.1871 +==
 10.1872 +=============================================
 10.1873 +
 10.1874 +\section{The Problem}
 10.1875 +
 10.1876 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
 10.1877 +
 10.1878 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
 10.1879 +
 10.1880 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
 10.1881 +
 10.1882 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
 10.1883 +
 10.1884 +
 10.1885 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
 10.1886 +
While talking about the problems encountered by Domain Specific Languages (DSLs), we focus on implications for the runtime system, due to its central role in the claims.  At the same time, we support the hypothesis that embedded-style DSLs are highly productive for application programmers, have a low learning curve, and cause little disruption to current programming practices.  In doing so we lay the groundwork for the next section, where we show that the main effort of implementing embedded-style DSLs is creating the runtime, and that with the proto-runtime approach embedded-style DSLs are low-effort to create and port, moving the effort of porting for high performance out of the application and into the language.
 10.1888 +
 10.1889 +To give the needed depth, we'll first talk about a way to classify parallel languages  according to the structure of their runtime (subsection \ref{subsec:ClassifyingLangs}).  Then we'll talk about the sub-class of domain specific parallel languages, what sets them apart, and the implications for their runtime implementations (subsection \ref{subsec:DomSpecLangs}). That segues into the embedded style of language, and how the work of implementing them is mainly the work of implementing their runtime (subsection \ref{subsec:EmbeddedDSLs}).
 10.1890 + 
 10.1891 +Once that reduction from parallel languages in general to embedded style domain specific ones in particular is done, we'll give more on what embedded style DSLs look like from an application programmer's view (subsection \ref{subsec:AppProgViewOfDSL}). We will include depth on a particular embedded-style language, showing sample code that uses the constructs, then delving into needs within the implementation of that language, and behavior of the constructs during a run (subsection []).
 10.1892 +
 10.1893 +The main implications for runtime systems, which were uncovered within the section, are summarized at the end (subsection []).
 10.1894 +
\subsection{Classifying parallel languages: virtual-processor based vs.\ task based}
 10.1896 +\label{subsec:ClassifyingLangs}
 10.1907 +
 10.1908 +One major axis for classifying parallel languages is whether they are virtual processor based or task based, which has implications for the structure of the runtime.
 10.1909 +
A virtual processor is long-lived, and has a context that persists across suspend and resume, while a task has no preceding context to fit into and leaves no implied context when done.  POSIX threads is a standard example of a virtual-processor based parallel language, as are UPC and Charm. All of these create virtual processors (aka threads), which suspend when they invoke synchronization and other parallel-language constructs, then resume after the construct completes.  Such virtual processors have their own private stack to save the information needed upon resume.  
 10.1911 +
 10.1912 +In contrast, dataflow is a standard example of a task based language, as is CnC. For these languages, a task is passed all the information it needs at the point of creation, and is expected to run to completion.  If a task needs to invoke a parallelism construct, that invocation normally ends the task, while information needed by following tasks is saved explicitly in shared variables, or passed to the runtime as a continuation that is then handed to the task created when the construct completes.  
 10.1913 +
Hybrids of the two also exist, such as OpenMP, which implies thread creation via the parallel pragma but also creates tasks via the for pragma. Similarly, StarSs (OmpSs) mixes the two, with a main thread that creates meta-tasks that must resolve their dependencies before being turned into executable tasks. Those tasks are also able to invoke barriers and other synchronization constructs, then resume.
 10.1915 +
 10.1916 +The runtime implementations of the two different types of execution model differ markedly.  Virtual processor (VP) based runtimes have to create a stack for each VP created, and manage the interleaving of the CPU's hardware stack.  They also require a mechanism to suspend and resume the VPs, and save them in internal structures while suspended.
 10.1917 +
 10.1918 +In contrast, task based runtimes need ultra-fast creation of tasks, and fast linkage from the end of one to the start of the next.  They tend to keep the task-structures in a queue and discard them when complete.  
 10.1919 +
 10.1920 +Hence, VP based runtimes revolve around storing suspended VPs inside structures that embody the constraints on when the VP can  resume.  But task based runtimes revolve around the conditions upon which to create new tasks, and the organization of the inputs to them.  The runtimes for hybrid languages have characteristics of both.
 10.1921 +
 10.1922 +
 10.1923 +\subsection{Domain specific parallel languages}
 10.1924 +\label{subsec:DomSpecLangs}
 10.1935 +
 10.1936 +Now we'll talk about the sub-class of Domain Specific Languages (DSLs): what sets them apart from other parallel languages, how they potentially solve the issues with parallel programming, and the implications for their runtime implementations.
 10.1937 +
DSLs can be any of the three basic language types (VP based, task based, or hybrid), but they are distinguished by having constructs that correspond to features of one narrow domain of applications.  For example, we have implemented a DSL that is just for use in building hardware simulators [cite the HWSim wiki].  Its constructs embody the structure of simulators, and make building one fast and even simpler than when using a sequential language, as will be shown in Subsection [].  The programmer doesn't think about concurrency, nor even about control flow; they simply define the behavior of individual hardware elements and connect the elements to each other.
 10.1939 +
It is this fit between language constructs and the mental model of the application that makes DSLs highly productive and easy to learn; at the same time, it is also what makes applications written in them more portable.  Application patterns that have a strong impact on parallel performance are captured as language constructs.  The rest of the source code has less impact on parallel performance, so porting just the language is enough to get high performance on each hardware target.
 10.1941 +
In practice, designing such a language is an art, and for some hardware targets the language can become intrusive.  For example, when porting to GPGPUs, performance is driven by decomposition into many small, simple kernels that access memory in contiguous chunks.  Fitting into this pattern forces rearrangement of the base sequential code, and even constrains the choice of algorithm.  Hence, a DSL that is portable to standard architectures as well as GPUs would impose the GPU restrictions onto the code for all machines.  However, much excellent work [polyhedral, others] is being done on automated tools to transform standard code to GPU form, which would lift these restrictions.  Also, constructs such as the DKU pattern [] map well onto GPUs as well as standard hardware.
 10.1943 +
 10.1944 +\subsection{The embedded style of DSL}
 10.1945 +\label{subsec:EmbeddedDSLs}
 10.1956 +
We segue now into the embedded style of language, and show how the work of implementing such languages is mainly the work of implementing their runtime plus their complex domain constructs. We focus on embedded-style domain specific languages because this is the lowest-effort form of DSL to create, and making DSLs practical requires that they be low effort both to create and to port to various hardware targets.
 10.1958 +
 10.1959 +
An embedded-style language is one that uses the syntax of a base language, like C or Java, and adds constructs that are specific to the domain. An added construct may be expressed in custom syntax that is translated into a library call, or else directly invoked by making a library call, as illustrated in Figure \ref{fig:EmbeddedEx}. Inside the library call, a primitive is used to escape the base language and enter the embedded language's runtime, which then performs the behavior of the construct.
 10.1961 +
 10.1962 +
 10.1963 +\begin{figure}[h!tb]
 10.1964 +{\noindent
 10.1965 +{\footnotesize
 10.1966 +{\normalsize Creating a new virtual processor (VP):}
 10.1967 +\begin{verbatim}
 10.1968 +newVP = SSR__create_VP( &top_VP_fn, paramsPtr, animatingVP );
 10.1969 +\end{verbatim}
 10.1970 +
 10.1971 +{\noindent {\normalsize sending a message between VPs:}}
 10.1972 +\begin{verbatim}
 10.1973 +SSR__send_from_to( messagePtr, sendingVP, receivingVP );
 10.1974 +\end{verbatim}
 10.1975 +
 10.1976 +{\noindent {\normalsize receiving the message (executed in a different VP):}}
 10.1977 +\begin{verbatim}
 10.1978 +messagePtr = SSR__receive_from_to( sendingVP, receivingVP );
 10.1979 +\end{verbatim}
 10.1980 +}
 10.1981 +}
 10.1982 +
 10.1983 +\caption
 10.1984 +{Examples of invoking embedded-style  constructs.
 10.1985 +}
 10.1986 +\label{fig:EmbeddedEx}
 10.1987 +\end{figure}
An embedded-style language differs from a library in that it has a runtime system, and a way to switch from the behavior of the base language to the behavior inside the runtime.  In contrast, libraries never leave the base language.  Notice that this means, for example, that a POSIX threads library is not a library at all, but an embedded language.
 10.1989 +
As a practical matter, embedded-style constructs normally have a thin wrapper that invokes the runtime. However, some DSLs perform significant work inside the library before switching to the runtime, or after returning from it.  These look more like traditional libraries, but they still involve an escape from the base language and, more importantly, are designed to work in concert with the parallel aspects of the language. They concentrate key performance-critical aspects of the application inside the language, such as dividing work up or, for example, implementing a solver for differential equations that accepts structures created by the divider.
 10.1991 +
It is the appearance of constructs as library calls that brings the low-disruption benefit of embedded-style DSLs.  The syntax is that of the base language, so existing development tools and work flows remain intact when moving to an embedded-style DSL.  In addition, the fit between domain concepts and language constructs minimizes mental-model disruption when switching, and makes the learning curve for adopting the DSL very low. 
 10.1993 +
 10.1994 +\subsection{Application programmer's view of embedded-style DSLs}
 10.1995 +\label{subsec:AppProgViewOfDSL}
 10.2006 +
Well-designed DSLs have very few constructs, yet capture the most performance-critical domain patterns in a way that feels natural to the application programmer.  This often means that data structures and usage patterns are part of the language. 
 10.2008 +
For example, a linear-equation-solving language would define a standard data structure for the coefficients of the equations, and supply a construct by which the language is asked to perform the work of solving them. This feels very much like a library, but the runtime system dynamically divides the work according to the hardware, and implements communication between cores and a scheduler that load-balances and tries to take advantage of data affinity and even computational accelerators.  All of this puts performance in the hands of the runtime while remaining simple to use.
 10.2010 +
 10.2011 +An example of a DSL that we created using the proto-runtime approach is HWSim [], which is designed to be used for writing architectural simulators. 
 10.2012 +
When using HWSim, a simulator application is composed of just three things: a netlist, behavior functions, and timing functions. These are all sequential code that calls HWSim constructs at boundaries, such as the end of a behavior, and uses HWSim-supplied data structures. To use HWSim, one creates a netlist composed of elements and the communication paths that connect them.  A communication path connects an outport of the sending element to an inport of the receiving element. An action is then attached to the inport and is triggered when a communication arrives. The action has a behavior function, which changes the state of the element, and a timing function, which calculates how much simulated time the behavior takes.   
 10.2014 +
The language itself consists of only a few standard data structures, such as \texttt{Netlist}, \texttt{Inport}, and \texttt{Outport}, and a small number of constructs, such as \texttt{send\_comm} and \texttt{end\_behavior}.  The advancement of simulated time is performed by a triggered action, and so is implied. The parallelism is likewise implied: the only constraint on the order of executing actions is consistency.  
 10.2016 +
The only parallelism-related restriction is that a behavior function may only use data local to the element it is attached to.   Parallel work arises within the system from outports that connect to multiple destination inports, so that one output triggers multiple actions, and from behavior functions that each generate multiple output communications.
 10.2018 +
Overall, simulator writers have fewer issues to deal with, because time-related code has been brought inside the language, where it is reused across simulators, and because parallelism issues reduce simply to keeping each behavior restricted to data local to its attached element.  Both of these increase the productivity of simulator writers, despite their using a parallel language.  The language has so few commands that it takes only a matter of days to become proficient (as demonstrated informally by new users of HWSim).  Also, parallelism-related constructs in the language are generic across hardware, eliminating the need to modify application code when porting to new hardware (if the language is used according to the recommended coding style).     
 10.2020 +
 10.2021 +\subsection{Implementation of Embedded-style DSLs}
 10.2032 +
 10.2033 +When it comes to implementing an embedded-style of DSL, the bulk of the effort is in the runtime and the more complex domain specific constructs.
 10.2034 +
Examples of constructs implemented for DSLs include Abstract Data Types (ADTs) such as linked lists, hash tables, and priority queues, as well as full algorithms such as solvers for systems of equations, or even linear algebra operations on matrices. It will be seen in subsection [] that the proto-runtime approach causes the implementation of such constructs to be reused, with high performance, across all the hardware targets in a hardware class, such as the class of shared-memory multi-core platforms. 
 10.2036 +
 10.2037 +In addition, embedded style DSLs rely heavily on data types that are part of the language.  These are often domain-specific such as \texttt{Netlist}, \texttt{Inport}, and \texttt{Outport} in HWSim, or \texttt{Protein} in a bio-informatics DSL, but can also be common such as \texttt{SparseMatrix} in domains like data mining and scientific applications.
 10.2038 +
 10.2039 +
During language design, common patterns that consume significant development time or computation are placed into the language. Also, any patterns that expose the hardware configuration, such as the number and size of pieces of work, should be pulled into the language to aid portability. 
 10.2041 +
If such design is successful, then porting the application reduces to just porting the language. When the language has successfully captured the main computational patterns of the domain, the application code encapsulates only a small portion of the performance, so it does not need to be tuned. Further, when patterns that expose hardware-motivated choices or hardware-specific commands are in the language, the application code has nothing that needs to change when the hardware changes.
 10.2043 +
 10.2044 +For example, HWSim pulls hardware-specific patterns inside the language by handling all inter-core communications inside the language, and also by aggregating multiple elements together on the same core to tune work-unit size.    
 10.2045 +
 10.2046 +The advantage of placing these into the language, instead of application code, is portability and productivity.
 10.2047 +
 10.2048 +
 10.2049 +\subsection{Implementation Details of Embedded-style DSLs}
 10.2050 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
 10.2051 +
 10.2052 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
 10.2053 +
 10.2054 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
 10.2055 +
 10.2056 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
 10.2057 +
 10.2058 +
 10.2059 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
 10.2060 +
 10.2062 +
10.2063 +Figure [] shows the implementation of the wrapper library for HWSim's send\_and\_idle construct, which sends a communication on the specified outport and then causes the sending element to go idle. Of note is the packaging of information for the runtime: the information is placed into the HWSimSemReq data structure, and then the application work is ended by switching to the runtime. The switch is made via the send\_and\_suspend call, a primitive implemented in assembly that jumps out of the base C language and into the runtime.
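As a sketch of the shape of such a wrapper (the names \texttt{HWSimSemReq} and \texttt{send\_and\_suspend} come from the text above; the request fields, the enum tag, and the stub body standing in for the assembly primitive are assumptions for illustration):

```c
/* Hypothetical request tag and carrier layout -- HWSimSemReq and
 * send_and_suspend are named in the text; everything else here is an
 * assumption for illustration. */
typedef enum { HWSIM_SEND_AND_IDLE } HWSimReqType;

typedef struct {
    HWSimReqType reqType;  /* which construct the runtime should perform */
    void        *outport;  /* the outport to send the communication on   */
    void        *msg;      /* the communication payload                  */
} HWSimSemReq;

/* Stub standing in for the assembly primitive that suspends base-language
 * execution and switches the core to the runtime; here it just records
 * the request so the sketch is self-contained. */
static HWSimSemReq lastReq;
static void send_and_suspend(void *reqData)
{
    lastReq = *(HWSimSemReq *)reqData;
}

/* The thin wrapper: package the information, then switch to the runtime. */
void HWSim__send_and_idle(void *outport, void *msg)
{
    HWSimSemReq req;
    req.reqType = HWSIM_SEND_AND_IDLE;
    req.outport = outport;
    req.msg     = msg;
    send_and_suspend(&req);  /* returns only when the runtime resumes us */
}
```

The wrapper itself contains no construct logic; it only marshals data and performs the switch, which is what keeps wrapper libraries thin.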
 10.2064 +
10.2065 +The switch to the runtime can be done in multiple ways.  Our proto-runtime uses assembly to manipulate the stack and registers. For the POSIX threads language, as implemented on Linux, the hardware trap instruction is used to switch from the application to the OS; the OS serves as the runtime that implements the thread behavior. 
 10.2066 +
10.2067 +Construct implementations use the core differently in VP-based languages than in task-based languages.
 10.2068 +
10.2069 +For VP-based languages, once inside the runtime, a synchronization construct performs the behavior shown abstractly in Figure []. In essence, a synchronization construct is a variable-length delay, which waits for activities outside the calling code to cause specific conditions to become true.  These activities could be actions taken by other pieces of application code, such as releasing a lock, or they could be hardware related, such as waiting for a DMA transfer to complete.  
 10.2070 +
10.2071 +While one piece of application code (in a VP) is suspended, waiting, other pieces can use the core to perform their work, as long as the conditions for those other pieces are satisfied. Hence, the runtime's construct implementation checks whether conditions are met, and if not, stores the suspended piece (VP). If the construct can change conditions for others, it updates them; for example, the lock-release construct updates state for VPs waiting for the lock.  Separately, for VPs whose conditions have been met, when a core becomes available, the runtime chooses which VP to assign to which core.  
 10.2072 +
10.2073 +A construct thus performs two behaviors inside the runtime: managing the conditions on which work becomes free, and managing the assignment of free work onto cores.
 10.2074 +
10.2075 +For task-based languages, a task runs to completion and then always switches to the runtime at the end; hence, no suspend and resume exists. Once inside, the runtime's job is to track the conditions under which tasks become ready to run, or under which to create them.  For example, in dataflow, a task is created only once all conditions for starting it are met.  Hence, the only language constructs are ``instantiate a task-creator'', ``connect a task-creator to others'', and ``end a task''.  During a run, all of the runtime behavior takes place inside the ``end a task'' construct, where the runtime sends outputs from the ending task to the inputs of connected task-creators.  The ``send'' action modifies internal runtime state, which represents the order of inputs to a creator on all of its input ports. When all inputs are ready, the runtime creates a new task, and then, when hardware is ready, assigns the task to a core.
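The "end a task" bookkeeping described above can be sketched as follows; every name, and the simplification to a single-slot queue per input port, are assumptions for illustration rather than HWSim or proto-runtime code:

```c
#include <stdbool.h>

/* Hypothetical runtime state behind a dataflow task-creator. */
#define MAX_PORTS 4

typedef struct {
    int   numPorts;
    void *pending[MAX_PORTS];  /* the queued input on each port (depth 1) */
    bool  ready[MAX_PORTS];
    int   numTasksCreated;     /* stands in for "create task, assign core" */
} TaskCreator;

/* Inside the "end a task" construct: deliver one output of the ending
 * task to a connected creator's input port, and create a new task once
 * every port holds an input. */
void runtime__deliver_output(TaskCreator *tc, int port, void *data)
{
    tc->pending[port] = data;
    tc->ready[port]   = true;

    for (int p = 0; p < tc->numPorts; p++)
        if (!tc->ready[p]) return;     /* some input still missing */

    tc->numTasksCreated++;             /* all ready: create a new task */
    for (int p = 0; p < tc->numPorts; p++)
        tc->ready[p] = false;          /* consume the inputs */
}
```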
 10.2076 +
 10.2077 +
10.2078 +One survey[] discusses DSLs for a variety of domains; the following list of DSLs is reproduced from that paper:
 10.2079 +\begin{itemize} 
 10.2080 +\item In Software Engineering: Financial products [12, 22, 24], behavior control and coordination [9, 10], software architectures [54], and databases [39].
 10.2081 +\item Systems Software:  Description and analysis of abstract syntax trees [77, 19, 51], video device driver specifications [76], cache coherence protocols [15], data structures in C [72], and operating system specialization [63].
 10.2082 +\item Multi-Media: Web computing [14, 35, 4, 33], image manipulation [73], 3D animation [29], and drawing [44].
 10.2083 +\item Telecommunications: String and tree languages for model checking [48], communication protocols [6], telecommunication switches [50], and signature computing [11].
 10.2084 +\item Miscellaneous: Simulation [2, 13], mobile agents [36], robot control [61], solving partial differential equations [26], and digital hardware design [41].
 10.2085 +\end{itemize}
 10.2086 +
 10.2087 +\subsection{Summary of Section}
 10.2098 +
10.2099 +This section illustrated the promise of DSLs for solving the issues of parallel programming. The HWSim example showed that well-designed parallel DSLs can actually improve productivity and have a low learning curve, as well as reduce the need to touch application code when moving to new target hardware.  The section showed that the effort of implementing an embedded-style DSL is mainly that of implementing its runtime and complex domain constructs, and that a well-designed DSL captures most of the performance-critical aspects of an application inside the DSL constructs. Hence, porting effort reduces to just performance-tuning the language (with caveats for some hardware). This effort is, in turn, reused by all the applications that use the DSL.
 10.2100 +
10.2101 +The stumbling point of DSLs is the small number of users; after all, how many people write hardware simulators? Perhaps a few thousand people a year write or modify applications suitable for HWSim. That means the effort to implement HWSim has to be so low as to make it no more effort than writing a library, effectively a small percentage of a simulator project.  
 10.2102 +
 10.2103 +The runtime is a major piece of the DSL implementation, so reducing the effort of implementing the runtime goes a long way to reducing the effort of implementing a new DSL. 
 10.2104 +
 10.2105 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 10.2106 +\section{Description}
 10.2107 +\label{sec:idea}
 10.2120 + 
 10.2121 +
10.2122 +Now that we have made the case that embedded-style DSLs have the potential to solve many parallel programming issues, and that a major obstacle to their uptake is their implementation effort, we describe the proto-runtime concept and show how it addresses this obstacle. As shown, embedded-style DSL implementation and porting effort is mainly that of creating the runtime and implementing the more complex language constructs. We show here that the proto-runtime approach dramatically reduces the effort of creating a DSL runtime, through a number of features.
 10.2123 +
 10.2124 +
 10.2125 +\begin{figure}[ht]
 10.2126 +  \centering
 10.2127 +  \includegraphics[width = 2in, height = 1.8in]{../figures/PR_three_pieces.pdf}
10.2128 +  \caption{How the proto-runtime approach modularizes the implementation of a runtime: the three pieces are the proto-runtime implementation, an implementation of the language's construct behaviors, and an implementation of the portion of a scheduler that chooses which work is assigned to which processor. }
 10.2129 +  \label{fig:PR_three_pieces}
 10.2130 +\end{figure}
 10.2131 +
 10.2132 +
10.2133 +The main feature is the proto-runtime's approach to modularizing the runtime code. As shown in Fig.~\ref{fig:PR_three_pieces}, it breaks the runtime into three pieces: a cross-language piece, which is the proto-runtime implementation; a piece that implements the language's constructs and plugs into the proto-runtime; and a piece that assigns work onto hardware and also plugs into the proto-runtime.
 10.2134 +
10.2135 +The modularization appears to remain valid across parallel languages and execution models, and we present underlying patterns that support this observation.  We analyze the basic structure of a synchronization construct and point out how the proto-runtime modularization is consistent with it.
 10.2136 +
 10.2137 +\subsection{Creating an eDSL}
 10.2138 +
 10.2139 +
 10.2140 +\begin{figure}[ht]
 10.2141 +  \centering
 10.2142 +  \includegraphics[width = 2in, height = 1.8in]{../figures/eDSL_two_pieces.pdf}
 10.2143 +  \caption{An embedded style DSL consists of two parts: a runtime and a wrapper library that invokes the runtime}
 10.2144 +  \label{fig:eDSL_two_pieces}
 10.2145 +\end{figure}
 10.2146 + 
10.2147 +As shown in Fig.~\ref{fig:eDSL_two_pieces}, to create an embedded-style DSL (eDSL), one does two things: create the runtime, and create a wrapper library that invokes the runtime and also implements the more complex language constructs.
 10.2148 +
10.2149 +As seen in Fig X, a library call that invokes a language construct is normally a thin wrapper that only communicates with the runtime. It places the information to be sent to the runtime into a carrier, then invokes the runtime via a primitive. The primitive suspends the base-language execution and switches the processor over to the runtime code.
 10.2150 +
 10.2151 +\subsection{The Proto-Runtime Modularization}
 10.2152 +
 10.2153 +\subsubsection{Dispatch pattern}
 10.2154 +-- standardizes runtime code
 10.2155 +-- makes familiar going from one lang to another
 10.2156 +-- makes reuse realistic, as demonstrated by VSs taking SSR constructs
 10.2157 +
 10.2158 +-- show the enums, and the switch table
 10.2159 +
 10.2160 +-- point out how the handler receives critical info -- the semEnv, req struct and calling slave
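A hedged sketch of the dispatch pattern these notes describe: the wrapper library tags each request with an enum, and the plugin's single entry point switches on that tag, passing each handler the critical info (the semantic env, the request struct, and the calling slave). All identifiers are assumptions for illustration, and the handler bodies are stubs:

```c
/* Hypothetical request tags for a few constructs of one language. */
typedef enum { REQ__SEND_AND_IDLE, REQ__CREATE_ELEM, REQ__END_ELEM } ReqType;

typedef struct { ReqType reqType; /* plus construct-specific fields */ } SemReq;
typedef struct SlaveVP SlaveVP;   /* the calling slave (opaque here)  */
typedef struct SemEnv  SemEnv;    /* the language's runtime state     */

static ReqType lastHandled;       /* stand-in for real handler bodies */
static void handleSendAndIdle(SemReq *r, SlaveVP *s, SemEnv *e)
{ (void)s; (void)e; lastHandled = r->reqType; }
static void handleCreateElem(SemReq *r, SlaveVP *s, SemEnv *e)
{ (void)s; (void)e; lastHandled = r->reqType; }
static void handleEndElem(SemReq *r, SlaveVP *s, SemEnv *e)
{ (void)s; (void)e; lastHandled = r->reqType; }

/* The plugin's request dispatcher: one switch table per language, so
 * adding a construct means adding one enum value, one case, and one
 * handler -- which is what makes handlers reusable across languages. */
void lang__dispatch(SemReq *req, SlaveVP *callingSlave, SemEnv *semEnv)
{
    switch (req->reqType) {
        case REQ__SEND_AND_IDLE: handleSendAndIdle(req, callingSlave, semEnv); break;
        case REQ__CREATE_ELEM:   handleCreateElem (req, callingSlave, semEnv); break;
        case REQ__END_ELEM:      handleEndElem    (req, callingSlave, semEnv); break;
    }
}
```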
 10.2161 +
 10.2162 +\subsubsection{The Request Handler}
 10.2163 +-- cover what a request handler does.. connect it to the wrapper lib, and the info loaded into a request struct.
 10.2164 +
 10.2165 +-- give code of a request handler.. within on-going example of implementing pthreads, or possibly HWSim, or pick a new DSL 
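As a hedged example of a request-handler pair, here is a mutex acquire/release in the style described earlier for VP-based languages: check the condition, and either free the caller or store it as suspended. All names are assumptions, and \texttt{PR\_\_make\_ready} is a stub standing in for the proto-runtime call that marks a VP free for the assigner:

```c
/* Hypothetical runtime-internal state for one mutex. */
#define MAX_WAITERS 8

typedef struct SlaveVP SlaveVP;     /* opaque VP handle */

typedef struct {
    SlaveVP *owner;                 /* NULL when the lock is free */
    SlaveVP *waiters[MAX_WAITERS];  /* VPs suspended on this lock */
    int      numWaiters;
} MutexState;

/* Stub for the proto-runtime call that marks a VP free, so the assigner
 * can pick it when a core becomes available. */
static int numMadeReady;
static void PR__make_ready(SlaveVP *vp) { (void)vp; numMadeReady++; }

void handleMutexAcquire(MutexState *m, SlaveVP *caller)
{
    if (m->owner == 0) {                  /* condition met: grant the lock */
        m->owner = caller;
        PR__make_ready(caller);
    } else {                              /* not met: store the suspended VP */
        m->waiters[m->numWaiters++] = caller;
    }
}

void handleMutexRelease(MutexState *m, SlaveVP *caller)
{
    m->owner = 0;
    if (m->numWaiters > 0) {              /* update conditions for a waiter */
        m->owner = m->waiters[--m->numWaiters];
        PR__make_ready(m->owner);
    }
    PR__make_ready(caller);               /* the releaser continues */
}
```

Note the handler is purely sequential: the proto-runtime owns the concurrency around it, which is part of the claimed effort reduction.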
 10.2166 +
 10.2167 +\subsection{Exporting a performance-oriented machine view }
 10.2168 +The proto-runtime interface exports a view of the machine that shows performance-critical aspects.  Machines that share the same architectural approach have the same performance-critical aspects, and differ only in the values. 
 10.2169 +
 10.2170 +For example, the interface models cache-coherent shared-memory architectures  as a collection of memory pools connected by networks.  The essential variations among processor-chips are the sizes of the pools, the connections between them, such as which cores share the same L2 cache, and the latency and bandwidth between them.
 10.2171 +
 10.2172 +Hence, a single plugin can be written that gathers this information from the proto-runtime and uses it when deciding which work to assign to which core.  Such a plugin will then be efficient across all machines that share the same basic architecture.
 10.2173 +
 10.2174 +This saves significant effort by allowing the same plugin to be reused for all the machines in the category.
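A sketch of what such an exported machine view might look like, and how an assigner plugin might consume it; the types, field names, and array bounds are assumptions for illustration:

```c
#include <stdbool.h>

/* Hypothetical view for cache-coherent shared-memory machines: memory
 * pools connected by links; machines in the class differ only in values. */
typedef struct {
    long sizeBytes;      /* size of this pool (e.g. an L2 cache) */
    int  numSharers;     /* how many cores share this pool       */
    int  sharerIDs[64];  /* which cores those are                */
} MemPool;

typedef struct {
    int  fromPool, toPool;
    long latencyNs;      /* latency between the two pools        */
    long bandwidthMBs;   /* bandwidth between them               */
} PoolLink;

typedef struct {
    int      numCores, numPools, numLinks;
    MemPool  pools[16];
    PoolLink links[32];
} MachineView;

/* One way an assigner plugin might use the view: prefer placing two
 * communicating work-units on cores that share a pool. */
bool cores_share_pool(const MachineView *mv, int coreA, int coreB)
{
    for (int p = 0; p < mv->numPools; p++) {
        bool hasA = false, hasB = false;
        for (int s = 0; s < mv->pools[p].numSharers; s++) {
            if (mv->pools[p].sharerIDs[s] == coreA) hasA = true;
            if (mv->pools[p].sharerIDs[s] == coreB) hasB = true;
        }
        if (hasA && hasB) return true;
    }
    return false;
}
```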
 10.2175 + 
 10.2176 +\subsection{Services Provided by the Proto-runtime}
 10.2177 +
 10.2178 +-- Put services into the low-level piece..  plugins have those available, and inherit lang independent such as debugging, perf counters..  provides effort reduction because lang doesn't have to implement these services.
 10.2179 +
10.2180 +-- -- examples of inherited lang services inside current proto-runtime: debugging and perf-tuning..  verification, playback have been started (?)
 10.2181 +
 10.2182 +-- -- examples of plugin services: creation of base VP, the switch primitives, the dispatch pattern (which reduces effort by cleanly separating code for each construct), handling consistency model (?), handling concurrency
 10.2183 +
 10.2184 +\subsection{eDSLs talking to each other}
 10.2185 +-- show how VSs is example of three different DSLs, and H264 code is three different languages interacting (pthreads, OpenMP, StarSs)
 10.2186 +
 10.2187 +-- make case that proto-runtime is what makes this practical !  Their point of interaction is the common proto-runtime innards, which provides the interaction services.. they all use the same proto-runtime, and all have common proto-runtime objects, which is how the interaction becomes possible.
 10.2188 +
 10.2189 +\subsection{The Proto-runtime Approach Within the Big Picture}
 10.2190 +
 10.2191 +-- Give background on industry-wide, how have langs times machines..  
 10.2192 +-- say that proto-runtime has synergistic advantages within this context. -- repeat that eDSLs talk to each other.
 10.2193 +-- give subsubsection on MetaBorg for rewriting eDSL syntax into base lang syntax.
10.2194 +-- bring up the tools issue with custom syntax -- compiling is covered by metaborg re-writing..  can address debugging with eclipse.. should be possible in straightforward way that covers ALL eDSLs.. their custom syntax being stepped through in one window, and stepping through what they generate in separate window (by integrating generation step into eclipse).. even adding eclipse understanding of proto-runtime.. so tracks the sequence of scheduling units..  and shows the request handling in action in third window..
 10.2195 + 
 10.2196 +Preview idea that many players will contribute, and will get people that specialize in creating new eDSLs (such as one of authors)..
 10.2197 +-- For them, code-reuse is reality, as supported by VSs example, 
 10.2198 +-- and the uniformity of the pattern becomes familiar, also speeding up development, as also supported by VSs, HWSim, VOMP, and DKU examples.
 10.2199 +-- for those who only create a single eDSL, the pattern becomes a lowering of the learning curve, aiding adoption
 10.2200 +
 10.2201 +-- Restate and summarize the points below (covered above), showing how they combine to shrink the wide-spot where all the runtimes are. 
 10.2202 +
 10.2203 +-- The low-level part implemented on each machine, exports a view of the machine that shows performance-critical aspects
 10.2204 +
 10.2205 +-- Collect machines into groups, based on performance critical aspects of hardware.. provides reduction in effort because only one plugin for entire group. 
 10.2206 +
 10.2207 +-- Put services into the low-level piece..  plugins have those available, and inherit lang independent such as debugging..  provides effort reduction because lang doesn't have to implement these services.
 10.2208 +
 10.2209 +
 10.2210 +\section{(outline and notes)}
 10.2211 +
 10.2212 +-- What a plugin looks like: 
 10.2213 +
 10.2214 +-- -- pattern of parallel constructs.. ideas of Timeline, tie-point, animation, suspension, VP states, constraints, causality, work-units, meta-units, updates in constraint states attached to the meta-units
 10.2215 +
10.2216 +-- -- a sync construct is something that creates a tie between two work-units.  So, the logic of the construct simply establishes causality -- the ending of one work-unit causes the freedom to start animation of another.  
 10.2217 +
10.2218 +-- -- --  Examples: mutex is end of work-unit that frees lock causes freedom to start work-unit that gets the lock.  They are causally tied.  The semantics of the construct is: the particular conditions existing inside the runtime (in this case, ownership of the mutex); what changes those conditions (in this case, releasing the lock removes one VP as owner, and acquire-lock marks one as wanting the lock); and how those changes affect what makes a work-unit free to be animated (in this case, removal of ownership must precede gaining ownership, and being given ownership of the mutex is what frees the work-unit).
 10.2219 +
10.2220 +-- Hence, precisely, the parallelism model of the language defines constraints, which are implemented as state inside the runtime. The constructs provided do a number of things: signal bringing a set of constraints into existence (create a mutex); signal updates to the state of those constraints (release mutex, state desire to acquire); trigger the runtime to propagate those changes, which results in additional changes to states, including marking meta-units as free to be animated; and cause creation of meta-units (explicitly as in VSs, via creating entities that trigger creation as in dataflow, or via creating entities that consist of consecutive work-units as in pthreads).
 10.2221 +
 10.2222 +
 10.2223 +-- Recipe for how to make the language plugin: time reduction is part due to simplifying the parallelism construct logic..  
 10.2224 +
 10.2225 +
 10.2226 +
 10.2227 +
 10.2228 +\subsection{The Cross-language Patterns Behind the Proto-runtime}
 10.2229 +
 10.2240 +
10.2241 +An application switches to the runtime, which does scheduling work and then switches back to application code.
 10.2242 +
 10.2243 +
 10.2244 +\subsection{Some Definitions}
 10.2245 +
 10.2256 +
10.2257 +We adopt the concepts of work-unit, virtual processor (VP), animation, and tie-point as discussed in a previous paper []. A work-unit is the trace of instructions executed between two successive switches to the runtime, along with the data consumed and produced during that trace.  A virtual processor is able to animate either the code of a work-unit or else another VP, and has state that it uses during animation, organized as a stack.  Animation is defined as causing the time of a virtual processor to advance, which is equivalent to causing state changes according to instructions; suspension halts animation, and consequently ends a work-unit (a more complete definition of animation can be found in the dissertation of Halle[]).  A tie-point connects the end of one work-unit to the beginning of one in a different VP; thus a tie-point represents a causal relationship between two work-units and establishes an ordering between them, effectively tying the time-line of the VP animating one to the time-line of the VP animating the other.
 10.2258 +
10.2259 +In addition, we introduce a definition of the word task: a single work-unit coupled to a virtual processor that comes into existence to animate the work-unit and dissipates at its completion.  By the definition of work-unit, a task cannot suspend, but rather runs to completion.  If the language defines an entity with a timeline that can be suspended by switching to the runtime, then such an entity is not a task. Pure Dataflow[] specifies tasks that fit our definition.
 10.2260 +
 10.2261 +\subsection{Handling Memory Consistency Models}
 10.2262 +
 10.2273 +
 10.2274 +Weak memory models can cause undesired behavior when work-units on different cores communicate through shared variables.  Specifically, the receiving work-unit can see memory operations complete in a different order than the code of the sending work-unit specifies.
 10.2275 +
 10.2276 +For example, consider a proto-runtime implemented on shared memory hardware that has a weak consistency model, along with a language that implements a traditional mutex lock.  All memory operations performed in the VP that releases the lock should be seen as complete by the VP that next acquires the lock.  
 10.2277 +
10.2278 +It is up to the proto-runtime to enforce this, using hardware primitives.  It has to ensure that all memory operations performed by a task or VP before switching to the runtime are completed before any dependent task or VP is switched into from the runtime.  More precisely, the proto-runtime has to ensure that all memory operations performed by a work-unit are visible, in program order, to any tied work-units. In some cases the language plugin has to alert the proto-runtime to the causality between work-units.
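One way to sketch this obligation is with C11 fences (an assumption for illustration; the actual proto-runtime uses machine-specific instructions inside its assembly switch primitives). A release fence before publishing "work-unit ended", paired with an acquire fence before resuming a tied work-unit, makes the sender's prior stores visible to the receiver:

```c
#include <stdatomic.h>

/* Hypothetical names; the pattern is the standard fence-based
 * release/acquire hand-off. */
void PR__publish_work_unit_end(atomic_int *unitDone)
{
    atomic_thread_fence(memory_order_release);      /* drain prior stores */
    atomic_store_explicit(unitDone, 1, memory_order_relaxed);
}

int PR__may_resume_dependent(atomic_int *unitDone)
{
    if (!atomic_load_explicit(unitDone, memory_order_relaxed))
        return 0;                                   /* not yet safe       */
    atomic_thread_fence(memory_order_acquire);      /* order later loads  */
    return 1;                                       /* safe to switch in  */
}
```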
 10.2279 +
 10.2280 +
 10.2281 +The proto-runtime does not, however, protect application code that attempts to communicate between VPs or tasks directly, without using a parallelism construct to protect the communication.
 10.2282 +
 10.2283 +
 10.2284 +
 10.2285 +=======
 10.2286 +
 10.2287 +  I plan to explain VMS as a universal pattern that exists in all runtimes: that is, that the application switches to runtime, which does a scheduling decision and then switches back.  I'll explain it first with just master and slaves, leaving out the core\_loop.  Explain it as a normal runtime that has had two key pieces removed and replaced with interfaces.  The language supplies the missing pieces.  Then, introduce the core\_loop stuff as a performance enhancement used when lock acquisition dominates (as it does on the 4 socket 40 core machine).
10.2288 +   Next, give HWSim as an example of a real domain-specific language (it's working; ref manual attached), and focus on how the modularity allowed pulling constructs from other languages (singleton and atomic), and a breakdown of implementation time vs design time, and so on.  Highlight how VMS's features for productivity and encapsulation solve the practical problems for domain-specific languages.
 10.2289 +   Finally, show that VMS performance is good enough, by going head-to-head with pthreads and OpenMP (doing a VMS OpenMP implementation now).  And also StarSs if I have time.  I'll run overhead-measuring on them, and also regular benchmarks.
 10.2290 +
 10.2291 +=================
 10.2292 +
 10.2293 +\subsection{The patterns}
 10.2304 +
 10.2305 +
 10.2306 +Soln: modularize runtime, to reduce part have to mess with, hide part that has low-level details, reuse low-level tuning effort, and reuse lang-spec parts.
 10.2307 +
 10.2308 +Benefits: lang impl doesn't have to touch low-level details, inherit centralized services, can reuse code from other languages to add features.
 10.2309 +
 10.2310 +Performance must be high, or the labor savings don't matter.  By isolating the low-level details inside the proto-runtime, they can be intensively tuned, then all the languages inherit the effort. 
 10.2311 +
 10.2312 +Part of what makes this so easy is the dispatch pattern.. adding a construct reduces to adding into switch and writing handler..  borrow constructs by taking the handler from the other lang.
 10.2313 +
10.2314 +Compare that to current practices, where the runtime code is monolithic: each language has to separately modify the runtime, understanding and dealing with the concurrency, and then on a new machine each language has to re-tune the low-level details, worrying about the consistency model on that machine, how its particular fence and atomic instructions work, and so on.
10.2315 +We spent two months performance-tuning the current version, but only 18 hours implementing VSs on top of it, and VSs inherited the benefit from all that effort.  So did VOMP, SSR, VCilk, and so on: each time we improved the proto-runtime, all the languages improved, with no effort on the part of the language creator. 
 10.2316 +
 10.2317 +
 10.2318 +\subsubsection{Views of synchronization constructs}
 10.2329 +
10.2330 +One view of sync constructs is that they are variable-length calls; the basic hardware achieves this by stalling the pipeline.
 10.2332 +
10.2333 +Another view is that they mark the boundary of a communication made via shared read/write.  A load or store of a single location has a precise boundary enforced by the hardware, but if a pipeline wants to load, modify, then write a single location, it needs additional hardware: it has to make the multiple primitive load/store operations appear as a single operation.
 10.2334 +
10.2335 +Moving up to the application level, the same pattern exists: an operation the application wants to perform may involve many loads and stores, but the collection must appear as a single indivisible operation.  So the application-level equivalent of a load or store involves multiple memory locations but is treated as a single indivisible operation.  This requires the application-level equivalent of the hardware that made the read-modify-write into a single indivisible operation; that equivalent is what a synchronization construct is.  The reason a sync construct takes a variable amount of time is that it waits until all other indivisible operations that might conflict have completed.
 10.2336 +
10.2337 +Another way to think of a sync construct is that it enforces sharp communication boundaries.  The multiple read and write operations are treated as a single communication with the shared state.  If any other part of the application sees only part of the communication, it sees something inconsistent, and thus wrong.  So sync constructs ensure that communications are complete, so that the parts of the application only see complete communications from other parts.  
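A standard illustration of this pattern (not code from the paper): a POSIX mutex serves as the application-level mechanism that gives a multi-location update a sharp boundary, so no other part of the application can observe only half of the communication:

```c
#include <pthread.h>

/* Two locations that together form one logical value; the names are
 * hypothetical. Invariant: the sum of the balances never changes. */
typedef struct {
    pthread_mutex_t lock;
    long balanceA, balanceB;
} Accounts;

/* The whole read-modify-write of both locations is one application-level
 * "store"; the lock marks its boundary. */
void transfer(Accounts *acc, long amount)
{
    pthread_mutex_lock(&acc->lock);    /* begin indivisible operation */
    acc->balanceA -= amount;
    acc->balanceB += amount;
    pthread_mutex_unlock(&acc->lock);  /* communication now complete  */
}
```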
 10.2338 +
 10.2339 +\subsubsection{Universal Runtime Patterns}
 10.2350 +
 10.2351 +A unified pattern exists within parallel languages: create multiple timelines, control their relative progress, and control the location at which each chunk of progress takes place.
 10.2352 +
 10.2353 +Another universal pattern: code runs, switches to the runtime, and at some later point switches back to the code, making an application run a collection of trace segments bounded by runtime calls.
 10.2354 +The runtime tracks constraints (dependencies) among units, creates and destroys units, and assigns ready units to hardware.
 10.2355 +
 10.2356 +Units have a life-line, which is fundamental to parallel computation, as demonstrated in a paper by some of the authors [].
 10.2357 +
 10.2358 +A unit is defined as the trace of application code that exists between two scheduling decisions.  Every unit has a meta-unit that represents it in the runtime.  Looking at this in more detail, every runtime keeps some form of internal bookkeeping state for a unit, used to track constraints on it and to decide when and where to execute it.  This state exists even if it is just a pointer to a function that sits in a queue.  We call this bookkeeping state for a unit the meta-unit.
 10.2359 +
 10.2360 +Each unit also has a life-line, which progresses so: creation of the meta-unit \pointer\ state updates that affect constraints on the unit \pointer\ the decision to animate the unit \pointer\ movement of the meta-unit plus data to the physical resources that do the animation \pointer\ animation of the unit, which does the work \pointer\ communication of the state-update that the unit has completed and the hardware is free \pointer\ constraint updates within the runtime, possibly causing creation of new meta-units or freeing other meta-units to be chosen for animation.  This repeats for each unit, and each step is part of the model.
 10.2361 +
 10.2362 +Note a few implications: first, many activities internal to the runtime are part of a unit's life-line and take place when only the meta-unit exists, before or after the work of the actual unit; second, communication internal to the runtime, such as state updates, is part of the unit's life-line; third, creation may be implied, as in pthreads, triggered, as in dataflow, or by explicit command, as in StarSs, and once created, a meta-unit may languish before the unit it represents is free to be animated.
 10.2363 +
 10.2364 +\subsubsection{Putting synchronization constructs together with universal runtime patterns}
 10.2375 +
 10.2376 +Putting these together gives us that any parallelism construct with synchronization behavior causes the end of a work-unit and a switch to the runtime.  The code following the construct is a different work-unit, which will begin after the constraint implied by the construct is satisfied.
 10.2377 +
 10.2378 +The runtime is made up of two parts: the infrastructure for constraints and assignment, such as communicating bookkeeping state between cores and protecting runtime-internal updates of shared information; and the logic of the constructs plus the logic of choosing an assignment of work to cores.
 10.2379 +
 10.2380 +For large machines, the infrastructure dominates the time to execute a parallelism construct, while for smaller machines, such as single-socket ones, the logic of constructs and assignment can be significant.
 10.2381 +
 10.2382 +\begin{figure}[ht]
 10.2383 +  \centering
 10.2384 +  \includegraphics[width = 2in, height = 1.8in]{../figures/SCG_stylized_for_expl.pdf}
 10.2385 +  \caption{Something to help understanding}
 10.2386 +  \label{fig:SCG_expl}
 10.2387 +\end{figure}
 10.2388 +
 10.2389 +
 10.2390 +
 10.2391 +
 10.2392 +%%%%%%%%%%%%%%%%%%%%%
 10.2393 +\section{The Details}
 10.2394 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]]
 10.2395 +[[Claims: modularize runtime, mod is fund patterns, mod sep lang logic from RT internals, mod makes internal reusable & lang inherit internal perf tune & inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
 10.2396 +
 10.2397 +The interfaces between the language logic and the proto-runtime.
 10.2398 +
 10.2399 +Demonstrate: the modular runtime; how it reduces the portion a language implementor has to touch, hides the part with low-level details, reuses the low-level tuning effort, and reuses the language-specific parts.
 10.2400 +
 10.2401 +Demonstrate benefits: the language implementation doesn't touch low-level details, inherits centralized services (debug support), and reuses code from other languages to add features.
 10.2402 +
 10.2403 +\subsection{Reuse of Language Logic}
 10.2414 +
 10.2415 +Demonstrate reuse of language logic:
 10.2416 +All the languages have copied the singleton, atomic, critical-section, and transaction constructs.  VOMP took the task code from VSs; VSs took the send and receive code from SSR; for DKU, we took the code almost verbatim from an earlier incarnation of these ideas and welded it into SSR; and we took VSs tasks and put them into SSR.  Thus the circle completes: VSs took from SSR, and now SSR takes from VSs.  Pieces and parts are being borrowed all over the place and welded in where they're needed.
 10.2417 + 
 10.2418 +Part of what makes this so easy is the dispatch pattern: adding a construct reduces to adding a case to a switch and writing a handler, so a construct can be borrowed by taking its handler from the other language.
 10.2419 +
 10.2420 +Another part is that the code for the constructs is isolated from concurrency details, which live inside the proto-runtime.  All the dynamic system issues, the best way to implement locks, the need for fences, and so on are isolated from the construct logic.  This isolation is also how porting effort is lowered (or in many cases eliminated), and how runtime performance is kept high.
 10.2421 +
 10.2424 +Performance must be high, or the labor savings don't matter.  By isolating the low-level details inside the proto-runtime, they can be intensively tuned, and all the languages then inherit that effort.  Compare that to current practice, where runtime code is monolithic: each language has to separately modify the runtime, understanding and dealing with the concurrency, and then on a new machine each language has to re-tune the low-level details, worrying about that machine's consistency model, how its particular fence and atomic instructions work, and so on.
 10.2425 +We spent two months performance-tuning the current version, but only 18 hours implementing VSs on top of it, and VSs inherited the benefit of all that effort.  So did VOMP, SSR, VCilk, and the rest: each time we improved the proto-runtime, all the languages improved, with no effort on the part of the language creator.
 10.2426 +
 10.2429 +In addition to runtime performance, application-level performance must be high.  The runtime's performance only affects overhead, and so is only a factor for small work-unit (task) sizes, but data affinity affects performance for all work.
 10.2430 +
 10.2431 +The proto-runtime approach partially addresses this by giving the language the opportunity to directly control the placement of work.  This isn't possible when building on top of threads, because scheduling happens in a separate, lower-level layer, where the assignment of work to cores is made in isolation, blind to language constructs and other application features.
 10.2433 +
 10.2434 +
 10.2435 +
 10.2436 +
 10.2437 +%%%%%%%%%%%%%%%%%%%%%
 10.2438 +\section{Measurements}
 10.2439 +
 10.2440 +\subsection{Implementation time}
 10.2441 +
 10.2442 +
 10.2443 +\subsection{Runtime and Application Performance}
 10.2444 +
 10.2445 +
 10.2446 +%%%%%%%%%%%%%%%%%%%%%
 10.2447 +\section{Related Work}
 10.2448 +
 10.2449 +
 10.2450 +%%%%%%%%%%%%%%%%%%%%%
 10.2451 +\section{Conclusion and Future Work}
 10.2452 +\label{sec:conclusion}
 10.2453 +
 10.2454 +
 10.2455 +
 10.2456 +\end{document} 
 10.2457 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 10.2458 +Here is an example of netlist creation:
 10.2459 +
 10.2460 +The circuit has two elements, each with one input port, one output port, and a single activity-type. The elements are cross-coupled, so output port of one connects to input port of the other.  The input port has the  activity-type attached as its trigger.  The activity is empty, and just sends a NULL message on the output port.  The activity's duration in simulated time and the resulting communication's flight duration in simulated time are both constants.
 10.2461 +
 10.2462 + Note that the HWSimElem data type is generic.  An elem is specialized by declaring in-ports and out-ports, and behavior is attached to an element by attaching activity types to its in-ports.
 10.2463 +
 10.2464 +To use HWSim, one creates a netlist composed of elements and communication paths connecting them.  An element has a number of in-ports and out-ports, and a communication path connects an out-port of the source element to an in-port of the destination element.  The in-port has an action attached, which in turn has a behavior function and a timing function, both triggered by the arrival of a communication.  The behavior function has the local persistent state of the element available to use, and can generate out-going communications.  The timing function calculates how much Guest (simulated) time the behavior spanned.  In addition, communication paths have an attached function that calculates the time from a communication being sent until its arrival.  Both the behavior and timing functions are provided by the application programmer.  The entire simulator application is composed of those three things: the netlist, the behavior functions, and the timing functions, and all are sequential code.
 10.2465 +
 10.2466 +The embedded DSL consists of standard data structures, such as netlist, inport, and outport, that the application must use in the language-defined way, and a small number of language calls, such as send\_comm and end\_behavior.  The advancement of simulated time is implied, and the parallelism is implied.  The only parallelism-related restriction is that a behavior function may only use data local to the element it is attached to.  If state in the hardware is shared, such as registers or memory, then other elements access that state by sending communications to the element that contains the state.  Parallelism is created within the system by out-ports that connect to multiple destination in-ports, and by behavior functions that each generate multiple output communications.
 10.2467 +
 10.2468 +First, here is the top-level function that creates and returns the netlist structure:
 10.2469 +
 10.2470 +
 10.2471 +\begin{small}\begin{verbatim}
 10.2472 +HWSimNetlist *
 10.2473 +createPingPongNetlist()
 10.2474 + { HWSimNetlist       *netlist;
 10.2475 +   HWSimElem         **elems;
 10.2476 +   HWSimActivityType **activityTypes;
 10.2477 +   HWSimCommPath     **commPaths;
 10.2478 +   int32               numElems, numActivityTypes, numCommPaths;
 10.2479 +\end{verbatim}\end{small}
 10.2480 +
 10.2481 +The first thing to do is create the netlist structure, which holds three things: element structs, activity type structs, and communication path structs. It also has two collections of pointers to the traces collected during the run, but these are handled internally by HWSim.
 10.2482 +\begin{small}\begin{verbatim}
 10.2483 +   netlist = malloc( sizeof(HWSimNetlist) );
 10.2484 + 
 10.2485 +   numElems         = 2; 
 10.2486 +   elems            = malloc( numElems * sizeof(HWSimElem *) );
 10.2487 + 
 10.2488 +   numCommPaths     = 2;
 10.2489 +   commPaths        = malloc( numCommPaths * sizeof(HWSimCommPath *) );
 10.2490 + 
 10.2491 +   numActivityTypes = 1;
 10.2492 +   activityTypes    = malloc( numActivityTypes * sizeof(HWSimActivityType *) );
 10.2493 +   
 10.2494 +   netlist->numElems         = numElems;
 10.2495 +   netlist->elems            = elems;
 10.2496 +   netlist->numCommPaths     = numCommPaths;
 10.2497 +   netlist->commPaths        = commPaths;
 10.2498 +   netlist->numActivityTypes = numActivityTypes;
 10.2499 +   netlist->activityTypes    = activityTypes;
 10.2500 +\end{verbatim}\end{small}
 10.2501 +
 10.2502 +Now, create the activity types.  During the run, an activity instance is created each time a communication arrives on an in-port. The activity instance is a data structure that points to the activity type.  The activity type holds the pointers to the behavior and timing functions.
 10.2503 +\begin{small}\begin{verbatim}
 10.2504 +      //have to create activity types before create elements
 10.2505 +      //PING_PONG_ACTIVITY is just a #define for readability
 10.2506 +   netlist->activityTypes[PING_PONG_ACTIVITY] = createPingPongActivityType();
 10.2507 +\end{verbatim}\end{small}
 10.2508 +
 10.2509 +Next, create the elements, and pass the netlist structure to the creator. It will take pointers to activity types out of the netlist and place them into the in-ports of the elements.
 10.2510 +\begin{small}\begin{verbatim}
 10.2511 +   elems[0] = createAPingPongElem( netlist ); //use activity types from netlist
 10.2512 +   elems[1] = createAPingPongElem( netlist ); 
 10.2513 +\end{verbatim}\end{small}
 10.2514 +
 10.2515 +Now, the reset in-port of one of the elements has to be set up to trigger an activity. Every element has a reset in-port, but normally they are set to NULL activity type. Here, we want only one of the two elements to have an activity triggered when the reset signal is sent to start the simulation.
 10.2516 +
 10.2517 +Note that during initialization, all the elements become active, each with its own timeline, but unless an activity is triggered in them they remain idle, with their timeline suspended and not making progress. Only ones that have an activity type attached to their reset in-port will begin to do something in simulated time when simulation starts.
 10.2518 +\begin{small}\begin{verbatim}   
 10.2519 +      //make reset trigger an action on one of the elements
 10.2520 +   elems[1]->inPorts[-1].triggeredActivityType =
 10.2521 +              netlist->activityTypes[PING_PONG_ACTIVITY];
 10.2522 +\end{verbatim}\end{small}
 10.2523 +
 10.2524 +Now, connect the elements together by creating commPath structures.  A comm path connects the out-port of one element to the in-port of another.  A given port may have many comm paths attached.  However, an in-port has only one activity type attached, and all incoming communications fire that same activity.  There are multiple kinds of activity, including kinds that have no timing and so can act as a dispatcher; these end themselves with a continuation activity, chosen according to the code in the behavior function.  So a commPath only ever connects an out-port to an in-port.
 10.2525 +
 10.2526 +This code sets fixed timing on the comm paths. It also uses a macro for setting the connections. The format is: sending elem-index, out-port, dest elem-index, in-port:
 10.2527 +\begin{small}\begin{verbatim}
 10.2528 +      //elem 0, out-port 0 to elem 1, in-port 0
 10.2529 +   commPaths[0]= malloc(sizeof(HWSimCommPath));
 10.2530 +   setCommPathValuesTo(commPaths[0],0,0,1,0);
 10.2531 +   commPaths[0]->hasFixedTiming  = TRUE;
 10.2532 +   commPaths[0]->fixedFlightTime = 10; //all time is stated in (integer) units
 10.2533 +
 10.2534 +      //elem 1, out-port 0 to elem 0, in-port 0
 10.2535 +   commPaths[1]= malloc(sizeof(HWSimCommPath));
 10.2536 +   setCommPathValuesTo(commPaths[1], 1,0,0,0);
 10.2537 +   commPaths[1]->hasFixedTiming  = TRUE;
 10.2538 +   commPaths[1]->fixedFlightTime = 10; //all time is stated in (integer) units
 10.2539 +\end{verbatim}\end{small}
 10.2540 +
 10.2541 +Having built the netlist, return it:
 10.2542 +\begin{small}\begin{verbatim}
 10.2543 +   return netlist;
 10.2544 + }
 10.2545 +\end{verbatim}\end{small}
 10.2546 +
 10.2547 +Here is the macro that sets the connections inside a comm path struct:
 10.2548 +\begin{small}\begin{verbatim}
 10.2549 +#define setCommPathValuesTo( commPath, fromElIdx, outPort, toElIdx, inPort)\
 10.2550 +do{\
 10.2551 +   commPath->idxOfFromElem     = fromElIdx; \
 10.2552 +   commPath->idxOfFromOutPort  = outPort; \
 10.2553 +   commPath->idxOfToElem       = toElIdx; \
 10.2554 +   commPath->idxOfToInPort     = inPort; \
 10.2555 + }while(0) //do-while(0) makes the macro act as a single statement,
 10.2556 +          //so the caller's trailing semicolon works even inside if/else
 10.2556 +\end{verbatim}\end{small}
 10.2557 +
 10.2558 +Creating an element involves creating arrays for the in-ports and out-ports, then configuring the in-ports.  The out-ports are automatically filled in during simulation start-up, by HWSim. The most interesting feature is that each in-port is assigned an activity type, which all arriving communications trigger. During the simulation, each incoming communication creates an activity instance, which points to this triggered activity type. The behavior and timing of the instance are calculated by the behavior and timing functions in the activity type. Notice that the activity type pointers are taken from the netlist, so they have to be created before creating the elements.
 10.2559 +\begin{small}\begin{verbatim}
 10.2560 +HWSimElem *
 10.2561 +createAPingPongElem( HWSimNetlist *netlist )
 10.2562 + { HWSimElem *elem;
 10.2563 +   elem = malloc( sizeof(HWSimElem) );
 10.2564 +   elem->numInPorts  = 1;
 10.2565 +   elem->numOutPorts = 1;
 10.2566 +   elem->inPorts = HWSim_ext__make_inPortsArray( elem->numInPorts );
 10.2567 +   elem->inPorts[-1].triggeredActivityType = IDLE_SPAN; //reset port
 10.2568 +   elem->inPorts[0].triggeredActivityType  = netlist->activityTypes[PING_PONG_ACTIVITY];
 10.2569 +        return elem;
 10.2570 + }
 10.2571 +\end{verbatim}\end{small}
 10.2572 +
 10.2573 +Creating an activity type involves setting the pointers to the behavior and timing functions, which are defined inside a separate directory where all the behavior and timing functions are defined. An activity may have behavior set to NULL, or timing set to NULL, and may have fixed timing.  The structure has flags to state the combination. 
 10.2574 +\begin{small}\begin{verbatim}
 10.2575 +HWSimActivityType *
 10.2576 +createPingPongActivityType( )
 10.2577 + { HWSimActivityType *pingPongActivityType;
 10.2578 +   pingPongActivityType = malloc( sizeof(HWSimActivityType) );
 10.2579 +   
 10.2580 +   pingPongActivityType->hasBehavior   = TRUE;
 10.2581 +   pingPongActivityType->hasTiming     = TRUE;
 10.2582 +   pingPongActivityType->timingIsFixed = TRUE;
 10.2583 +   pingPongActivityType->fixedTime     = 10;
 10.2584 +   pingPongActivityType->behaviorFn    = &pingPongElem_PingActivity_behavior;
 10.2585 +   return pingPongActivityType;
 10.2586 + }
 10.2587 +\end{verbatim} \end{small}
 10.2588 +
 10.2589 +
 10.2590 +\subsection{Behavior Functions}
 10.2591 +
 10.2592 +All behavior functions take a pointer to the activity instance whose behavior they are executing.  The instance contains a pointer to the elem, and most behaviors use the element's elemState field, which holds all the persistent state of the element that remains between activities.
 10.2593 +
 10.2594 +Here is the behavior function from the ping-pong example:
 10.2595 +\begin{small}\begin{verbatim} 
 10.2596 +void
 10.2597 +pingPongElem_PingActivity_behavior( HWSimActivityInst *activityInst )
 10.2598 + {    //NO_MSG is #define'd to NULL, and PORT0 to 0
 10.2599 +   HWSim__send_comm_on_port_then_idle( NO_MSG, PORT0, activityInst );
 10.2600 + }
 10.2601 +\end{verbatim}\end{small}
 10.2602 +
 10.2603 +There are four ways a behavior can end:
 10.2604 +\begin{description}
 10.2605 +\item end, no continuation: 
 10.2606 +\begin{small}\begin{verbatim} HWSim__end_activity_then_idle( HWSimActivityInst *endingActivityInstance )\end{verbatim}\end{small}
 10.2607 +\item end, with continuation: 
 10.2608 +\begin{small}\begin{verbatim} HWSim__end_activity_then_cont( HWSimActivityInst *endingActivityInstance,
 10.2609 +                                HWSimActivityType *continuationActivityType)\end{verbatim}\end{small}
 10.2610 +\item end by sending a communication, with no continuation: 
 10.2611 +\begin{small}\begin{verbatim} HWSim__send_comm_on_port_then_idle( void *msg, int32 outPort, 
 10.2612 +                                HWSimActivityInst *endingActivityInstance)\end{verbatim}\end{small}
 10.2613 +\item end by sending a communication, with continuation: 
 10.2614 +\begin{small}\begin{verbatim} HWSim__send_comm_on_port_then_cont( void *msg, int32 outPort, 
 10.2615 +                                HWSimActivityInst *endingActivityInstance,
 10.2616 +                                HWSimActivityType *continuationActivityType)\end{verbatim}\end{small}
 10.2617 +\end{description}
 10.2618 +
 10.2620 +
 10.2621 +
 10.2622 +\subsection{Activity Timing Functions}
 10.2623 +All activity timing functions take a pointer to the activity instance whose timing they are calculating.  The instance contains a pointer to the element the activity is in.  The behavior function is free to communicate with the timing function by leaving special data inside the element state.  The timing function might also simply depend on the current state of the element.
 10.2624 +
 10.2625 +Here's an example:
 10.2626 +\begin{small}\begin{verbatim} 
 10.2627 +HWSimTimeSpan
 10.2628 +sampleElem_sampleActivity_timing( HWSimActivityInst *activityInst )
 10.2629 + {
 10.2630 +   return doSomethingWithStateOfElem( activityInst->elem->elemState );
 10.2631 + }
 10.2632 +\end{verbatim}\end{small}
 10.2633 +
 10.2634 +\subsection{Calculating the time-in-flight of a communication path}
 10.2635 +
 10.2636 +The timing function for a communication path is similar to that of an activity, except that the timing might also depend on configuration data or state stored inside the comm path struct, so that struct is passed to the timing function as well.
 10.2637 +
 10.2638 +\begin{small}\begin{verbatim}
 10.2639 +HWSimTimeSpan
 10.2640 +commPath_TimeSpanCalc( HWSimCommPath *commPath, HWSimActivityInst *sendingActivity )
 10.2641 + { return doSomethingWithStateOfPathAndElem( commPath, sendingActivity->elem->elemState );
 10.2642 + }
 10.2643 +\end{verbatim}\end{small}
 10.2644 +
 10.2645 +
 10.2646 +
 10.2647 +
    11.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
    11.2 +++ b/0__Papers/PRT/PRT__intro_plus_eco_contrast/latex/Paper_Design_2.txt	Tue Sep 17 06:30:06 2013 -0700
    11.3 @@ -0,0 +1,32 @@
    11.4 +
    11.5 +======
    11.6 +
    11.7 +Details of VMS interface, details of its impl on multi-core, details of differences on different machines.
    11.8 +
    11.8 +The wrapper-lib calls a VMS-supplied primitive that suspends the virtual processor calling the lib and sends a request to VMS.  VMS calls a language-supplied plugin to handle requests -- this is the part of the scheduler that handles constraints -- it determines which virtual processors must remain suspended and which are free to be re-animated.
   11.10 +
   11.11 +The language is implemented either as a collection of wrapper-lib calls embedded into the base language, or as custom syntax that uses the VMS-supplied primitive to suspend virtual processors and send requests to VMS.
   11.12 +
   11.13 +
   11.14 +VMS is invisible to the application; only language constructs are visible.  From the application-programmer point of view, the embedded version looks like a function call, although the data structure of the virtual processor animating the code has to be passed as a parameter to the wrapper-lib call.
   11.15 +
   11.18 +The wrapper-lib call is standard library code that is loaded along with the application executable.
   11.19 +
   11.20 +However, the VMS primitives may be implemented in hardware, loaded as OS modules, or provided as dynamic or static libraries.  They are naturally custom instructions, but may be emulated in software.
   11.21 +
   11.22 +The interface between application-executable and language-runtime is the VMS-primitive that sends a request to VMS.  The language-runtime receives the request under control of VMS, which calls a language-supplied request-handling function and passes the request as a parameter.  This passive behavior of the request handler leaves control-flow inside VMS, which is part of hiding concurrency from the language-runtime implementation.
   11.23 +
   11.24 +The interface between the runtime and VMS is VMS's plugin API.  The runtime is implemented as two functions, whose pointers are handed to VMS.  VMS then controls the flow of execution.  When a request is ready for the runtime, VMS calls the request-handler function, and when a spot on hardware is free for work, VMS calls the scheduler-assign function.  Hence, the language implements its runtime as two isolated functions.  By keeping control-flow inside VMS, the language-specific portion of the runtime is simplified.
   11.25 +
   11.26 +This structure is also the reason VMS encourages reuse of scheduler code. The VMS API separates out control flow from scheduling, so scheduling code is isolated, with well-defined interfaces.  Scheduling is then further sub-divided into modules: constraint-management (IE enforcing dependencies); and choosing physical location to place work. Each has its own well-defined interface, and they communicate to each other via VMS-managed shared state.
   11.27 +
   11.28 +The greatest application-performance impacts due to the scheduler are the communication it causes, its management of the memory hierarchy, and the match between work characteristics and hardware characteristics (i.e., assigning to an accelerator vs a CPU).  Hence, significant work goes into implementing strategies and mechanisms for finding the best assignment choices.  Such implementations are only loosely coupled to the language, through the shared state by which the request-handler informs the assigner of what work is ready to be animated.
   11.31 +   
   11.32 +Hence, it is straightforward to reuse the code that assigns work to physical locations.  The only language-specific influence on the assigner is the shared constraint-state.
   11.33 +
   11.34 + 
   11.35 +
    12.1 Binary file 0__Papers/VMS/VMS__Foundation_Paper/VMS__Full_conference_version/figures/PR__timeline_dual_w_hidden.pdf has changed
    13.1 Binary file 0__Papers/transfer_figures_from_attachment/VMS_flat.png has changed
    14.1 Binary file 0__Papers/transfer_figures_from_attachment/VMS_nested.png has changed
    15.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
    15.2 +++ b/0__Papers/transfer_figures_from_attachment/VMS_numbers.txt	Tue Sep 17 06:30:06 2013 -0700
    15.3 @@ -0,0 +1,28 @@
    15.4 +
    15.5 +
    15.6 +These are measurements of total runtime for the h264 decoder running with both runtimes; I haven't found a good way to isolate the time spent in the runtime for the nanos runtime.
    15.7 +
    15.8 +The graphs compare total runtime across different task sizes, with nested tasks or with all tasks submitted by the master thread / seedVP (flat), for VMS and nanos.
    15.9 +
   15.10 +Green is elapsed wallclock time, red is user time.
   15.11 +
   15.12 +There's very little available parallelism, so there's a performance minimum in the middle around 9-10 blocks per task. After that the overhead nixes any additional parallelism you might gain from slicing tasks more finely.
   15.13 +
   15.14 +For VMS I also have the following measurements, for a run with nested tasks and 8 blocks per task:
   15.15 +
   15.19 +Total busy cycles/Total overhead/Percentage: 5910976399 / 1172314900 / 19.83 %
   15.20 +Avg overhead per unit: 36669
   15.21 +Critical path length:  1960539705 cycles
   15.22 +Overhead contribution to critical path:  237533850 cycles =  12.1157377937 %
   15.23 +Overhead breakdown along critical path:
   15.24 +Total overhead:         237533850 cycles     | 100 %
   15.25 +Request Handler:         21778888 cycles     | 9.17 %
   15.26 +Scheduler:                5580024 cycles     | 2.35 %
   15.27 +ReqHdlr to Scheduler:     1886857 cycles     | 0.79 %
   15.28 +Master to Work switch:    4069306 cycles     | 1.71 %
   15.29 +Work to Core switch:      3403127 cycles     | 1.43 %
   15.30 +Coreloop until Lock:      1868531 cycles     | 0.79 %
   15.31 +Lock Acquire:           198947117 cycles     | 83.76 %
    16.1 Binary file 0__Papers/transfer_figures_from_attachment/nanos_flat.png has changed
    17.1 Binary file 0__Papers/transfer_figures_from_attachment/nanos_nested.png has changed
    18.1 Binary file 1__Presentations/13__Jy_01__DSLDI/software_stack.png has changed
    19.1 Binary file 1__Presentations/13__Sp_08__DFM_workshop/Reo_plus_ProtoRuntime.odp has changed
    20.1 Binary file 1__Presentations/13__Sp_08__DFM_workshop/Reo_plus_ProtoRuntime.pdf has changed
    21.1 Binary file 1__Presentations/13__Sp_08__DFM_workshop/Reo_plus_ProtoRuntime.pot has changed