changeset 51:b42c7d2948ac

perf-tuning: partly through implementation re-write, just capturing a roll-back point
author Sean Halle <seanhalle@yahoo.com>
date Sun, 03 Jun 2012 10:54:02 -0700
parents 9f835e24cb7a
children 4433e4c6a758
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
diffstat 1 files changed, 184 insertions(+), 23 deletions(-) [+]
line diff
     1.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri Jun 01 19:45:35 2012 +0200
     1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Sun Jun 03 10:54:02 2012 -0700
     1.3 @@ -275,6 +275,7 @@
     1.4  % correct bad hyphenation here
     1.5  \hyphenation{op-tical net-works semi-conduc-tor}
     1.6  
     1.7 +\usepackage{wasysym}
     1.8  
     1.9  \begin{document}
    1.10  
    1.11 @@ -304,30 +305,37 @@
    1.12  
    1.13  
    1.14  \begin{abstract}
    1.15 -abstract here.
    1.16 -\end{abstract}
    1.17 +Performance tuning is an important aspect of parallel programming. After all, why go parallel, if not for higher performance? Yet existing tools leave much to be desired because they don't make the cause of performance loss evident. Only once the cause has been determined can a solution be formulated.
    1.18 +
    1.19 +We describe an approach that relies upon a new model of parallel computation to link performance loss to its cause, whether that be in the hardware, runtime, or application code. The visualizations produced clearly identify idle cores, and tie the idleness to causal interactions within the runtime and hardware, and from there to the parallelism constructs that constrained the runtime and hardware behavior.
    1.20 + 
    1.21 +
    1.22 +
    1.23 +This is implemented for multi-core hardware, and we walk through a tuning session on a large multi-core machine to illustrate how performance loss is identified and how hypotheses for the cause are generated. We also give a concise description of the implementation and the computation model. \end{abstract}
    1.24  
    1.25  
    1.26  \section{Introduction and Motivation}
    1.27  
    1.28 -Performance visualizations and tuning tools for parallel programs are critical to achieving good performance, and yet current solutions leave something to be desired. All too often, performance tuning consists of staring at columns of numbers, trying to \emph{guess} the cause of performance loss. Despite many different views of performance measurements, such as statistics by line of code, message sends and receives, and core usage timelines, the user doesn't know why a function runs in a particular spot on a particular core, nor whether that is desired behavior or erroneous behavior.
    1.29 +Performance visualizations and tuning tools for parallel programs are critical to achieving good performance, and yet current solutions leave something to be desired. All too often, performance tuning consists of staring at abstract views of usage or statistics, trying to \emph{guess} the cause of performance loss. Despite many different views of performance measurements, such as frequency by line of code, message sends and receives, and core usage timelines, the user doesn't know why a function runs in a particular spot on a particular core, nor whether that is desired behavior or erroneous behavior.
    1.30  
    1.31 -Examining these myriad views can feel like being advised by the fabled six blind men examining an elephant by touching the part nearest to each\ref{The poems of John Godfrey Saxe/The Blind Men and the Elephant}: one, reaching the side, proclaims it much like a wall; the second, feeling the tusk, compares it to a spear; the next, feeling the trunk, is reminded of a snake; the fourth, wrapping his arms around a leg, divines a tree; the fifth, chancing upon the ear, likens it to a fan; while the sixth, seizing the tail, insists that is is like a rope. All of these views are indeed correct and provide important information, but it is hard to reconstruct from them a view of the whole.
     1.32 +Examining these myriad views can feel like being advised by the fabled six blind men examining an elephant by touching the part nearest to each\footnote{John Godfrey Saxe, ``The Blind Men and the Elephant.''}: one, reaching the side, proclaims it much like a wall; the second, feeling the tusk, compares it to a spear; the next, feeling the trunk, is reminded of a snake, and so on. All of these views are indeed correct and provide important information, but they don't connect to provide a view of the whole, especially of the interactions between parts.
    1.33  
    1.34 -Current tools, although they cover all the parts of the application code, fail to adequately connect their observations to the runtime behavior, scheduling decisions, and consequent hardware behavior. The decision about which task or virtual processor is assigned to which core at what point in time is at the heart of parallel performance, so these choices need to be highlighted and connected to the application features influencing them.
    1.35 +Current tools may cover all the parts of the application code, but fail to adequately connect their observations to the runtime behavior, scheduling decisions, and consequent hardware behavior. The decision about which task or virtual processor is assigned to which core at what point in time is at the heart of parallel performance, so these choices need to be highlighted and connected to the application features influencing them.
    1.36  
    1.37 -To provide a more complete picture, what is needed is a model or framework that can capture the relevant application features (units of work and the constraints governing their execution), expose the scheduling decisions taken within the parameters allowed by the application and other constraints, and provide a structure to which to attach the measurements captured during execution, in order to be able to relate the measurements among each other and to the application.
    1.38 +To provide a more complete picture, what is needed is a model or framework that can capture the relevant application features (units of work and the constraints governing their execution), expose the scheduling decisions taken in accordance with the application and other constraints, and provide a structure to which to attach the measurements captured during execution. This is needed  to relate the measurements to each other and to the application.
    1.39  
    1.40 -Our contribution is such a model and framework. Using it, we generate views that indicate the schedulable units in the code, all the constraints on scheduling them, and connect them to application code. It includes constraints imposed by the application, the runtime implementation and hardware details. 
    1.41 +Our contribution is such a model and framework. Using it, we generate views that indicate the schedulable units in the code, and all the constraints on scheduling them. It includes constraints imposed by the application, the runtime implementation and hardware details. 
    1.42  
    1.43 -The connections made within the views enable more accurate hypotheses for the cause of performance losses. The views connect units to the specific segments of code that compose the units, and connect each constraint on scheduling choices to the precise source of the constraint, within the code, hardware, or runtime. Resource usage is separated into categories: application work, non-overlapped communication (which results from scheduling decisions), and scheduling/runtime overhead. 
    1.44 -For parameters that affect units and constraints, the effects of parameter choices are also shown.
    1.45 -In visual form, the integration and connections have the effect of making areas of performance loss evident, and the causes of the loss clear. 
     1.46 +The connections made within the views speed the generation of correct hypotheses for the cause of performance losses. The views connect units to the specific segments of code that compose the units, and connect each constraint on scheduling choices to the precise source of the constraint, within the code, hardware, or runtime. 
    1.47  
    1.48 -In the following, we will describe our model of computation, which drives the instrumentation and visualization, and links the aspects of performance to the collected performance tuning information.
    1.49 -The model and visualization are illustrated with a story line, which shows how they are used to performance tune a standard parallel application on a large multi-core system.
    1.50 +Computation-core usage is displayed in categories: application work, idleness due to non-overlapped communication (which results from scheduling decisions), and scheduling/runtime overhead. 
    1.51 +In addition, the influence of parameters that affect units and constraints can be deduced.
    1.52 +Visual features indicate causal connections between activities, allowing root causes  to be traced, and then linked to application code. 
    1.53  
    1.54 -We start with background on performance tuning and an overview of previous approaches in section \ref{related}. We shall introduce our model through a case study in section \ref{casestudy}, and then expand on the theory behind it in section \ref{theory}. Finally, we will conclude in section \ref{conclusion}.
    1.55 +In the following, we will describe our model of computation, which drives the instrumentation and visualization, and links the aspects of performance to the collected information.
     1.56 +Use of the model and visualization is illustrated with a story line of performance tuning a standard parallel application on a large multi-core system.
    1.57 +
    1.58 +We start with background on performance tuning and an overview of previous approaches in section \ref{related}. We shall introduce our model through a case study in section \ref{casestudy}, and then expand on the theory behind it in section \ref{theory}. Section \ref{sec:Implementation} will tie the model to implementation details. Finally, we will conclude in section \ref{conclusion}.
    1.59  
    1.60  
    1.61  \section{Background and Related Work}
    1.62 @@ -335,7 +343,7 @@
    1.63  
    1.64  Performance tuning is an iterative process. The programmer usually has some expectation of the performance of her program, and takes measurements during execution that are then compared to the desired outcome. A mental model, constructed through experience and knowledge of the mechanics of execution, is then used to generate a hypothesis explaining any discrepancies between the measurement and expectations. This hypothesis is then linked, again through a mental model, to things within the programmer's control, to suggest a change to make to the code. The modified code is then run again, and these steps are repeated until the programmer is satisfied with the performance of the program.
    1.65  
    1.66 -To understand the performance of parallel programs, more than just runtime measurements are important. Context for those measurements must also be gathered, which includes the characteristics of the units of work that are scheduled onto resources, as well as the constraints on scheduling those. It is also crucial to record during the run not only measurements of when the hardware was busy doing what, but also the scheduling decisions that effected this hardware usage.
    1.67 +To understand the performance of parallel programs, more than just runtime measurements are important. Context for those measurements must also be gathered, which includes the characteristics of the units of work that are scheduled onto resources, as well as the constraints on scheduling those. It is also crucial to record during the run not only measurements of when the hardware was busy doing what, but also the scheduling decisions that affected this hardware usage.
    1.68  
    1.69  Based on this pattern of performance tuning, we evaluate the advantages and shortcomings of different tools' representations of an application and its performance. The first four approaches are found, sometimes in combinations, in most classic performance evaluation tools, while the fifth is starting to emerge, enabled by the growing adoption of task-based programming languages in recent years.
    1.70  
    1.71 @@ -371,7 +379,9 @@
    1.72  
    1.73   These shortcomings are rectified by recent parallel languages such as CnC and StarSs, which cleanly identify tasks and the constraints on them. The people who develop such a language also develop language-specific tools to go with it. 
    1.74  
    1.75 -The StarSs tool tries to simplify the view for the user. It doesn't give performance information directly, but instead identifies tasks and tells the user whether it thinks the task size is too small, just right, or   -- instead, it has a recommended task size, which is between too small, which has too much overhead and too big, which has too few tasks to load balance. It makes task a one color if too short, another if just right, and a third if too long.
     1.76 +The StarSs tool tries to simplify the view for the user. It doesn't give performance information directly, but instead identifies tasks and tells the user whether it thinks the task size is too small, just right, or too big. Too small has too much runtime overhead, while too big yields too few tasks to keep the cores busy. It colors a task one way if too short, another if just right, and a third if too long.
    1.77 +
    1.78 +This is a step in the right direction, but it doesn't include effects of code choices on communication, and gives the programmer only limited information. Subtle and complex interactions between code, runtime, and hardware are often to blame for performance loss, and the limited information supplied here is no help with such complexities.
    1.79  
    1.80  
    1.81  \section{Illustrative Story of Performance Tuning}
    1.82 @@ -421,7 +431,7 @@
    1.83    %  {\includegraphics[scale=0.015]{../figures/194.pdf} 
    1.84    %  }\quad
    1.85    \end{minipage}
    1.86 -\caption{Performance tuning with Scheduling Consequence Graphs}
    1.87 +\caption{Performance tuning with Scheduling Consequence Graphs (all figures drawn with the same scale)  }
    1.88  \label{story}
    1.89  \end{figure*}
    1.90  
    1.91 @@ -443,7 +453,7 @@
    1.92  
    1.93  The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors (VPs). It has commands for creating and destroying VPs, and three kinds of send-receive paired operations. 
    1.94  
    1.95 -The first, \emph{send\_from\_to} specifies a specific sender and specific receiver. It is used by the results VP to tell the divider VP that he work is complete. The second, \emph{send\_of\_type\_to}, specifies a specific receiver, but the sender is anonymous, which increases flexibility while maintaining some control over scope. This  is used by the VPs doing the pieces to send their partial-result to the results processor. The third kind, \emph{send\_of\_type}, only specifies the type, and so acts as a global communication channel; this is not used in our application.
    1.96 +The first, \emph{send\_from\_to} specifies a specific sender and specific receiver. It is used by the results VP to tell the divider VP that the work is complete. The second, \emph{send\_of\_type\_to}, specifies a specific receiver, but the sender is anonymous, which increases flexibility while maintaining some control over scope. This  is used by the VPs doing the pieces to send their partial-result to the results processor. The third kind, \emph{send\_of\_type}, only specifies the type, and so acts as a global communication channel; this is not used in our application.
    1.97  
    1.98   The language also includes a \emph{singleton} construct that designates a piece of code as to be executed only once, which we use to  rearrange and copy data to get better cache behavior. A given copy is shared by several virtual processors on different cores, but the copy only needs to be performed once. 
    1.99  
   1.100 @@ -655,7 +665,7 @@
   1.101  
   1.102  A consequence graph ties together scheduling decisions made on units with the consequences in the hardware of those decisions. The goal is to charge each segment of time on a physical core to exactly one box in a consequence graph.
   1.103  
   1.104 -In the UCCs, for a higher up runtime, each lower runtime it schedules onto is treated as a machine. We saw in Fig X that a unit has an entire  UCC in the level below, so there is a corresponding consequence graph. The UCC states the degrees of scheduling freedom, while the consequence graph shows the hardware consequences resulting from the particular scheduling choices made in the runtime.
     1.105 +In the UCCs, for a higher-level runtime, each lower runtime it schedules onto is treated as a machine. We saw in Fig X that a unit has an entire UCC in the level below, so there is a corresponding consequence graph. The UCC states the degrees of scheduling freedom, while the consequence graph shows the hardware consequences resulting from the particular scheduling choices made in the runtime.
   1.106  
   1.107   Now, the question is, what hardware usage should get counted towards one of the units? The answer is: only the time spent on the cores that is used to schedule, do work, and wait for non-overlapped communication of that unit.
   1.108  
   1.109 @@ -666,18 +676,167 @@
   1.110  The last question is how to handle communication consequences. This is tricky because decisions in higher-level runtimes set the context for decisions in lower-level ones. This means a higher-level choice is linked to the consequences from lower-level choices. The value of a consequence graph is due to linking the size of boxes in it to the decisions made by the scheduler, as represented by the shape. It's not clear how to divide, among the levels, the time that cores spend waiting for non-overlapped communication. We have no good answer at the moment and leave it for future work.
   1.111  
   1.112  \section{Implementation}
   1.113 +\label{sec:Implementation}
   1.114 +
     1.115 +Now that the usage and theory are in hand, we give the details of the implementation. We attempt a bottom-up view, to provide an alternate path to understanding the model and visualizations, as well as a path to deriving the benefits directly from the details. 
   1.116 +
   1.117 +
   1.118 +This section will show the level of effort needed to implement our approach for a new language. In short, this involves inserting collection points into the runtime, then modifying the post-processing that produces the visualization.
   1.119 +
   1.120 +
   1.121 +
   1.122 +We cast the implementation in terms of the computation model, then identify the
   1.123 +points inside the runtime that correspond to points in the model. It is only in these spots that instrumentation gets inserted into runtime code. 
   1.124 +
   1.125 +One benefit seen directly from these details is that many things remain the same across languages, because the computation model is invariant across languages.  This includes the basic nature of the visualizations and details of the implementation process.
   1.126 +The reason is that visualizations are based on units, constraints, and causality, all of which exist in every language (even though some older ones obfuscate them).
   1.127 +
   1.128 +Another benefit evident from the details in this section is that the instrumentation is done only once, for a language. All applications written in the language inherit the visualizations, without any change to the application code.
   1.129 +
   1.130 +\subsection{Meta-units and unit life-line in the computation model}
   1.131 +
     1.132 +In preparation for mapping the model onto implementation details, we define the meta-unit and the unit life-line. These form the basis for deciding the points in the runtime at which data is collected.
   1.133 +
   1.134 +Every unit has a meta-unit that represents it in the runtime. A  unit is defined as the trace of application code that exists between two scheduling decisions. Looking at this in more detail, every runtime has some form of internal bookkeeping state for a unit, used to track constraints on it and make decisions about when and where to execute. This exists even if that state is just a pointer to a function that sits in a queue. We call this bookkeeping state for a unit the meta-unit.
   1.135 +
     1.136 +Each unit has a life-line, which progresses so: creation of the meta-unit \pointer\ state updates that affect constraints on the unit \pointer\ the decision is made to animate the unit \pointer\ movement of the meta-unit plus data to the physical resources that do the animation \pointer\ animation of the unit, which does the work \pointer\ communication of the state update that the unit has completed and the hardware is free \pointer\ constraint updates within the runtime, possibly causing new meta-unit creations or freeing other meta-units to be chosen for animation. This repeats for each unit. Each step is part of the model.
   1.137 +
     1.138 + Note a few implications: first, many activities internal to the runtime are part of a unit's life-line, and take place when only the meta-unit exists, happening before or after the work of the actual unit; second, communication that is internal to the runtime, such as state updates, is part of the unit life-line; third, creation may be implied, as in pthreads, triggered, as in dataflow, or by explicit command, as in StarSs, and once created, a meta-unit may languish before the unit it represents is free to be animated.
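The life-line stages above can be sketched as runtime bookkeeping state. This is a minimal hypothetical sketch, not VMS's actual data structures: the names `MetaUnit`, `LifeStage`, and `advance` are assumptions, and real runtimes may merge or reorder stages.

```c
#include <assert.h>

/* Hypothetical sketch: one enum value per life-line stage named in the text. */
typedef enum {
    STAGE_CREATED,       /* meta-unit created in the runtime              */
    STAGE_CONSTRAINED,   /* state updates affecting constraints observed  */
    STAGE_CHOSEN,        /* decision made to animate the unit             */
    STAGE_MOVED,         /* meta-unit plus data moved to the resources    */
    STAGE_ANIMATED,      /* the unit's work is executing                  */
    STAGE_COMPLETED,     /* completion communicated, hardware free        */
    NUM_STAGES
} LifeStage;

/* The bookkeeping state the runtime keeps for a unit: the meta-unit. */
typedef struct {
    int       unit_id;
    LifeStage stage;
    double    stamp[NUM_STAGES];   /* when each stage was entered */
} MetaUnit;

/* Advance the life-line one stage, recording the transition time;
   these transitions are exactly where instrumentation collects data. */
static void advance(MetaUnit *mu, double now) {
    assert(mu->stage + 1 < NUM_STAGES);
    mu->stage += 1;
    mu->stamp[mu->stage] = now;
}
```

Recording a timestamp at each stage boundary is what later lets each segment of core time be charged to exactly one category.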
   1.139 +
   1.140 +\subsection{Mapping model onto implementation details in runtime}
   1.141 +
   1.142 +The meta-unit and life-line aspects of the computation model map straight-forwardly to the UCC visualization. The constraints in the UCC are those stated in or implied by the application (with the complexities noted in Section \ref{theory}). 
   1.143 +
     1.144 +However, the CG is not a strict expression of the model; rather, its purpose is practical. It shows usage of the cores, and relates that to the quantities in the model. Hence, the measurements for the CG are all boundaries, where the core's time switches from one category in the model to a different one.
   1.145 +
   1.146 +This differs from the model in subtle ways. Most notably, the model declares segments of time where communications take place, while the CG doesn't measure the communication time, rather it captures idleness of the core caused by the non-overlapped portion of that communication. 
   1.147 +
     1.148 +The reason for the difference is that the CG is intended to highlight the quantity of most interest to application coders, which is accounting for core usage and assigning each idle period to a cause. The assignments of units to cores determine the source and destination of communications. Hence, non-overlapped communication idle periods are consequences of the assignment choices made by the scheduler. This, by the way, leads to the name: scheduling consequence graph.  
   1.149 +
   1.150 + 
   1.151 +
   1.152 +What must be collected during the run differs between the two types of visualization. For the UCC it is unit boundaries and the constraints related to each unit.
     1.153 +For the CG, the same units must be collected, but also the time a core spends on each segment of the unit's life-line. However, implementation details of the runtime may cause activities such as idling the core during lock acquisition to be counted as time spent on one of a unit's life segments. Which core activities go to which life segments changes from runtime to runtime. 
   1.154 +
     1.155 +The CG also represents each cause of a transition from one segment of core usage to another. Such a cause is always a causal dependency of some kind, even for those that correspond to complex constraint constructs in the application. These causations are collected and tied to one of: construct constraint, runtime-internal constraint, or hardware constraint. In this paper, all are collected, but the only causations visualized are construct constraints that cross cores, with the propendent on one core and its dependent on another.
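A CG collection record following this description might look like the sketch below. This is an illustration under stated assumptions, not the paper's actual format: the type names, the three usage categories, and the `causing_unit` field are all hypothetical.

```c
#include <assert.h>

/* Hypothetical sketch of one CG record: a segment of core time charged
   to exactly one category, plus the cause of entering that segment. */
typedef enum { CAT_WORK, CAT_COMM_IDLE, CAT_RUNTIME_OVERHEAD } UsageCat;
typedef enum { CAUSE_NONE, CAUSE_CONSTRUCT, CAUSE_RUNTIME, CAUSE_HARDWARE } CauseKind;

typedef struct {
    int       core;
    int       unit_id;
    UsageCat  category;      /* what this slice of core time is charged to   */
    double    start, end;    /* the boundaries measured during the run       */
    CauseKind cause;         /* kind of causal dependency for the transition */
    int       causing_unit;  /* propendent unit, for cross-core arrows       */
} CgEvent;

/* Time charged by this record; summing per core accounts for all usage. */
static double charged(const CgEvent *e) { return e->end - e->start; }
```

Only records with `cause == CAUSE_CONSTRUCT` whose `causing_unit` ran on a different core would be drawn as causation lines in the visualization described here.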
   1.156 +
   1.157 +\subsection{Instrumenting our implementation of SSR on top of VMS}
   1.158 +
   1.159 +We used a version of SSR implemented on top of a proto-runtime system called VMS.  This proto-runtime embodies most of a runtime implementation, but has replaced two key portions with interfaces. Those portions are the handling of language construct constraints and the decision of which core to assign a unit to. To implement a language, one simply supplies those two portions of code, via the interface.
   1.160 +
   1.161 +VMS also has the advantage for our approach of being written in accordance with the computation model, which makes instrumenting it especially convenient. 
   1.162 +
   1.163 +Collecting a unit: A unit begins by transitioning out of the runtime to start the unit's code, and ends by leaving the unit's code back to the runtime. These transitions both have a convenient spot in VMS at which to insert the code that records the units. 
   1.164 +
   1.165 +Collecting the constraints for UCC: In VMS, each language construct has its own handler. Each handler has code inserted to record which unit invoked the construct. This is how the constraints on units are collected for the UCC.
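The two collection spots just described can be pictured as hooks at the runtime's transition points. This is a hypothetical sketch: the hook names, the toy event encoding, and the flat trace buffer are assumptions for illustration, not VMS's interface.

```c
#include <assert.h>

/* Hypothetical instrumentation hooks at the transition points the text
   identifies. Events are appended in order to a trace buffer. */
enum { TRACE_CAP = 64 };
static int trace[TRACE_CAP];
static int trace_len = 0;

static void record_event(int tag) {
    if (trace_len < TRACE_CAP)        /* drop rather than overflow */
        trace[trace_len++] = tag;
}

/* Called where the runtime hands the core to the unit's code. */
static void on_unit_start(int unit_id) { record_event(+unit_id); }

/* Called where the unit's code returns control to the runtime. */
static void on_unit_end(int unit_id)   { record_event(-unit_id); }

/* Each language-construct handler records which unit invoked it;
   this is how the constraints on units are collected for the UCC. */
static void on_construct(int unit_id, int construct_id) {
    record_event(1000 * construct_id + unit_id);   /* toy encoding */
}
```

Because every construct in VMS already has its own handler, inserting `on_construct`-style calls is localized: one line per handler, with no change to application code.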
   1.166 +
   1.167 +Those two make up all that is needed to gather the UCC. For the CG, 
   1.168 +
   1.169 +===============  
   1.170 +
   1.171 +(This is not true:) plus each point that state update directly affects    the life-state of the unit. Such state updates include life-state of other units involved in constraints with the target unit (because they may cause a change in target unit's life-state), updates of hardware state that directly cause a change in target unit's life state (such as hardware becoming free coupled to the scheduler choosing the target unit), and lastly any internal runtime state changes related to the unit (such as idling until a lock is acquired). The language provides construct-based constraints, while the runtime and hardware provide additional ones.
   1.172 +
   1.173 +===============?
   1.174 +
   1.175 +\subsection{Elements}
     1.176 +The process of instrumenting is straightforward.
   1.177 +
   1.178 +?
   1.179 +
   1.180 +-] Follow details up to visualization
   1.181 +
   1.182 +
   1.183 +
   1.184 + 
   1.185 +-] Highlight Benefit: Main effort is just inserting collection points into runtime, and if using VMS, this is especially simple and straight-forward.
   1.186 +
   1.187 +-] Effort for implementing in new language: inserting collection points into runtime, and modifying portions of CG script that connect the blocks, and modifying portions of UCC script that draw lines connecting units.
   1.188 +
   1.189 +\subsection{Elements}
   1.190 +
   1.191 +The main elements of the approach are: collection points, the output from them, and the post-processing that draws the visualizations.
   1.192 +
   1.193 +\subsubsection{Collection points}
   1.194 +
   1.195 +
   1.196 +There are two distinct kinds of collection points: those for the UCC and those for the CG.?
   1.197 +
   1.198 +The UCC requires 
   1.199 +
   1.200 +?
   1.201 +
   1.202 +
   1.203 +Collection Points:  -- this is for bottom-up and for skeptic, plus implementer
   1.204 +
   1.205 +
   1.206 +-] For UCC, these are points in life-line of unit being executed where collect
   1.207 +
   1.208 +-] For CG, these are points in life-line of unit being executed where collect
   1.209 +
   1.210 +-] For CG, these are the points inside runtime that get collected
   1.211 +
   1.212 +
   1.213 +
   1.214 +
   1.215 +
   1.216 +-] for SSR, for CG, put this collection at this point <blah blah>.. again, see VMS for more on the points..
   1.217 +
   1.218 +?
   1.219 +
     1.220 +-] Highlight Benefit: the model is independent of language -- it is based on the life-line of a unit. The concept of a unit goes across languages, and they all have a life-line. The visualization is in terms of units and constraints, which are universal, so it remains the same across languages. The only changes are which construct a line in the CG belongs to, and the UCC shape of constraint constructs. 
   1.221 +
   1.222 +?
   1.223 +
   1.224 +Output format:   for implementer
   1.225 +
   1.226 +-] The collection is made into a ListOfArrays, and dumped out in-order into a file
   1.227 +
   1.228 +-] Each point in file has: <info>
   1.229 +
   1.230 +Scripts:     for implementer
   1.231 +
   1.232 +-] One that does X
   1.233 +
   1.234 +-] one generates SVG of UCC
   1.235 +
   1.236 +--] arranged so:
   1.237 +
   1.238 +-] one generates SVG of CG
   1.239 +
   1.240 +To adapt to a new language:    for implementer
   1.241 +
   1.242 +-] If use VMS, then only collection points that change are in request handlers
   1.243 +
   1.244 +-] portion of script to modify is X
   1.245 +
   1.246 +-] script has X lines, only Y are specific to a language
   1.247 +
   1.248 +
   1.249 +Notice, all work is inside runtime and post-processing scripts
   1.250 +
   1.251 +Notice, the process for a new language is the same, indep of particular exe model or constructs of the language -- it comes down to the same points on the lifeline of a unit, all that changes is the pattern of constraints implied by a particular construct.
   1.252 +
   1.253 +
   1.254 +
   1.255 +==================
   1.256  
   1.257  The visualization relies on data collected from the runtime during execution. There are two kinds of information that need to be recorded: identification of units and constraints, and execution metric measurements.
   1.258  The first can be obtained from the language runtime at the places where constraints are checked and modified and units are created. The second have to be recorded as the unit progresses through different stages of execution.
   1.259  
   1.260  \subsection{Units and Constraints}
   1.261  
   1.262 -As units are defined by scheduling decisions, the creation of a unit is easiest to register at the point where the unit is assigned to a processing element. This ensures that all units that are executed are recorded, and all units that are recorded are really executed. There is no significant variation in this between languages, and the units are the same for the concrete UCC and the SCG. If the language captures sufficient information to reconstruct an abstract UCC (possibly only for simple types of UCC), this information can also be captured, allowing more general analysis.
   1.263 -Language constructs specify constraints on units. The connections between units can be very complex depending on the language, so the instrumentation needs to be tailored to the constructs.
   1.264 +As units are defined by scheduling decisions, the creation of a unit is easiest to register at the point where the unit is assigned to a processing element. This ensures that all units that are executed are recorded, and conversely, all units that are recorded are really executed. 
   1.265  
   1.266 -Using the concrete case of SSR, we will show the correspondence between the language's constructs and the elements of the UCC and the SCG.
   1.267 +There is no significant variation in this between languages, and the units are the same for the concrete UCC and the SCG. 
   1.268 +Language constructs specify constraints on units, where the connections between units can be very complex. Hence the instrumentation needs to be tailored to the constructs.
   1.269  
   1.270 -SSR, like threads, is based on the concept of virtual processors (VPs). These VPs execute sequential code that occasionally makes use of parallel language constructs, which will suspend the VP and switch to the runtime. The runtime checks the constraints defined by the language, performs some operations (e.g. transmitting a message) if required, and then calls the scheduler to decide upon the next VP to be run. With our definition of units as bounded by scheduling decisions, this means that each VP contains a sequence of units, with each unit the trace-segment between two SSR library calls. There is always a ``dependency'' (constraint that only frees the dependent after the propendent finished) between two successive tasks inside a same VP, since they are part of a continuous sequential code. This is a deterministic constraint, so it is the same in the UCC and the SCG. For non-deterministic constraints, all possibilities need to be added to the UCC, whereas only the actually realized causal dependencies are used to build the SCG.
   1.271 +Using the concrete case of SSR, we show the correspondence between the language's constructs and the elements of the UCC and the SCG.
   1.272 +
   1.273 +SSR, like threads, is based on the concept of virtual processors (VPs). These VPs execute sequential code that occasionally makes use of parallel language constructs, which suspend the VP and switch to the runtime.
   1.274 +
   1.275 + The runtime checks the constraints defined by the language, performs some operations (e.g. transmitting a message) if required, and then calls the scheduler to decide upon the next VP to be run. With our definition of units as bounded by scheduling decisions, this means that each VP contains a sequence of units, with each unit the trace-segment between two SSR library calls. 
   1.276 +
     1.277 +There is always a ``dependency'' (a constraint that only frees the dependent after the propendent finishes) between two successive tasks inside the same VP, since they are part of a continuous sequential code. This is a deterministic constraint, so it is the same in the UCC and the SCG. For non-deterministic constraints, such as singleton and send\_of\_type, all possibilities need to be added to the UCC. But only the actually realized causal dependencies are used in the SCG.
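The UCC-versus-SCG distinction for non-deterministic constraints can be made concrete with a small sketch. The edge representation and function names here are hypothetical, assumed only for illustration: for an anonymous-sender construct, every possible propendent yields a UCC edge, while the SCG keeps only the dependency that actually fired.

```c
#include <assert.h>

/* Hypothetical edge representation for both graphs. */
enum { MAX_EDGES = 16 };
typedef struct { int from, to; } Edge;

/* UCC: record an edge for every possible propendent of unit `to`. */
static int add_ucc_edges(const int *possible, int n, int to, Edge *ucc) {
    for (int i = 0; i < n; i++) {
        ucc[i].from = possible[i];
        ucc[i].to   = to;
    }
    return n;                       /* all possibilities recorded */
}

/* SCG: record only the causal dependency realized during the run. */
static int add_scg_edge(int realized_from, int to, Edge *scg) {
    scg[0].from = realized_from;
    scg[0].to   = to;
    return 1;
}
```

Deterministic constraints, such as the sequence within one VP, would call both functions with the same single propendent, which is why they appear identically in the UCC and the SCG.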
   1.278  
   1.279  Additional constraints can be placed on the tasks depending on the construct:
   1.280  \begin{itemize}
   1.281 @@ -725,7 +884,7 @@
   1.282  
   1.283  \section{Conclusion}
   1.284  \label{conclusion}
   1.285 -We have shown how to apply a computation model to instrument a language runtime for collecting measurements that connect: each to others, to application structure, to scheduling decisions, and to hardware. A simple visualization of the data has features that indicate lost performance, and features that visually link the lost performance to the cause, no matter if the cause is application structure, language runtime implementation, or hardware feature.  It is this linkage, due to the computation model, that sets this approach apart from others. 
   1.286 +We have shown how to apply a computation model to instrument a language runtime for collecting measurements that connect: each measurement to others, to application structure, to scheduling decisions, and to hardware. A simple visualization of the data has features that indicate lost performance, and features that visually link the lost performance to the cause, no matter if the cause is application structure, language runtime implementation, or hardware feature.  It is this linkage, due to the computation model, that sets this approach apart from others. 
   1.287   
   1.288  
   1.289  \bibliography{BibForPapers}
   1.290 @@ -982,3 +1141,5 @@
   1.291  
   1.292   destroying virtutal processors, and three kinds of send-receive pairs. The 
   1.293   graph. It depicts all the scheduling operations performed by the runtime, 
   1.294 +
   1.295 +