changeset 70:23e3c832b31a

perf tune -- rewrite abstract and levels of SCG
author Nina Engelhardt <nengel@mailbox.tu-berlin.de>
date Thu, 02 Aug 2012 18:59:47 +0200
parents 0e9165cb2c52
children 8ad64c3d3da1
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 0__Papers/Holistic_Model/Perf_Tune/latex/bib_for_papers_12_Jy_15.bib
diffstat 3 files changed, 54 insertions(+), 44 deletions(-) [+]
line diff
     1.1 Binary file 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf has changed
     2.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Sun Jul 15 00:17:30 2012 -0700
     2.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Thu Aug 02 18:59:47 2012 +0200
     2.3 @@ -68,9 +68,11 @@
     2.4  
     2.5  
     2.6  \begin{abstract}
     2.7 -Performance tuning is an important aspect of parallel programming. After all, why go parallel, if not for higher performance? Yet existing tools leave much to be desired because they don't make the cause of performance loss evident. Only once the cause has been determined can a solution be formulated.
     2.8 +Performance tuning is an important aspect of parallel programming. After all, why go parallel, if not for higher performance? Yet when trying to track down the causes of performance loss, a great deal of knowledge about the internal structure of both the application and the runtime is often needed to understand how the observed patterns of performance have come to pass.
     2.9  
    2.10 -We describe an approach that relies upon a new model of parallel computation to link performance loss to its cause, whether that be in the hardware, runtime, or application code. The visualizations produced clearly identify idle cores, and tie the idleness to causal interactions within the runtime and hardware, and from there to the parallelism constructs that constrained the runtime and hardware behavior.
    2.11 +The trend in parallel programming languages has been towards models that capture more structural information about the application, in an effort to increase both performance and ease of programming. We believe that this information can be used to improve performance tuning tools by making the causes of performance loss more readily apparent.
    2.12 +
    2.13 +We propose a universal but adaptable way of integrating more application structure into performance visualizations, relying on a model of parallel computation. The visualizations produced clearly identify idle cores, and tie the idleness to causal interactions within the runtime and hardware, and from there to the parallelism constructs that constrained the runtime and hardware behavior, thereby eliminating guesswork.
    2.14   
    2.15  This is implemented for multi-core hardware, and we walk through a tuning session on a large multi-core machine to illustrate how performance loss is identified and how hypotheses for the cause are generated. We also give a concise description of the implementation and the computation model. \end{abstract}
    2.16  
    2.17 @@ -84,7 +86,7 @@
    2.18  
    2.19  Performance visualizations and tuning tools for parallel programs are critical to achieving good performance, and yet current solutions leave something to be desired. All too often, performance tuning consists of staring at abstract views of usage or statistics, trying to \emph{guess} the cause of performance loss. Despite many different views of performance measurements, such as frequency by line of code, message sends and receives, and core usage timelines, the user doesn't know why a function runs in a particular spot on a particular core, nor whether that is desired behavior or erroneous behavior.
    2.20  
    2.21 -Examining these myriad views can feel like being advised by the fabled six blind men examining an elephant by touching the part nearest to each\ref{The poems of John Godfrey Saxe/The Blind Men and the Elephant}: one, reaching the side, says it is much like a wall; the second, feeling the tusk, compares it to a spear; the next, feeling the trunk, is reminded of a snake, and so on. All of these views are indeed correct and supply important information, but they don't connect to each other, to provide a view of how the parts interact to form a whole.
     2.22 +Examining these myriad views can feel like being advised by the fabled six blind men examining an elephant by touching the part nearest to each~\cite{PerfToolPoem}: one, reaching the side, says it is like a wall; the second, feeling the tusk, compares it to a spear; the next, touching the trunk, is reminded of a snake, and so on. All of these views are indeed correct and supply important information, but they don't connect to each other, to provide a view of how the parts interact to form a whole.
    2.23  
    2.24  Current tools may cover all the parts of the application code, but fail to adequately connect their observations to the runtime behavior, scheduling decisions, and consequent hardware behavior. The decision about which task or virtual processor is assigned to which core at what point in time is at the heart of parallel performance, so these choices need to be highlighted and connected to the application features influencing them.
    2.25  
    2.26 @@ -104,7 +106,7 @@
    2.27  \section{Background and Related Work}
    2.28  \label{sec:related}
    2.29  
    2.30 -A quick review of the process of performance tuning will provide much needed context for the discussion of other tools. %too negative
    2.31 +%A quick review of the process of performance tuning will provide much needed context for the discussion of other tools.
    2.32  
    2.33   Performance tuning is an iterative process that involves a mental model. The programmer takes measurements during execution that are then compared to the desired outcome. A mental model, constructed through experience and knowledge of the mechanics of execution, is used to generate a hypothesis explaining any discrepancies between the measurement and expectations. This hypothesis is then linked, again through a mental model, to things within the programmer's control, to suggest a change to make to the code. The modified code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central to performance tuning.
    2.34  
    2.35 @@ -120,7 +122,9 @@
    2.36  \subsection{Thread-model based Approaches}
    2.37  Most of the early, more established, tools come from the threads world, and conceive of the application as a collection of virtual processors (threads) that perform actions, but don't include the concept of application-defined tasks nor constraints on them. This makes the tools unable to directly connect statistics they gather to scheduling choices and application features.  That lack of connection forces the user to guess at what aspect of the code is responsible for observed performance.
    2.38  
    2.39 -Tau \cite{PerfToolTau}   represents the thread-centric approach well. It integrates many data sources, and has rich displays. However it models cores and memories, and thread contexts, with actions taken on or by each. Because it had no well defined concept of unit of work, constraints on them, and scheduling choices, it was unable to help the user form hypotheses for the parallelism-specific causes of poor performance.
     2.40 +Tau \cite{PerfToolTau} represents the thread-centric approach well. It integrates many data sources, and has rich displays. However, it models cores, memories, and thread contexts, with actions taken on or by each. This means that it is entirely up to the user to understand both application and runtime in enough detail to recognize patterns in the observed performance and generate hypotheses about which constructs are implicated.
    2.41 +
    2.42 +%Because it had no well defined concept of unit of work, constraints on them, and scheduling choices, it was unable to help the user form hypotheses for the parallelism-specific causes of poor performance.
    2.43  
    2.44  \subsection{Event-centric approach}
    2.45  
    2.46 @@ -135,16 +139,16 @@
    2.47  %Note that Paraver and Vampir are just painting tools that take event measurements and paint them on the screen.
    2.48  
    2.49  \subsection{Performance-counter approaches}
    2.50 -Performance-counter approaches, such as VTune, POPI, and so on make up the fourth category. They concentrate on identifying hot-spots and potential false-sharing, using the hardware itself as the execution model. Thus, they have no concept of unit  nor scheduling event nor even runtime. As such,they do a good job of saying that something might be wrong, but don't help in pointing to what is causing the problem, and hence leave the user with no help determining what to change in their code to get better performance, with the exception of changing data structures to fix false sharing.
     2.51 +Performance-counter approaches, such as VTune, POPI, and so on, make up the fourth category. They concentrate on identifying hot-spots and potential false-sharing, using the hardware itself as the execution model. Thus, they have no concept of unit, scheduling event, or even runtime. As such, they do a good job of saying that something might be wrong, but don't help in pointing to what is causing the problem, and hence leave the user with no help determining what to change in their code to get better performance.
    2.52  
    2.53  \subsection{Newer approaches}
    2.54   
    2.55  
     2.56   Recent parallel languages have begun exposing units and constraints on them, for example  CnC\cite{CnCInHotPar}  and StarSs \cite{StarSs}.  This provides the tools with the chance to link measurements to a model of parallel computation, such as the one we propose. The user can then link such measurements to the units defined in their code, and form hypotheses about what to change to improve performance. 
    2.57  
    2.58 -As an example, the performance tuning tool \cite{PerfToolStarSs} that comes with StarSs provides a  view of the code-defined units to the user. For a specific machine, it tells the user whether it thinks the task size is too small, just right, or too big. Too small has too much runtime overhead, while too big has too few tasks to keep the cores busy.  This view simplifies things for the user by providing the hypothesis of what's causing poor performance and directly linking it  to codde.
     2.59 +As an example, the performance tuning tool \cite{PerfToolStarSs} that comes with StarSs provides the user a view of the code-defined units. For a specific machine, it tells the user whether it thinks the task size is too small, just right, or too big. Too small has too much runtime overhead, while too big has too few tasks to keep the cores busy. This view simplifies things for the user by providing the hypothesis of what's causing poor performance and directly linking it to code.
    2.60  
    2.61 -This is a step in the right direction, but it gives the programmer only limited information. Subtle and complex interactions between code, runtime, and hardware are often to blame for performance loss, and the limited information supplied here doesn't help with such complexities. Also, the tool is language specific, and so doesn't apply to applications in other languages.
    2.62 +This is a step in the right direction, but it gives the programmer only limited information. Subtle and complex interactions between code, runtime, and hardware are often to blame for performance loss, and the limited information supplied here doesn't help with such complexities. Moreover, the view is language specific, and so cannot apply to applications in other languages.
    2.63  
    2.64  
    2.65  
    2.66 @@ -166,16 +170,16 @@
    2.67  
    2.68  The application code uses language features to create virtual processors (VP). The first VP created divides the work into a number of pieces and creates a new VP for each piece. 
    2.69  
    2.70 -How many pieces is set by a tuning parameter in the code, and the number of cores. The application uses language features to distribute the VPs  across the cores,  in a round-robin fashion (before tuning).
     2.71 +The number of pieces is set by a tuning parameter in the code, together with the number of cores. The application uses language features to distribute the VPs across the cores, in a round-robin fashion unless otherwise specified.
    2.72  
    2.73 -It then creates   a results VP that receives a partial-result from each piece and accumulates the results. The  divider VP  then waits for the results VP to indicate completion, after which the language runtime shuts down.
    2.74 +It then creates a results VP that receives a partial-result from each piece and accumulates the results. The  divider VP  then waits for the results VP to indicate completion, after which the language runtime shuts down.
    2.75  
    2.76  \subsection{The language}
    2.77 -The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors (VPs). It has commands for creating and destroying VPs, and three kinds of send-receive paired operations. 
    2.78 +The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors (VPs), which are more commonly known as `threads'. It has commands for creating and destroying VPs, and three kinds of send-receive paired operations. 
    2.79  
    2.80  The first, \emph{send\_from\_to} specifies both sender and receiver VPs. It is used by the results VP to tell the divider VP that the work is complete. The second, \emph{send\_of\_type\_to}, specifies only a specific receiver, leaving the sender  anonymous, which increases flexibility while maintaining some control over scope. This  is used by the worker VPs doing the pieces to send their partial-result to the results processor. The third kind, \emph{send\_of\_type}, only specifies the type, and so acts as a global communication channel; this is not used in our application.
    2.81  
    2.82 - The language also includes a \emph{singleton} construct that designates a piece of code as to be executed only once, which we use to  rearrange and copy data to get better cache behavior. A given copy is shared by several virtual processors on different cores, but the copy is only to be performed once. 
    2.83 +The language also includes a \emph{singleton} construct that designates a piece of code as to be executed only once, which we use to  rearrange and copy data to get better cache behavior. A given copy is shared by several virtual processors on different cores, but the copy is only to be performed once. 
    2.84  
    2.85  Miscellaneous performance constructs are also available, such as one to force which core a virtual processor is assigned to. We use this in our example program to control scheduling.
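To make the constructs just described concrete, here is a minimal C-style sketch of a worker VP; every identifier in it is a hypothetical stand-in for the SSR constructs named above, not the actual SSR API.

    /* Illustrative sketch only: all names below are hypothetical stand-ins
     * for the SSR constructs described in the text; the real API differs. */
    typedef struct VP VP;                       /* virtual processor handle */
    typedef struct { int copy_id; double part; VP *results_vp; } Piece;

    enum { PARTIAL_RESULT };                    /* message type tag */

    /* assumed SSR-style primitives (declarations only) */
    VP   *ssr_create_vp_on_core(void (*fn)(void *), void *arg, int core);
    void  ssr_send_of_type_to(void *msg, int type, VP *recv); /* anonymous sender */
    void  ssr_send_from_to(void *msg, VP *from, VP *to);      /* both ends named  */
    void  ssr_singleton(int id, void (*once)(void *), void *arg);

    void rearrange_copy(void *piece);           /* assumed helpers */
    void do_work(Piece *piece);

    void worker(void *arg) {
        Piece *p = arg;
        ssr_singleton(p->copy_id, rearrange_copy, p); /* shared copy runs once */
        do_work(p);                                   /* application work      */
        ssr_send_of_type_to(&p->part, PARTIAL_RESULT, p->results_vp);
    }

The divider would create one such worker per piece, e.g. ssr_create_vp_on_core(worker, &pieces[i], i % num_cores), matching the round-robin distribution described above, and the results VP would report completion to the divider with the fully-specified ssr_send_from_to.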
    2.86  
    2.87 @@ -198,7 +202,7 @@
    2.88  \begin{figure}[ht]
    2.89    \centering
    2.90    \includegraphics[width = 3.2in, height = 1.8in]{../figures/UCC_stylized_for_expl.pdf}
    2.91 -  \caption{Two versions of a stylized Unit and Constraint Collection (UCC). The left UCC treats each unit as the same size, while the right UCC  is weighted by amount of work in a unit.}
    2.92 +  \caption{Two versions of a stylized Unit and Constraint Collection (UCC). The left UCC displays only the logical connections among units, while the right UCC also shows the amount of work in a unit.}
    2.93    \label{fig:UCC_expl}
    2.94  \end{figure}
    2.95  
    2.96 @@ -207,7 +211,7 @@
    2.97  
     2.98  Both versions of the UCC depict constraints on units. Starting at the top, they progress downwards, placing units that constrain above those they constrain, and connecting them by a line representing the constraint. A unit is placed into the row just below the last of the units constraining it, in the first available column. However, the rows and columns are just visual conveniences; the only information in the UCC is the lines between units, which represent constraints among the units. In practice, each unit is also labeled with a unique unit ID, from which the code of the unit and the parameters passed to it can be looked up. 
    2.99  
   2.100 -What distinguishes the two types of UCC is the size of the units. The left one treats all units as the same size, while the right weights the height of a unit by the number of instructions executed in it. By inspection, it can be seen that both have the same structure, but the rightmost does a better job of visually indicating the \textit{useable} parallelism available.
    2.101 +What distinguishes the two types of UCC is the size of the units. The left one shows only the logical graph-like structure of the application, while the right weights the height of a unit by the number of instructions executed in it. When information about unit size is available, the UCC can visually indicate the \textit{usable} parallelism.
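The row-placement rule above amounts to longest-path layering. A minimal sketch, assuming units are numbered in creation order and constraint edges are listed with the constraining unit first (our own representation, not the tool's code):

    /* Each unit's row is one below the deepest unit constraining it.
     * Assumes every constrainer[] entry is numbered lower than its
     * constrained[] partner, i.e. edges arrive in topological order. */
    void place_rows(int n_units, int n_edges,
                    const int *constrainer, const int *constrained,
                    int *row /* out: row per unit, 0 = top */) {
        for (int u = 0; u < n_units; u++)
            row[u] = 0;                          /* unconstrained units: top row */
        for (int e = 0; e < n_edges; e++)        /* longest-path layering pass   */
            if (row[constrainer[e]] + 1 > row[constrained[e]])
                row[constrained[e]] = row[constrainer[e]] + 1;
    }

Columns would then be filled first-available within each row, as described above.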
   2.102  
   2.103  
   2.104  
   2.105 @@ -300,7 +304,7 @@
   2.106  
   2.107  
   2.108  
   2.109 -In this  subsection,  many  SCGs are shown, in Fig \ref{story}. which display the measurements collected on various runs during tuning, all on the same scale for direct comparison. They are  40 columns wide, one for each core, and relative height indicates relative execution time. They have lines in red, orange, and green, which represent application-code constructs. Red is creation of a virtual processor, green is the many-to-one \texttt{send\_of\_type\_to}, and orange is the singleton construct. For better visibility, only constraints that cross cores are shown.
    2.110 +The scheduling consequence graphs displaying the measurements collected on various runs during tuning are shown in Fig \ref{story}, using the same scale for direct comparison. They are 40 columns wide, one for each core, and relative height indicates relative execution time. The lines in red, orange, and green represent application-code constructs. Red is creation of a virtual processor, green is the many-to-one \texttt{send\_of\_type\_to}, and orange is the singleton construct. For better visibility, only constraints that cross cores are shown.
   2.111  
   2.112  
   2.113  After functional debugging, the first tuning run produces the consequence graph seen in Figure \ref{fig:story:a}. The first thing to notice is that it is slimmer than expected: of the 40 available cores, only 13 are being used. Because the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection reveals that a closing curly brace in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed despite using this application as a test program for development of the language runtime, including performance, for several months.
   2.114 @@ -327,9 +331,9 @@
   2.115  
   2.116  Overall it is also noticeable that as work units become smaller, execution time becomes more irregular. Variability in work length correlates with the color, indicating cache behavior has worsened with smaller work size.
   2.117  
   2.118 -Note that the hypothesis, that cache behavior worsened with smaller work sizes, was generated directly from visual cues.    
   2.119 +%Note that the hypothesis that cache behavior worsened with smaller work sizes, was generated directly from visual cues.    
   2.120  
   2.121 -\subsubsection{holes in the core usage}
   2.122 +\subsubsection{Holes in the core usage}
   2.123  
   2.124  In Fig \ref{story:d}, ``holes'' are noticeable. Inspecting these holes closer, we can see that the stalled blocks are at the end of orange lines. This indicates they are waiting upon the completion of a singleton. The pattern of blocks shows that usually  the singleton unit runs before the work unit, but in these cases the singleton code was delayed until after the work on that core.  This is a runtime implementation fluke. The only thing an application programmer can do is change the work size to minimize the impact. (For those curious, the first VP to reach the singleton is granted control, but a ready work VP lands in the queue during the granting activity, so when the runtime finishes granting, the work VP is next, and the VP that now owns the singleton sits and waits for the work to end. All work VPs on other cores that pass through the same singleton  also wait.)  
   2.125  
   2.126 @@ -343,12 +347,12 @@
   2.127  
   2.128  Now that the usage has been seen, we expand on the model behind the visualizations. The model ties the information together, and understanding it helps in generating  hypotheses from the visualization features. 
   2.129  
   2.130 -As seen, the model has two parts, a \emph{Unit \&\ Constraint Collection (UCC)}, and a \emph{Scheduling Consequence Graph} (SCG or just consequence graph,  CG).  The UCC indicates the scheduling  choices  the application allows, and so shows what the programmer has control over. Whereas the consequence graph says which of those were actually taken during the run and the consequences of that set of choices.
   2.131 +As seen, the model has two parts, a \emph{Unit \&\ Constraint Collection (UCC)}, and a \emph{Scheduling Consequence Graph} (SCG or just consequence graph,  CG).  The UCC indicates the scheduling  choices  the application allows, and so shows what the programmer has control over. The consequence graph says which of those were actually taken during the run and the consequences of that set of choices.
   2.132  
   2.133  We give a more precise description of UCC, then consequence graph, in turn.
   2.134  However, space is too limited for a complete definition, which is given in a companion paper submitted to a longer format venue.
   2.135  \subsection{Unit \& Constraint Collection}
   2.136 -The UCC contains all the units of work that get scheduled during a run, and all constraints the application places on scheduling those units. That's a nice solid definition, but things aren't quite that simple. The complication is that different classes of application exist, with two degrees of freedom that determine how much of the UCC is actually defined in the application vs the input data, or even in the runtime.
   2.137 +The UCC contains all the units of work that get scheduled during a run, and all constraints the application places on scheduling those units. This is the simple definition, but unfortunately, this information is not always easy to obtain. The complication is that different classes of application exist, with two degrees of freedom that determine how much of the UCC is actually defined in the application vs the input data, or even in the runtime.
   2.138  
    2.139  Some applications have everything determined in the code, with all units fixed, and all constraints fixed. An example is matrix multiply with fixed-size matrices. But for others, the shape of the UCC is only partially defined by the application code. Take the matrix multiply used in Section \ref{sec:casestudy}, where an input parameter determines the number of units created. Here, the UCC is different for each parameter value. An extreme example is an NP-complete problem, with redividable units, for which the units are a function of both the input data \emph{and} decisions made by the runtime.
   2.140  
   2.141 @@ -356,7 +360,7 @@
   2.142    \centering
   2.143    \includegraphics[width = 3.2in, height = 1.8in]{../figures/UCC_weighted.pdf}
    2.144   \caption{Concrete UCC measured for matrix multiply during the performance tuning in Section \ref{sec:casestudy}.}
   2.145 -  \label{fig:UCC_Concreteness}
   2.146 +  \label{fig:UCC_example}
   2.147  \end{figure}
   2.148  
   2.149      We call a fully specified UCC a \emph{concrete} UCC.  Every run of an application eventually winds up defining a concrete UCC, such as produced for the performance tuning  seen  in Fig \ref{fig:UCC_example}. But the amount of UCC made concrete by the application alone falls into a two-dimensional grid. One dimension covers the units, the other the constraints.
   2.150 @@ -374,7 +378,7 @@
   2.151  The closer the application-derived UCC is to the origin, the less additional information it needs to become concrete. The UCC labeled A in the figure is fully concrete just from the source code alone (representing for example, matrix multiply with fixed size matrices). The UCC labeled B requires the input data and parameters to be specified before its units are concrete, but just parameters to make its constraints fully concrete (as per ray-tracing, with bounce depth specified as a parameter). The UCC labeled C only has variability in its constraints, which require input data (for example, H.264 motion vectors).
   2.152  But even the least concrete UCC, out at the end of the diagonal (D in the figure), becomes concrete during a run of the application.
   2.153   
   2.154 -Notice, though, that  a fully concrete UCC still has degrees of freedom, in which units to which hardware and the order of execution.  The decisions fix interactions within the hardware, to yield the communication patterns and consequent performance seen during the run. 
    2.155 +Notice, though, that even a fully concrete UCC still has degrees of freedom: which unit runs on which hardware, and in what order. These decisions fix interactions within the hardware, to yield the communication patterns and consequent performance seen during the run. 
   2.156  
   2.157  An added twist is that an application has a life-line, spanning from code all the way through the run, and its representation  may change at the different stages of life. It starts as pristine source, then moves  into specialization where code is translated into different representations than the original, and finally the specialized code is run. The UCC often changes between these points in  the life-line.
   2.158  
   2.159 @@ -382,24 +386,22 @@
   2.160  
   2.161  The UCC still indicates what is inside the application's control vs under the runtime's control, even for applications that land far out on the diagonal. It thus indicates what can be done statically: the further out on the diagonal a UCC is, the less scheduling can be done statically in the toolchain.
   2.162  
   2.163 -In this paper, we do not suggest how to represent UCCs far out on the diagonal. One of those actually indicates a multi-verse of concrete-UCCs. Which of them materializes  depends on the data that shows up and what the scheduler does. We only represent the concrete UCC that materializes during a run and leave the question of representing less concrete ones to future work.
   2.164 +In this paper, we do not suggest how to represent UCCs far out on the diagonal. One of those actually indicates a multi-verse of concrete-UCCs. Which of them materializes  depends on the data and what the scheduler decides. We only represent the concrete UCC that materializes during a run and leave the question of representing less concrete ones to future work.
   2.165    
   2.166  
   2.167  \subsection{Scheduling Consequence Graph}
   2.168 -Whereas the UCC concentrates on application-derived information, the second part of the model adds-in effects of runtime  details and hardware. It's called the Scheduling Consequence Graph because it links scheduling decisions to their performance consequences. But it also indicates the role, in the decision, of application, runtime and hardware details.
    2.169 +Whereas the UCC concentrates on application-derived information, the second part of the model adds in the effects of runtime details and hardware. We call it the Scheduling Consequence Graph because it links scheduling decisions to their performance consequences. But it also indicates the role, in the decision, of application, runtime and hardware details.
   2.170  As a result it identifies instances of lost performance, and links them to the cause of the loss, as seen in Section \ref{sec:casestudy}.
   2.171  
   2.172 -
   2.173 -
   2.174 -To distinguish from the UCC, the consequence graph shows the behavior resulting from scheduling decisions actually \emph{made},   from among those \emph{possible.} The UCC shows just the possibilities. Hence, a consequence graph shows \emph{one} of the possible choice-sets allowed by the UCC.
    2.175 +In contrast to the UCC, the consequence graph shows the behavior resulting from scheduling decisions actually \emph{made}, from among those \emph{possible}. The UCC shows just the possibilities. Hence, a consequence graph shows \emph{one} of the possible choice-sets allowed by the UCC.
   2.176  
   2.177  A consequence graph accounts for each bit of core time. It has boxes and arcs, with the boxes divided into regions. The boxes  each represent all core time assigned to one work unit, with each region inside representing a segment of time that the core was engaged in a specific type of activity.  An arc links  regions (or boxes)  and represents a causality of some kind.
   2.178  
   2.179  There is one kind of region for each reason that the core is being used (or being forced idle), and several kinds of arcs, one for each type of causality between regions.
   2.180  
   2.181 -The core activities associated with region types are: application work, waiting for communication of work data, managing constraints, choosing assignment of work onto cores, and runtime internals. The runtime internals have sub-categories but space is limited so we skip those here.
    2.182 +The core activities associated with region types are: application work, waiting for communication of work data, managing constraints, choosing assignment of work onto cores, and runtime internals. The runtime internals in turn have sub-categories, which we will not detail here, but some of which will be mentioned in Section \ref{sec:Implementation}.
   2.183  
   2.184 -The arc types, representing the type of causal relationship, are: control dependency in the base language, parallel constraint that had to be satisfied (IE, one unit did something to satisfy a constraint on the other, causing it to be free to be scheduled), runtime internal causality such as a global lock (runtime on one core releases the lock, causing the other to acquire it), and arcs that represent hardware causal relationships (one work-unit finishing on a core causes another work-unit to start there, given the choice by the runtime). The formal details are given in the longer format companion paper.
   2.185 +The arc types, representing the type of causal relationship, are: control dependency in the base language, parallel constraint that had to be satisfied (i.e., one unit did something to satisfy a constraint on the other, causing it to be free to be scheduled), runtime internal causality such as a global lock (runtime on one core releases the lock, causing the other to acquire it), and arcs that represent hardware causal relationships (one work-unit finishing on a core causes another work-unit to start there, given the choice by the runtime). The formal details are given in the longer format companion paper.
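The regions and arcs map naturally onto a small graph structure. A sketch in C of one possible representation (type and field names are ours, for illustration, not the tool's internals):

    /* One region kind per reason a core is busy (or forced idle),
     * one arc kind per type of causality, as enumerated in the text. */
    typedef enum {
        REG_WORK, REG_WAIT_COMM, REG_CONSTRAINT_MGMT,
        REG_ASSIGN_CHOICE, REG_RUNTIME_INTERNAL
    } RegionKind;

    typedef enum {
        ARC_CONTROL_DEP, ARC_PARALLEL_CONSTRAINT,
        ARC_RUNTIME_INTERNAL, ARC_HARDWARE
    } ArcKind;

    typedef struct {                   /* segment of core time of one activity */
        RegionKind kind;
        unsigned long long start_cycle, end_cycle;
    } Region;

    typedef struct Box {               /* all core time assigned to one unit   */
        int unit_id, core;
        Region *regions;
        int num_regions;
    } Box;

    typedef struct {                   /* causality between boxes; could refer */
        ArcKind kind;                  /* to individual regions instead        */
        Box *from, *to;
    } Arc;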
   2.186  
   2.187  We will now look at each source of causal relationship.
   2.188  
   2.189 @@ -409,7 +411,7 @@
   2.190  
   2.191  \paragraph{Runtime internal causality} Runtime implementation details may introduce ``extra" causalities between units. For example, the version of VMS we instrumented for this paper runs separately on each core and relies upon a global lock for accessing shared runtime information. This lock introduces a causal relationship  when the runtime on one core is attempting to process one unit, but must wait for the runtime on a different core to finish with its unit.
   2.192  
   2.193 - Normally, these are not displayed explicitly, due to clutter, but can be turned on when needed, say, to determine the cause of a particular pattern of core usage.
   2.194 + Normally, these are not displayed explicitly, due to clutter, but can be turned on when needed, for instance to determine the cause of a particular pattern of core usage.
   2.195  
   2.196  \paragraph{Hardware causality} The physical fact that a given resource can only be used by one work-unit at a time introduces hardware causalities. When multiple units are free to execute, but all cores are busy, then completion of a unit  on one core causes (in part) the next ready unit to run on that core. 
   2.197  
   2.198 @@ -443,29 +445,38 @@
   2.199  
   2.200  Figure \ref{fig:UCC_Levels} shows that these UCCs are related to each other in the same hierarchy as the runtimes. A unit scheduled in one runtime is broken into smaller units in the one below it. Each of those units is  then separately scheduled, making a separate UCC just for them. So, as the figure shows, a unit in one UCC has an entire UCC inside it.  
   2.201  
   2.202 - Great, that makes sense, now what about the consequence graphs?
   2.203 + %Great, that makes sense, now what about the consequence graphs?
   2.204  
   2.205  \subsubsection{Levels of Consequence Graph}
   2.206  
   2.207 -A consequence graph ties together scheduling decisions made on units with the consequences in the hardware of those decisions. But there are now multiple levels of  consequence graph, one for each UCC. With multiple levels, a lower-level runtime is treated as a single  ``core" by the level above it. So, what does ``consequence'' mean in this case?   The answer is that for performance tuning, the consequence of  interest is the critical path.
   2.208 +With multiple levels of scheduling activities, a scheduling consequence graph can be constructed for each level. How, then, do these tie together?
   2.209  
   2.210 - That gives two  goals: first, to get the consequences to the critical path, then second,  to charge each segment of time on an actual physical core to exactly one box in one of the levels  of consequence graph.
   2.211 -We note that the critical path for one level  is in terms of the work-times of  its units, but those units now each have an entire consequence graph inside it. Hence the work time of a unit is  the critical path time for the consequence graph inside it.   
   2.212 +For a given level, a lower-level runtime is seen as a single processing unit. What is the consequence of deciding to schedule a given work unit on a given processing unit? When trying to improve performance, it is the time taken until completion of the work unit that is of interest.
   2.213  
   2.214 - Now, the question is, which portions of the physical core time should get counted towards one  higher-level unit? The answer is seen by looking at all the levels in the matrix multiply application, from the story in Section \ref{sec:casestudy}. Going up,  the level above the execution under study involves the invocation   of entire applications, via OS commands. At that level, a unit is an entire process, and the critical path of the SCG in Section  \ref{sec:casestudy} is the work-time of that unit.  That leaves the  time spent inside the OS as the runtime overhead assigned to that unit. 
    2.215 +If we now look at the lower level, the processing unit is often not in fact a single unit but itself a parallel system composed of several processing units. When can the upper-level work unit be considered completed? When all the lower-level units of which it is composed are completed. Once the lower-level consequence graph is established, this time can be easily determined: it is the critical path across this graph, the longest path from the units having no predecessors to the units with no dependents.
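As a sketch of that computation (our own minimal representation, under the assumption that units are numbered so that every predecessor has a lower number and edges are sorted by source):

    /* Critical path of a lower-level consequence graph: the longest
     * work-weighted path from units with no predecessors to units with
     * no dependents. finish[] is caller-provided scratch, n_units long. */
    unsigned long critical_path(int n_units, const unsigned long *work,
                                int n_edges, const int *src, const int *dst,
                                unsigned long *finish) {
        unsigned long longest = 0;
        for (int u = 0; u < n_units; u++)
            finish[u] = work[u];                 /* longest path ending at u   */
        for (int e = 0; e < n_edges; e++)        /* relax in topological order */
            if (finish[src[e]] + work[dst[e]] > finish[dst[e]])
                finish[dst[e]] = finish[src[e]] + work[dst[e]];
        for (int u = 0; u < n_units; u++)
            if (finish[u] > longest) longest = finish[u];
        return longest;
    }

The returned value is what the level above records as the work time of the enclosing unit.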
   2.216  
   2.217 -In the other direction, the lower level is the operation of the out-of-order pipelines in the cores, which have the equivalent of a runtime, in hardware. The hardware runtime consists of  the dependency logic that determines which instructions are free, and the issue logic that determines which functional unit performs a free instruction. Hence, a unit is one instruction. The work time is the number of cycles it contributes to the critical path, due to dependencies limiting overlap. And the runtime overhead is the operation of the dependency and issue logic. Those don't contribute \textit{directly} to the critical path, so instructions effectively have no runtime overhead. 
    2.218 +Consider the concrete example of the SSR matrix multiply application, from the story in Section \ref{sec:casestudy}. Going up, the level above the execution under study involves the invocation of entire applications, via OS commands. At that level, a unit is an entire process, and the critical path of the SCG in Section \ref{sec:casestudy} is the work-time of that unit.
   2.219  
   2.220 -We return now to   the question of  the  time a higher-level unit uses the cores outside of its sub-units. 
   2.221 -Consider, in the matrix multiply code, the core usage spent  while  dividing the work and handing it to other cores. This is not work of the application, but overhead spent breaking the single application-unit into multiple sub-units.   Even though it is in the application code, it's purpose is implementing the execution model, which makes it runtime overhead. But which runtime level? It's not part of the SSR language runtime, so not overhead of a unit inside the application,   but rather it's for the application itself, as a unit! So, the core time spent calculating the division  gets counted towards the application-level unit, while the time spent inside the SSR runtime creating the meta-units is counted towards those lower SSR-level units. But both are in the critical path, so both charged as work time of the higher-level unit.
    2.222 +One thing that is specific to the consequence graph and does not appear in the UCC is overhead. How is overhead dealt with across levels? Looking again at our SSR matrix multiply, we can see that the overhead recorded in the SSR-level consequence graph, when it contributes to the critical path, is counted as part of the work-time of the application unit. Only the overhead of the runtime level under investigation is distinguished in the SCG.
   2.223  
    2.224 +That this is congruent with how overhead is often intuitively treated becomes apparent when we consider the next lower level. Each core, indeed, is a parallel processing unit composed of several sub-units, namely functional units such as arithmetic or floating-point units. What is shown as one SSR work unit in Section \ref{sec:casestudy} is further broken down into smaller work units: individual instructions. Here too, the execution time of a higher-level work unit is the critical path from the start of the first instruction in the unit to the end of the last instruction.
    2.225 +The issue logic in a modern out-of-order processor is in fact a relatively complicated scheduler that analyses the register-use dependencies between instructions (=constraints between work units) and dispatches (=assigns) them to different functional units. Overhead, at this level, would be the issue logic, and it contributes to the critical path whenever dispatch is the bottleneck. Yet nobody would think to measure the execution time of a calculation differently depending on whether the limiting factor was the number of arithmetic units available or the size of the instruction buffer.
   2.226  
   2.227 - Another way to view this is that  by the process of elimination, the only core-time not accounted for elsewhere is the time spent dividing up a unit into smaller ones, and time spent accumulating the individual results back together. So this is what gets charged to the  higher-level unit.
   2.228 +Let us now consider how this mapping between higher and lower level appears when viewed from the lower level, `looking up' so to speak.
    2.229 +From inside the core, what is considered overhead at the SSR level is in fact just more instructions, which appear just as a different unit would.
   2.230 +One level up, the same holds true:
    2.231 +The overhead of the single application-level unit would be the time spent in the OS, doing things such as setting up the application's memory space. This time is not visible from inside the application. However, it uses the same processing units (in this case, cores), and to them appears not very different from application work. The noticeable difference, in this case, is that the OS runtime is written using a different programming model than SSR.
   2.232  
   2.233 -The last question is how to handle communication consequences, which result from decisions made in all levels? The decisions in higher-level runtimes set the context for decisions in lower-level ones, which links a higher-level choice  to the consequences resulting from the lower-level choices. But the value of a consequence graph comes from linking the size of boxes in it to the decisions made by the scheduler. It's not clear how to divide-up  the time  cores spend waiting for non-overlapped communication, to assign portions to different levels. We have no good answer at the moment and leave it for future work.
   2.234  
   2.235 +%Consider, in the matrix multiply code, the core usage spent  while  dividing the work and handing it to other cores. This is not work of the application, but overhead spent breaking the single application-unit into multiple sub-units.   Even though it is in the application code, it's purpose is implementing the execution model, which makes it runtime overhead. But which runtime level? It's not part of the SSR language runtime, so not overhead of a unit inside the application,   but rather it's for the application itself, as a unit! So, the core time spent calculating the division  gets counted towards the application-level unit, while the time spent inside the SSR runtime creating the meta-units is counted towards those lower SSR-level units. But both are in the critical path, so both charged as work time of the higher-level unit.
   2.236  
   2.237  
   2.238 +% Another way to view this is that  by the process of elimination, the only core-time not accounted for elsewhere is the time spent dividing up a unit into smaller ones, and time spent accumulating the individual results back together. So this is what gets charged to the  higher-level unit.
   2.239 +
   2.240 +%The last question is how to handle communication consequences, which result from decisions made in all levels? The decisions in higher-level runtimes set the context for decisions in lower-level ones, which links a higher-level choice  to the consequences resulting from the lower-level choices. But the value of a consequence graph comes from linking the size of boxes in it to the decisions made by the scheduler. It's not clear how to divide-up  the time  cores spend waiting for non-overlapped communication, to assign portions to different levels. We have no good answer at the moment and leave it for future work.
   2.241 +
   2.242 +In practice, the question of levels does not come up very often, as there is generally only one level of interest being investigated. However, the perspective is occasionally important, e.g. when identifying the different components of the model in a concrete runtime so as to insert instrumentation code in the right places.
   2.243 +
   2.244  
   2.245  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   2.246  \section{Implementation}
   2.247 @@ -522,7 +533,7 @@
   2.248  SSR is based on virtual processors. They execute sequential code that occasionally calls a parallel construct, which suspends the VP and switches to the runtime. This means that each VP contains a sequence of units, with each unit the trace-segment between two SSR library calls.
   2.249   
   2.250  
   2.251 -SSR has both deterministic constraints, which specify the source and destination VP, such as send\_from\_to, and non-deterministic ones, in which the runtime is what chooses which VPs interact, such as send\_of\_type\_to\ and singleton. Deterministic ones display the same in the UCC and the SCG. However, non-deterministic ones need all possibilities to be determined for the UCC, requiring extra instrumentation code.
    2.252 +SSR has both deterministic constraints, which specify the source and destination VP, such as send\_from\_to, and non-de\-ter\-mi\-nis\-tic ones, in which the runtime chooses which VPs interact, such as send\_of\_type\_to\ and singleton. Deterministic ones display the same in the UCC and the SCG. However, non-deterministic ones need all possibilities to be determined for the UCC, requiring extra instrumentation code.
   2.253  
   2.254  \subsubsection{Collecting a unit}
   2.255  Code to record a new unit is inserted into VMS at the transition out of the runtime and into application code. Code to record the unit end is inserted into the VMS primitive that switches back to the runtime. 
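A sketch of where those two probes sit, with hypothetical names throughout (the actual VMS symbols and switching primitives differ):

    /* Hypothetical probe placement at the two unit boundaries. */
    typedef struct VP VP;
    void context_switch_to(VP *vp);               /* assumed VMS primitive */
    VP  *runtime_vp_of_core(int core);            /* assumed lookup        */
    unsigned long long read_timestamp(void);      /* e.g. rdtsc on x86     */
    void trace_unit_start(int vp_id, unsigned long long t);
    void trace_unit_end(int vp_id, unsigned long long t);

    /* transition out of the runtime into application code: unit begins */
    void vms_dispatch_app_vp(VP *vp, int vp_id) {
        trace_unit_start(vp_id, read_timestamp());
        context_switch_to(vp);
    }

    /* primitive that switches back to the runtime: unit ends */
    void vms_return_to_runtime(int vp_id, int core) {
        trace_unit_end(vp_id, read_timestamp());
        context_switch_to(runtime_vp_of_core(core));
    }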
   2.256 @@ -576,7 +587,7 @@
   2.257  Both the UCC and the SCG are internally represented as directed graphs, with units as nodes.
   2.258  
   2.259  \subsubsection{UCC}
   2.260 -For the UCC, units can be either unweighted or weighted. Weighted units appear as rectangles with height proportional to the weight, unweighted units appear as circles. We weight the units with the number of instructions in the work. This removes some of the influence of scheduling and data, such as cache misses.
   2.261 +For the UCC, units can be either unweighted or weighted. Weighted units appear as rectangles with height proportional to the weight, unweighted units appear as circles. We weight the units with the number of instructions in the work. This is not entirely execution-independent, but removes some of the influence of scheduling and data, such as cache misses.
   2.262  
    2.263  A critical path algorithm calculates the vertical position of a unit, as its position within the critical path. The nodes are spread horizontally such that none overlap.
   2.264  
     3.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/bib_for_papers_12_Jy_15.bib	Sun Jul 15 00:17:30 2012 -0700
     3.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/bib_for_papers_12_Jy_15.bib	Thu Aug 02 18:59:47 2012 +0200
     3.3 @@ -8,10 +8,9 @@
     3.4  	pages = {284--299}
     3.5  }
     3.6  @InBook{PerfToolPoem,
     3.7 -	title = {The Poems of John Godfrey Saxe, Complete edition},
     3.8 -	chapter = {The Blind Men and the Elephant},
     3.9 +	title = {The Blind Men and the Elephant},
    3.10  	author = {John Godfrey Saxe},
    3.11 -	publisher = {Boston: James R. Osgood and Company},
     3.12 +	publisher = {James R. Osgood and Company}, address = {Boston}, series = {The Poems of John Godfrey Saxe, Complete edition},
    3.13  	year = 1873,
    3.14  	pages = {77-78}
    3.15  }