VMS/0__Writings/kshalle

changeset 90:dbc65e36c366

perf tune: last minute comments from ben
author Nina Engelhardt <nengel@mailbox.tu-berlin.de>
date Fri, 17 Aug 2012 19:01:07 +0200
parents 196871d9eaef
children d8024c56ef61
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
diffstat 2 files changed, 50 insertions(+), 52 deletions(-) [+]
line diff
     1.1 Binary file 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf has changed
     2.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Thu Aug 16 08:17:43 2012 -0700
     2.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri Aug 17 19:01:07 2012 +0200
     2.3 @@ -61,9 +61,9 @@
     2.4  %           {TU Berlin}
     2.5  %           {b.juurlink@tu-berlin.de}
     2.6  
     2.7 -\authorinfo{Anonymous}
     2.8 -           {No Institute}
     2.9 -           {email@noplace}
    2.10 +\authorinfo{~}
    2.11 +           {~}
    2.12 +           {~}
    2.13  
    2.14  % This \maketitle command is required from ieeepes version 4.0, to make
    2.15  % ieeepes work correctly with newer LaTeX versions.
    2.16 @@ -71,21 +71,17 @@
    2.17  
    2.18  
    2.19  \begin{abstract}
    2.20 -Performance tuning is an important aspect of parallel programming. Yet when trying to pinpoint the causes of performance loss, often times insufficient knowledge  of the internal structure of the application and the runtime is available to understand how the observed patterns of performance have come to pass.
     2.21 +Performance tuning is an important aspect of parallel programming. Yet when trying to pinpoint the causes of performance loss, the user often lacks sufficient knowledge of the internal structure of the application and the runtime to understand how the observed patterns of performance came to pass.
    2.22  A trend in parallel programming languages is towards models that capture more structural information about the application, in an effort to increase both performance and ease of programming. We propose using this structural information in  performance tuning tools to make the causes of performance loss more readily apparent.
    2.23  Our work produces a universal, adaptable set of performance visualizations that integrates this extra application structure, via a new model of parallel computation. The visualizations clearly identify idle cores, and tie the idleness to causal interactions within the runtime and hardware, and from there to the parallelism constructs that constrained the runtime and hardware behavior, thereby eliminating guesswork.
    2.24  This approach can be used to instrument the runtime of any parallel language or programming model without modifying the application. As a case study, we applied it to the SSR message-passing model, and we walk through a tuning session on a large multi-core machine to illustrate the improvements in identifying performance loss and generating hypotheses for the cause. 
    2.25  \end{abstract}
    2.26  
    2.27 -
    2.28 -
    2.29 -
    2.30 -
    2.31  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    2.32  \section{Introduction}
    2.33  \label{sec:intro}
    2.34  
    2.35 -Performance visualization and tuning tools for parallel programs are critical to achieving good performance, and yet the experience of using current solutions leaves something to be desired. Often, performance tuning requires interpreting abstract views of usage or statistics, trying to \emph{guess} the cause of performance loss. Despite many different views of performance measurements, such as frequency by line of code, message sends and receives, and core usage timelines, the user does not know why a function runs in a particular spot on a particular core, nor whether that is desired behavior or erroneous behavior.
    2.36 +Performance visualization and tuning tools for parallel programs are critical to achieving good performance, and yet the experience of using current solutions leaves something to be desired. Often, performance tuning requires interpreting abstract views of usage or statistics, trying to \emph{guess} the cause of performance loss. Despite many different views of performance measurements, such as frequency by line of code, message sends and receives, and core usage timelines, the user does not know why a function runs in a particular spot on a particular core, nor whether that is desired or erroneous behavior.
    2.37  
     2.38  Examining these myriad views can feel like being advised by the fabled six blind men examining an elephant by touching the part nearest to each\cite{PerfToolPoem}: one, reaching the side, compares it to a wall; the second, feeling the tusk, likens it to a spear; the next, touching the trunk, is reminded of a snake, and so on. All of these views are indeed correct and supply important information, but they do not connect to each other to provide a view of how the parts interact to form a whole.
    2.39  
    2.40 @@ -93,32 +89,34 @@
    2.41  
    2.42  Our approach provides a more complete picture by following a new model of parallel computation. The model suggests that the relevant application features are the units of work performed and the  constraints on scheduling them, and that it is crucial to record and display the scheduling decisions taken by the runtime. The model simplifies hypothesis generation by connecting the measurements to each other and to the application, analogous to providing a view of the whole elephant.
    2.43  
    2.44 -Our instrumentation, based on the model, detects the  units within the application code when they are scheduled during a run, and visually links each to the units upon which its execution is conditional. The visual arrangement makes identifying idle cores immediate, enables quickly generating the correct hypotheses for the causes of the performance losses, and quickly pinpointing where in the application code changes can be made to improve performance. In effect, the visualization serves as a map showing idle cores linked back to the section of application code responsible.
    2.45 +Our instrumentation, based on the model, detects the units within the application code when they are scheduled during a run, and visually links them to the units upon which their execution is conditional. The visual arrangement makes identifying idle cores immediate, enables quickly generating the correct hypotheses for the causes of the performance losses, and quickly pinpointing where in the application code changes can be made to improve performance. In effect, the visualization serves as a map showing idle cores linked back to the section of application code responsible.
    2.46  
     2.47  The model of computation and our visual approach to using it are illustrated with a case study of performance tuning a standard parallel application on a large multi-core system.
    2.48  
    2.49 -We start with an overview of previous approaches in Section \ref{sec:related}. We show usage of our visualizations through a case study in Section \ref{sec:casestudy}, and then expand on the model behind it in Section \ref{sec:theory}. Section \ref{sec:Implementation} will relate the model to implementation details. Finally, we conclude in Section \ref{sec:conclusion}.
     2.50 +This paper is organized as follows: We start with an overview of previous approaches in Section \ref{sec:related}. We show usage of our visualizations through a case study in Section \ref{sec:casestudy}, and then expand on the model behind them in Section \ref{sec:theory}. Section \ref{sec:Implementation} relates the model to implementation details. Finally, we conclude in Section \ref{sec:conclusion}.
    2.51  
    2.52  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    2.53  \section{Background and Related Work}
    2.54  \label{sec:related}
    2.55  
    2.56  When choosing a performance tuning tool, three primary factors to consider are:
    2.57 +\vspace{-3pt}
    2.58  \begin{itemize} 
    2.59 +\setlength{\itemsep}{0em}
    2.60  \item the ease of recognizing performance loss
    2.61  \item the ease of forming a hypothesis to explain the loss
    2.62 -\item the ease of linking the hypothesis to changes to make in application code
    2.63 +\item the ease of linking the hypothesis to changes to make in the application code
    2.64  \end{itemize}
    2.65 -
    2.66 +\vspace{-3pt}
     2.67  The mental model that underlies the tool plays a key role in these three usefulness factors. This can be seen by examining the activity of performance tuning, which is iterative. To start, the programmer takes measurements during execution and compares them to the desired outcome. A hypothesis for what caused any differences is generated using a mental model of causal relationships within the execution. The mental model is then also used to link the hypothesis to code, to suggest a change to make to improve performance. After modification, the code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central during performance tuning.
    2.68  
    2.69 -As a result of its role when using  a tuning tool, the mental model underlying the tool is   a major focus when comparing  tools.  The mental model of our approach is based on a new model of parallel computation and is the main way  our approach differs. The new model makes it  easier to form a hypothesis of the cause of performance losses, and easier to link that back to changes to make in application code.
     2.70 +As a result of its role when using a tuning tool, the mental model underlying the tool is a major focus when comparing tools. The mental model of our approach is based on a new model of parallel computation, which is the main way our approach differs from others. The new model makes it easier to form a hypothesis for the cause of performance losses, and easier to link that hypothesis back to changes to make in the application code.
    2.71  
     2.72  Historically, the dominant mental model for parallel computation has been the software-thread abstraction. The model is very close to the hardware, without much higher-level semantics, which makes it a problem when analyzing parallel performance. Yet as the earliest and most widespread parallel programming model, and one that serves as the basis on top of which many other models are implemented, it must be supported.
    2.73  
    2.74 -A popular version is  pthreads, which introduces randomness through OS time-slicing, and whose synchronization constructs  imply only indirect connections. For instance, very little conclusion can be drawn about the relationship between the computations in two separate threads from their accessing the same lock consecutively. Consequently, tools such as Paradyn \cite{PerfToolParadyn} or VTune \cite{PerfToolVTune} that rely on this model  have a hard time connecting measurements to the application code details. They collect a wealth of statistics, but the application is to the tool a foreign process, where things ``just happen'' without causal links between events and code. Such tools generate conclusions such as ``application bottleneck is synchronisation'', and  ``too much time is spent in spinlocks'', without help as to what in the code causes this. For example, if there is more than one use of spinlocks in the application, it is not clear what point in the code is the problem.
     2.75 +A popular version is pthreads, which introduces randomness through OS time-slicing, and whose synchronization constructs imply only indirect connections. For instance, little can be concluded about the relationship between the computations in two separate threads when they access the same lock consecutively. Consequently, tools such as Paradyn \cite{PerfToolParadyn} or VTune \cite{PerfToolVTune} that rely on this model have a hard time connecting measurements to the application code details. They collect a wealth of statistics, but to the tool the application is a foreign process, where things ``just happen'' without causal links between events and code. Such tools generate conclusions such as ``application bottleneck is synchronization'' and ``too much time is spent in spinlocks'', without help as to what in the code causes this. For example, if there is more than one use of spinlocks in the application, it is not clear which point in the code is the problem.
    2.76  
    2.77 -One fix to these issues is to allow the users to introduce measuring points into their own code, at the cost of increased programmer effort.  Because instrumentation code is written in the source language, it has access to application concepts. This advantage can be kept with automated instrumentation, by providing an instrumenting compiler, like the Tau \cite{PerfToolTau} project does.
    2.78 +One fix to these issues is to allow users to introduce measuring points into their own code, at the cost of increased programmer effort.  Because instrumentation code is written in the source language, it has access to application concepts. This advantage can be kept with automated instrumentation, by providing an instrumenting compiler, like the Tau \cite{PerfToolTau} project does.
    2.79  
    2.80  However, as long as the underlying execution model is still threads, there is no meaningful structure common to all applications to use to generate expressive measurement quantities. Function boundaries and the call graph are not sufficient when parallel execution is performed. The sequence and frequency of function calls tell little about parallel performance impacts because they have no bearing on synchronization events.  Unfortunately, pthreads does not capture even hints as to \emph{why} a given function call ends up blocking or not blocking, and what the effects on other threads are.
    2.81  
    2.82 @@ -129,14 +127,16 @@
    2.83  Fortunately, parallel languages are evolving in the same direction, and it is widely believed that semantically rich parallel constructs, when they match the structure of the application well, are key to improved productivity. With languages such as CnC \cite{CnCInHotPar} and StarSs \cite{StarSs}, the information available to the language runtime offers a much clearer picture of the constraints  placed on the execution of tasks and where they come from. Consequently, the application structure can be well reflected in visualizations, as the Temanejo debugger  \cite{PerfToolStarSs} for StarSs demonstrates.
    2.84  
     2.85  In the following section, we will demonstrate how our approach overcomes the challenges faced by these other approaches through a combination of the following features:
    2.86 +\vspace{-3pt}
    2.87  \begin{itemize}
    2.88 +\setlength{\itemsep}{0em}
     2.89 \item  it visualizes all the factors influencing scheduling decisions, given sufficient language and runtime prerequisites
    2.90  \item it has a filtering mechanism that shows only the most relevant groups of constraints, as desired
    2.91  \item it has flexibility to capture and represent any type of constraints
    2.92  \item it works for any parallel programming library or language
    2.93  \item it instruments only the runtime, once,  for all applications
    2.94  \end{itemize}
    2.95 -
    2.96 +\vspace{-5pt}
    2.97  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    2.98  \section{Case Study: Analyzing an SSR Application}
    2.99  \label{sec:casestudy}
   2.100 @@ -158,11 +158,11 @@
   2.101  It then creates a results VP that receives a partial-result from each piece and accumulates the results. The  divider VP  then waits for the results VP to indicate completion, after which the language runtime shuts down.
   2.102  
   2.103  \subsection{Programming Model}
   2.104 -The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors (VPs), which are equivalent to software threads. The example application uses commands for creating and destroying VPs, and  two kinds  of  send-receive paired operations. 
    2.105 +The language used was Synchronous Send-Receive (SSR), a language we implemented ourselves, based on rendezvous-style send and receive operations made between virtual processors (VPs), which are equivalent to software threads. The example application uses commands for creating and destroying VPs, and two kinds of send-receive paired operations. 
   2.106  
   2.107 -The first, \emph{send\_from\_to} specifies both sender and receiver VPs. The results VP uses this to tell the divider VP that the work is complete. The second, \emph{send\_of\_type\_to}, specifies only a specific receiver, leaving the sender  anonymous. It has increased flexibility while maintaining some control over scope. The worker VPs use this  to send their partial-result to the results VP. 
    2.108 +The first, \emph{send\_from\_to}, specifies both sender and receiver VPs. The results VP uses this to tell the divider VP that the work is complete. The second, \emph{send\_of\_type\_to}, specifies only a specific receiver, leaving the sender anonymous. This gives increased flexibility while maintaining some control over scope. The worker VPs use it to send their partial result to the results VP. 
   2.109  
   2.110 -The language also includes a \emph{singleton} construct that designates a piece of code as to be executed only once. It was used to shift work out of the divider, which is sequential, over to the worker-pieces, which spreads the work across many cores.  The work is setup that many pieces share but is only to be performed once. 
    2.111 +The language also includes a \emph{singleton} construct that designates a piece of code to be executed only once. It is used to shift the work of copying matrix fragments out of the divider, which is sequential, over to the worker-pieces, which spreads the work across many cores. Several worker VPs share fragments, but the copy only has to be performed once. The construct allows whichever VP is ready first to perform the copy.
   2.112  
   2.113  Miscellaneous performance constructs are also available, such as one to force which core a virtual processor is assigned to. We use this in our example program to control scheduling.
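
To make the constructs concrete, the following is a minimal sketch of how the example application might use them. Only the construct names (\texttt{create\_VP}, \texttt{send\_of\_type\_to}, \texttt{receive\_from\_to}, \texttt{singleton}) come from the text above; all other types, names, and signatures are hypothetical illustrations, not SSR's actual API.

\begin{verbatim}
/* Sketch only -- hypothetical signatures around the SSR constructs.
   VirtProcr, Piece, numPieces, etc. are assumed to be declared. */
void divider(VirtProcr *me) {
    for (int p = 0; p < numPieces; p++) {
        VirtProcr *worker = create_VP(&do_work, &pieces[p], me);
        pin_VP_to_core(worker, p % numCores);  /* performance construct */
    }
    VirtProcr *results = create_VP(&accum_results, NULL, me);
    receive_from_to(results, me);  /* rendezvous: wait for "done" */
}

void do_work(VirtProcr *me, Piece *piece) {
    /* whichever VP is ready first performs the shared copy */
    singleton(&copy_transpose, piece->sharedFragment);
    compute_partial_result(piece);
    send_of_type_to(RESULT, piece->partial, resultsVP, me);
}
\end{verbatim}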
   2.114  
   2.115 @@ -172,9 +172,9 @@
   2.116  
   2.117  \subsection{The Visualizations}
   2.118  \label{subsec:visualization_def}
   2.119 - The first visualization is what we refer to as a  Unit \& Constraint Collection, or UCC. It depicts the constraints on the scheduling decisions that come from the application. Constraints can be dependencies stated explicitly in the code, or can be implied by language constructs. These limit the choices the runtime is allowed to make. 
    2.120 + The first visualization is what we refer to as a Unit \& Constraint Collection, or UCC. It depicts the constraints on the scheduling decisions that come from the application. Constraints can be dependencies stated explicitly in the code, or can be implied by language constructs. These limit the choices the runtime is allowed to make. We use the more general term ``constraint'' instead of the more common ``dependency'' because a dependency implies an order, while we also want to include non-ordering constraints such as mutual exclusion.
   2.121  
   2.122 -The second visualization we call  the scheduling consequence graph (SCG), or just consequence graph (CG). It depicts the scheduling operations performed by the runtime, and the consequent usage of the cores. 
   2.123 +The second visualization we call the scheduling consequence graph (SCG), or just consequence graph. It depicts the scheduling operations performed by the runtime, and the consequent usage of the cores. 
   2.124  
   2.125  The UCC shows only application-derived information, as opposed to the consequence graph, which combines the \textit{use} of the UCC-depicted constraints along with runtime-imposed dependencies and hardware-imposed constraints.  Hence, the UCC states the degrees of freedom enabled by the application, while the consequence graph states how those were made use of, by a particular runtime on particular hardware.
   2.126  
   2.127 @@ -214,7 +214,7 @@
   2.128  
   2.129  In general, lines may also be drawn that represent other kinds of interactions, which affect core usage. For example,  our runtime implementation only allows one core at a time to access shared  scheduling state. Visualization of this can be turned on, as additional lines linking the gray runtime regions of blocks (visualization of such interactions is turned off in this paper). 
   2.130  
   2.131 -Two work-unit blocks   that appear in sequence and have no lines drawn to them often have a causal dependency between them, due to the semantics of the base language. Visualization of these dependencies is also turned off, but can be inferred via the link to the code.
   2.132 +Two work-unit blocks that appear in sequence and have no lines drawn between them often have a causal dependency, due to the semantics of the base language. Visualization of these dependencies is also turned off, but can be inferred via the link to the code.
   2.133  
    2.134  Note that many different orderings can be validly chosen. Which scheduler choices are valid is determined by three kinds of constraints: constraints from the application code, from the hardware, and from the runtime implementation.
   2.135  
   2.136 @@ -272,10 +272,10 @@
   2.137  
    2.138  The scheduling consequence graphs for the measurements collected on the various runs during tuning are shown in Fig. \ref{story}. All use the same scale, for direct comparison. They have 40 columns, one for each core, and relative height indicates relative execution time. The lines in red, orange, and green represent application-code constructs. Red is creation of a virtual processor, green is the many-to-one \texttt{send\_of\_type\_to}, and orange is the singleton construct. For better visibility, only constraints that cross cores are shown.
   2.139  
   2.140 -After functional debugging, the first tuning run produces the consequence graph seen in Fig. \ref{fig:story:a}. The first thing to notice is that it is slimmer than expected: of the 40 available cores, only 13 are being used. Because the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection reveals that a closing curly brace in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed despite using this application as a test program for development of the language runtime (while using traditional thread-based performance tools).
   2.141 +After functional debugging, the first tuning run produces the consequence graph seen in Fig. \ref{fig:story:a}. The first thing to notice is that it is slimmer than expected: of the 40 available cores, only 13 were being used. Because the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection revealed that a closing curly brace in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed despite using this application as a test program for development of the language runtime (while using traditional thread-based performance tools).
   2.142  
   2.143  \subsubsection{Second Run}
   2.144 - After fixing this, the next run (Fig. \ref{story:b}) corresponds much more to the expected execution behaviour. However, there remains a noticeable section at the beginning where only 3 cores have work and the other 37 remain idle.
    2.145 +After fixing this, the next run (Fig. \ref{story:b}) corresponded much more closely to the expected execution behavior. However, there remained a noticeable section at the beginning where only 3 cores had work and the other 37 remained idle.
   2.146  
   2.147  Zooming in on those cores, we see that creation code starts running on core 0, within the creation VP, and then the next block on the core is work. Creation stops, starving the other cores. Looking at the creation code, we see that the creation VP assigns the first work VP to its own core, so that work is now waiting in the queue to execute there. When it creates the second work VP, that creation call switches core 0 to the runtime. When done with creation, the runtime takes the next VP from the queue, which is that waiting work VP. Hence core 0 does the work next instead of continuing with creation (the merits of work stealing or other scheduling strategies are independent from this illustration of how to use this approach to performance tune).
   2.148  
   2.149 @@ -283,38 +283,34 @@
   2.150  
   2.151  Two solutions come to mind: assign work to the other cores first, so that they would be busy when the creator VP's core switches to work, or else dedicate a core to the creator VP. The first solution has the advantage of preserving performance of the application even when run on a machine with a single-digit number of cores, so we tried it first. 
   2.152  
   2.153 -\subsubsection{Third run} Assigning work to the other cores first gives us Fig. \ref{story:c}. The section that was at the top, with idle cores, has disappeared. However, an idle period can still be observed between the first and the second set of work units. The work units have roughly the same length and  the work on core 0 starts last, so finishing the core 0 work holds up creation until after the other cores have finished their work (note that work on some cores takes slightly longer because that core performs the copy-transpose singleton, and also variations are caused by cache misses).
   2.154 +\subsubsection{Third run} Assigning work to the other cores first gave us Fig. \ref{story:c}. The section that was at the top, with idle cores, has disappeared. However, an idle period can still be observed between the first and the second set of work units. The work units have roughly the same length and the work on core 0 starts last, so finishing the core 0 work holds up creation until after the other cores have finished their work (note that work on some cores takes slightly longer because that core performs the copy-transpose singleton, and also variations are caused by cache misses).
   2.155  
   2.156 -  It is also noticeable that in the second set of work units, not enough work pieces remain to fill all cores. 16 out of 40 remain idle at the bottom.
   2.157 +It is also noticeable that in the second set of work units, not enough work pieces remained to fill all cores. 16 out of 40 remained idle at the bottom.
   2.158  
   2.159 -\subsubsection{Fourth and fifth runs} To try to fill the empty columns at the end, we modified the size of the work units. However, as figures \ref{story:d} and \ref{story:e} show, this did not help. The blank areas between ``bands'' of work can be seen by the red lines to be due to  creation. The increased number of units causes creation to be the bottleneck again, and the time lost between sets grows larger than the time that previously was lost.
    2.160 +\subsubsection{Fourth and fifth runs} To try to fill the empty columns at the end, we modified the size of the work units via the divide factor. However, as figures \ref{story:d} and \ref{story:e} show, this did not help. The red lines show that the blank areas between ``bands'' of work are due to creation. The increased number of units causes creation to be the bottleneck again, and the time lost between sets grows larger than the time lost previously.
   2.161  
   2.162  \subsubsection{Sixth run} At this point we wanted to try the road not chosen, dedicating a core to the creation VP.
    2.163  Going back to version b of the code and implementing this solution instead led to Fig. \ref{story:f}. The blank area between the two sets disappeared, showing a 4\% shorter execution time.
   2.164  
   2.165 -\subsubsection{Seventh and eighth runs}Core 0 is now empty after the creation phase at the beginning, so we  moved the receive VP there (Fig. \ref{story:g}). This only minimally improved the time, at this size of work unit, but allows overlapping the result collection with other work, which is an advantage when cutting  the work into more pieces, requiring longer collection as in Fig. \ref{story:h}.
   2.166 -
   2.167 -Overall it is also noticeable that as work units become smaller, execution time becomes more irregular. Variability in work length correlates with the color, indicating cache behavior has worsened with smaller work size.
   2.168 -
   2.169 -%Note that the hypothesis that cache behavior worsened with smaller work sizes, was generated directly from visual cues.    
    2.170 +\subsubsection{Seventh and eighth runs} Core 0 was now idle after the creation phase at the beginning, so we moved the receive VP there (Fig. \ref{story:g}). This only minimally improved the time at this size of work unit, but it allows overlapping the result collection with other work, which is an advantage when cutting the work into more pieces, requiring longer collection, as in Fig. \ref{story:h}. Overall, it is also noticeable that as work units become smaller, execution time becomes more irregular.   
   2.171  
   2.172  \subsubsection{Holes in the core usage}
   2.173  
   2.174 -In Fig \ref{story:d}, ``holes'' are noticeable. Inspecting these holes closer, we can see that the stalled blocks are at the end of orange lines. This indicates they are waiting upon the completion of a singleton. The pattern of blocks shows that usually  the singleton unit runs before the work unit, but in these cases the singleton code was delayed until after the work on that core.  This is a runtime implementation fluke. The only thing an application programmer can do is change the work size to minimize the impact. (For those curious, the first VP to reach the singleton is granted control, but a ready work VP lands in the queue during the granting activity, so when the runtime finishes granting, the work VP is next, and the VP that now owns the singleton sits and waits for the work to end. All work VPs on other cores that pass through the same singleton  also wait.)  
    2.175 +In Fig. \ref{story:d}, ``holes'' are noticeable. Inspecting these holes more closely, we can see that the stalled blocks are at the end of orange lines. This indicates they were waiting upon the completion of a singleton. The pattern of blocks shows that usually the singleton unit runs before the work unit, but in these cases the singleton code was delayed until after the work on that core. This is a runtime implementation fluke. The only thing an application programmer can do is change the work size to minimize the impact. (For those curious, the first VP to reach the singleton is granted control, but a ready work VP lands in the queue during the granting activity, so when the runtime finishes granting, the work VP is next, and the VP that now owns the singleton sits and waits for the work to end. All work VPs on other cores that pass through the same singleton also wait.)
   2.176  
   2.177  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   2.178  \section{The Model Behind the Visualization}
   2.179  \label{sec:theory}
    2.180 Now that the visualizations have been seen in use, we expand on the model behind them. The model ties the information together, and understanding it helps in generating hypotheses from the visualization features. 
   2.181  
   2.182 -As seen, the model has two parts, a \emph{Unit \&\ Constraint Collection (UCC)}, and a \emph{Scheduling Consequence Graph} (SCG or just consequence graph,  CG).  The UCC indicates the scheduling  choices  the application allows, and so shows what the programmer has control over. The consequence graph says which of those were actually taken during the run and the consequences of that set of choices.
   2.183 +As seen, the model has two parts, a \emph{Unit \&\ Constraint Collection (UCC)}, and a \emph{Scheduling Consequence Graph} (SCG or just consequence graph).  The UCC indicates the scheduling  choices  the application allows, and so shows what the programmer has control over. The consequence graph says which of those were actually taken during the run and the consequences of that set of choices.
   2.184  
    2.185 We give a more precise description of the UCC, then of the consequence graph, in turn.
    2.186 However, space is too limited for a complete definition, which is given in a companion paper submitted to a longer-format venue.
   2.187  \subsection{Unit \& Constraint Collection}
   2.188  The UCC contains all the units of work that get scheduled during a run, and all constraints the application places on scheduling those units. This is the simple definition, but unfortunately, this information is not always easy to obtain. The complication is that different classes of application exist, with two degrees of freedom that determine how much of the UCC is actually defined in the application vs the input data, or even in the runtime.
   2.189  
   2.190 -Some applications have everything determined in the code, with all units fixed, and all constraints fixed. An example is matrix multiply with fixed size matrices.  But for others, the shape of the UCC is only partially defined by the application code.  Take matrix multiply used in Section \ref{sec:casestudy}, where an input parameter determines the number of units created. Here, the UCC is different for each parameter value. An extreme example is an NP complete problem, with redividable units, for which the units are a function of both the  input data \emph{and} decisions made by the runtime.
   2.191 +Some applications have everything determined in the code, with all units fixed, and all constraints fixed. An example is matrix multiply with fixed size matrices.  But for others, the shape of the UCC is only partially defined by the application code.  Take matrix multiply used in Section \ref{sec:casestudy}, where an input parameter determines the number of units created. Here, the UCC is different for each parameter value. An extreme example is a parallel search through a decision tree with pruning and redistribution of search space, for which the units are a function of both the input data \emph{and} decisions made by the runtime.
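
As a minimal illustration of what a concrete UCC contains, the sketch below shows one possible in-memory representation; the types and names are our own hypothetical choices, not the data structures of the instrumentation.

\begin{verbatim}
/* Sketch only: one possible representation of a concrete UCC. */
typedef struct { int id; /* plus a link to the code location */ } Unit;

typedef enum {
    ORDERING_DEP,   /* unit a must complete before unit b       */
    MUTUAL_EXCL     /* units a and b may not run simultaneously */
} ConstraintKind;

typedef struct { ConstraintKind kind; Unit *a, *b; } Constraint;

typedef struct {
    Unit       *units;       int numUnits;
    Constraint *constraints; int numConstraints;
} UCC;
\end{verbatim}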
   2.192  
   2.193  \begin{figure}[ht]
   2.194    \centering
   2.195 @@ -343,7 +339,7 @@
   2.196  
   2.197  For example, specialization may perform a static scheduling, which fixes the units, moving the UCC towards the origin. Alternatively, the toolchain may inject manipulator code for the runtime to use, which lets it divide units during the run when it needs more. The injection of manipulator code makes the UCC less concrete, moving it further from the origin.
   2.198  
   2.199 -The progression of UCCs generated during the life-line indicate what entity has the control over which units and constraints appear in the final concrete UCC. Viewing the progression gives insight into what is inside the application programmer's control vs under control of each tool in the toolchain or  the runtime. For example, the original application-derived UCC  indicates what can be done statically: the further out on the diagonal that UCC is, the less scheduling can be done statically in the toolchain.
   2.200 +The progression of UCCs generated during the life-line indicates what entity has the control over which units and constraints appear in the final concrete UCC. Viewing the progression gives insight into what is inside the application programmer's control vs under control of each tool in the toolchain or  the runtime. For example, the original application-derived UCC  indicates what can be done statically: the further out on the diagonal that UCC is, the less scheduling can be done statically in the toolchain.
   2.201  
   2.202  In this paper, we do not suggest how to represent UCCs far out on the diagonal. One of those actually indicates a multi-verse of concrete-UCCs. Which of them materializes  depends on the data and what the scheduler decides. We only represent the concrete UCC that materializes during a run and leave the question of representing less concrete ones to future work.
   2.203  
   2.204 @@ -359,7 +355,7 @@
   2.205  
   2.206  The core activities of the region types are: application work, idling while waiting for communication of work data, managing constraints, choosing assignment of work onto cores, and runtime internals. The runtime internals in turn have sub-categories, which we will not detail here, but some of which will be mentioned in Section \ref{sec:Implementation}.
   2.207  
   2.208 -There is an arc-type for each category of causality. The arc-types (causality categories) represent: sequential dependency in the base language being satisfied, parallel constraint being satisfied (i.e., one unit did something to satisfy a constraint on the other, causing it to be free to be scheduled), runtime internal causality such as a global lock (runtime on one core releases the lock, causing the other to acquire it), and arcs that represent hardware causal relationships (one work-unit finishing on a core causes another work-unit to start there, given the choice by the runtime). The formal details are given in the longer format companion paper.
   2.209 +There is an arc-type for each category of causality. The arc-types (causality categories) represent: sequential dependency in the base language being satisfied, parallel constraint being satisfied (i.e., one unit did something to satisfy a constraint on the other, causing it to be free to be scheduled), runtime internal causality such as a global lock (runtime on one core releases the lock, causing the other to acquire it), and arcs that represent hardware causal relationships (one work-unit finishing on a core causes another work-unit to start there, given the choice by the runtime).
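
As a sketch (with our own naming, not the instrumentation's actual types), these causality categories could be captured as arc labels:

\begin{verbatim}
/* Sketch: the four causality categories that label SCG arcs. */
typedef enum {
    ARC_SEQUENTIAL,       /* base-language sequential dependency satisfied */
    ARC_CONSTRAINT,       /* parallel constraint satisfied, unit freed     */
    ARC_RUNTIME_INTERNAL, /* e.g., global scheduling lock handed over      */
    ARC_HARDWARE          /* core freed, chosen next unit starts there     */
} ArcType;
\end{verbatim}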
   2.210  
   2.211  We will now look at each source of causal relationship.
   2.212  
   2.213 @@ -367,7 +363,7 @@
   2.214  
   2.215  Sequential dependencies may add superfluous constraints that  eliminate some otherwise allowed choices in the UCC. An example would be a \texttt{for} loop that creates work-units -- no parallelism constructs cause the creations to be done in sequence, but the base C language sequentializes it nonetheless. 
   2.216  
   2.217 -\paragraph{Runtime internal causality} Runtime implementation details may introduce ``extra" causalities between units. For example, the version of VMS we instrumented for this paper runs separately on each core and relies upon a global lock for accessing shared runtime information. This lock introduces a causal relationship  when the runtime on one core is attempting to process one unit, but must wait for the runtime on a different core to finish with its unit.
   2.218 +\paragraph{Runtime internal causality} Runtime implementation details may introduce ``extra" causalities between units. For example, the runtime we instrumented for this paper runs separately on each core and relies upon a global lock for accessing shared runtime information. This lock introduces a causal relationship  when the runtime on one core is attempting to process one unit, but must wait for the runtime on a different core to finish with its unit.
   2.219  
   2.220   Normally, these are not displayed explicitly, due to clutter, but can be turned on when needed, for instance to determine the cause of a particular pattern of core usage.
   2.221  
   2.222 @@ -381,9 +377,9 @@
   2.223  There is one last twist to the story of UCCs and consequence graphs, which is that there are levels of them that correspond to the levels of scheduler in a hierarchical machine. We use an example involving a server machine with a hierarchy of runtimes to illustrate both, concentrating first on just the UCCs, then adding the consequence graph. 
   2.224  
   2.225  \subsubsection{Levels of UCC}
   2.226 -For the example, consider  a server with one rack,  having a back-plane that boards plug into. A board has its own memory with four sockets, each having a chip with four cores. So there is a back-plane network connecting the boards,  a bus on each board that connects the sockets to the DRAM, and inside the chip in each socket is a cache hierarchy that connects the cores.
   2.227 +As an example, consider a server with one rack, having a back-plane that boards plug into. A board has its own memory with four sockets, each having a chip with four cores. So there is a back-plane network connecting the boards, a bus on each board that connects the sockets to the DRAM, and inside the chip in each socket is a cache hierarchy that connects the cores.
   2.228  
   2.229 -The hardware is given a set of runtimes to match the hierarchy. Each network or bus has a runtime that schedules work onto the things connected below it. So the top runtime divides work among the boards, while a board's  runtime  divides work among the sockets, and a socket's  runtime  divides work among the cores.  
   2.230 +The hardware is given a set of runtimes to match the hierarchy. Each network or bus has a runtime that schedules work onto the things connected below it. So the top runtime divides work among the boards, while a board's runtime divides work among the sockets, and a socket's  runtime divides work among the cores.  
   2.231  
   2.232  To a runtime high up, each runtime below it looks like a complete machine. It schedules work-units to those machines, without knowing the internal details of how that machine is implemented. So the runtime at the top handles very large work-units that it schedules onto the runtimes on the boards. A board-level runtime  divides up the work-unit it gets into smaller work-units, then schedules one onto each socket's runtime, and so on.
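
The recursion can be pictured with a small sketch (hypothetical types of our own): each runtime schedules work onto child ``machines'', each of which is either a bare core or another runtime.

\begin{verbatim}
/* Sketch: a hierarchy of runtimes, each seeing its children as
   complete machines onto which it schedules work-units. */
typedef struct Machine {
    struct Runtime *runtime;   /* NULL if this machine is a bare core */
} Machine;

typedef struct Runtime {
    Machine *children;         /* boards, sockets, or cores below */
    int      numChildren;
} Runtime;
\end{verbatim}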
   2.233  
   2.234 @@ -397,29 +393,27 @@
   2.235    \label{fig:UCC_Levels}
   2.236  \end{figure}
   2.237  
   2.238 -Fig. \ref{fig:UCC_Levels} shows that these UCCs are related to each other in the same hierarchy as the runtimes. A unit scheduled by one runtime is in turn broken into sub-units that exist at a lower level. Each of those units is  then separately scheduled by the lower-level runtime, according to a UCC just for them. So, as the figure shows,  a unit in one UCC has an entire lower-level UCC inside it.  
   2.239 -
   2.240 - %Great, that makes sense, now what about the consequence graphs?
   2.241 +Fig. \ref{fig:UCC_Levels} shows that these UCCs are related to each other in the same hierarchy as the runtimes. A unit scheduled by one runtime is in turn broken into sub-units that exist at a lower level. Each of those units is  then separately scheduled by the lower-level runtime, according to a UCC just for them. So, as the figure shows,  a unit in one UCC has an entire lower-level UCC inside it.
   2.242  
   2.243  \subsubsection{Levels of Consequence Graph}
   2.244  With multiple levels of scheduling activities, a scheduling consequence graph can be constructed for each level. 
   2.245  
   2.246  For a given level, a lower-level runtime is seen as a single processor entity. So, what is the consequence of deciding to schedule a given work unit onto one of those  processor entities? When the goal is  improving performance, the consequence of interest is the time taken until completion of the work unit.
   2.247  
   2.248 -Hence, the question becomes,  when can the upper-level work unit be considered completed? The answer is, when all the lower-level units of which it is comprised are completed, plus any additional work spent dividing up the work and combining results. Once the lower-level consequence graph is established, this  time can be easily determined: it is the critical path across the lower-level consequence graph plus the additional, non-overlapped, time spent creating the lower-level units and combining results into the format of the higher-level unit. The divide and recombine are the creation of    the lower level, and so not part of it.     
    2.249 +Hence, the question becomes, when can the upper-level work unit be considered completed? When all the lower-level units that comprise it are completed, plus any additional work spent dividing up the work and combining results. Once the lower-level consequence graph is established, this time can be easily determined: it is the critical path across the lower-level consequence graph plus the additional, non-overlapped, time spent creating the lower-level units and combining results into the format of the higher-level unit. The divide and recombine are the creation of the lower level, and so not part of it.
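
Stated as a formula, in our own notation, the completion time $T(u)$ of an upper-level unit $u$ is
\[
T(u) = T^{\mathrm{novl}}_{\mathrm{divide}} + \mathit{CP}\big(\mathit{SCG}_{\mathrm{lower}}(u)\big) + T^{\mathrm{novl}}_{\mathrm{recombine}},
\]
where $\mathit{CP}$ denotes the critical path through the lower-level consequence graph and the $\mathrm{novl}$ superscripts mark the non-overlapped portions of the divide and recombine work.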
   2.250  
   2.251  Consider the concrete example of the SSR matrix multiply application from Section \ref{sec:casestudy}. To start we go up a level, above the process created to execute the matrix multiply. Here, a unit is the execution of an entire application, invoked via OS commands. The  work-time of that unit is the execution time of the application, from the first instruction of the application code until the last. This time equals  the critical path through the matrix multiply work, plus creation of the various VPs, and collection of the results by the results VP. 
   2.252  
   2.253  If the case study had been performed in hierarchical fashion, then the SCG seen there would be divided between levels, with creation and results accumulation  moved to the higher application-level SCG. The lower-level would only contain the work of the multiply. 
   2.254  
   2.255 -One twist is that the divide and recombine are themselves composed of  smaller work-units.  However, they are not visible to units in the work portion, although there are causalities between them. The divide units end with creation of work-level units, while the end of work-level units causes creation of recombine units.
   2.256 -Hence, divide and recombine are  inside of  the higher-level unit while not part of the lower-level SCG. They can be conveniently displayed in a single visualization because of the clean connection between levels provided by the create and end causalities.   
   2.257 +Note also that the divide and recombine are themselves composed of smaller work-units.  However, they are not visible to units in the work portion, although there are causalities between them. The divide units end with creation of work-level units, while the end of work-level units causes creation of recombine units.
   2.258 +Hence, divide and recombine are inside of the higher-level unit while not part of the lower-level SCG. They can be conveniently displayed in a single visualization because of the clean connection between levels provided by the create and end causalities.   
   2.259  
   2.260 -To complete the discussion, we consider going down another level, into the physical processor.  In modern super-scalar multi-cores, each core is a parallel processor composed of functional blocks such as arithmetic or floating-point blocks, along with logic that issues instructions to those blocks. We first break down a single  SSR work unit of the kind seen in Section \ref{sec:casestudy}, into smaller work units, namely individual instructions. Then we break instructions down into pipeline stages.
   2.261 +To complete the discussion, we consider going down another level, into the physical processor.  In modern super-scalar multi-cores, each core is a parallel processor composed of functional blocks such as arithmetic or floating-point blocks, along with logic that issues instructions to those blocks. We first break down a single SSR work unit of the kind seen in Section \ref{sec:casestudy} into smaller work units, namely individual instructions. Then we break instructions down into pipeline stages.
   2.262  
   2.263 -The shape of the UCC for the instructions  inside a single SSR level work-unit is a straight line of sequential dependencies. However, the pipeline stages inside a core modify the UCC. They analyse the register use dependencies between instructions and generate a new UCC consisting of  functions constrained by the data dependencies. The dependencies are enforced by the issue logic, which  goes on to assign ready functions to the functional blocks.  The choices made by the issue logic  determine the shape of the SCG for this level, and the critical path of that SCG is the work time of the higher, SSR level, work-unit.
    2.264 +The shape of the UCC for the instructions inside a single SSR-level work-unit is a straight line of sequential dependencies. However, the pipeline stages inside a core modify the UCC. They analyze the register-use dependencies between instructions and generate a new UCC consisting of functions constrained by the data dependencies. The dependencies are enforced by the issue logic, which goes on to assign ready functions to the functional blocks. The choices made by the issue logic determine the shape of the SCG for this level, and the critical path of that SCG is the work time of the higher, SSR-level, work-unit.
   2.265  
   2.266 -The pipestages spent performing fetch, decode, analysis,  issue, and retirement correspond to functions in a runtime.  Fetch corresponds to creation of the instruction-level work-unit. Decode and analysis logic generate the new UCC, as part of the runtime,  Issue logic  performs  the scheduling function of a runtime. Retirement is also part of enforcing the dependencies, and so is  also part of the runtime's scheduler.  Most of the overhead of these pipestages is overlapped by the pipeline effect. but  does still contribute to the critical path during pipeline disruptions like mis-predicted branches. It is counted as non-overlapped runtime overhead.   
   2.267 +The stages spent performing fetch, decode, analysis, issue, and retirement correspond to functions in a runtime. Fetch corresponds to creation of the instruction-level work-unit. Decode and analysis logic generate the new UCC, as part of the runtime. Issue logic performs the scheduling function of a runtime. Retirement is also part of enforcing the dependencies, and so is also part of the runtime's scheduler. Most of the overhead of these stages is overlapped by the pipeline effect, but can contribute to the critical path during pipeline disruptions like mis-predicted branches. It is counted as non-overlapped runtime overhead.   
   2.268  
   2.269  %To give insight into going from the SSR work-unit level up to the application-level, let us now go from the instruction level up to the SSR unit level, `looking up' so to speak. Considering overhead, from inside the core,  the SSR level overhead looks just like more instructions. 
   2.270  
   2.271 @@ -460,7 +454,7 @@
   2.272  
   2.273  Every unit has a meta-unit that represents it in the runtime. A  unit is defined as the work that exists after leaving the runtime, up until re-entering it. For example, the trace of instructions on a core, from the point of leaving the runtime up until the next invocation. Looking at this in more detail, every runtime has some form of internal bookkeeping state for a unit, used to track constraints on it and make decisions about when and where to execute. This exists even if that state is just a pointer to a function that sits in a queue. We call this bookkeeping state for a unit the meta-unit.
   2.274  
   2.275 -Each  unit also has a life-line, which progresses so:  creation of the meta-unit \pointer , state updates that affect constraints on the unit \pointer,   the decision is made to animate the unit  \pointer, movement of the meta-unit plus data to physical resources that do the animation \pointer  , animation of the unit, which does the work \pointer,  communication of state-update, that unit has completed, and hardware is free \pointer ,  constraint updates within runtime, possibly causing new meta-unit creations or freeing other meta-units to be chosen for animation.  This repeats for each unit. Each step is part of the model.
    2.276 +Each unit also has a life-line, which progresses so: creation of the meta-unit~\pointer~state updates that affect constraints on the unit~\pointer~the decision is made to animate the unit~\pointer~movement of the meta-unit plus data to the physical resources that do the animation~\pointer~animation of the unit, which does the work~\pointer~communication of the state update that the unit has completed and the hardware is free~\pointer~constraint updates within the runtime, possibly causing new meta-unit creations or freeing other meta-units to be chosen for animation. This repeats for each unit. Each step is part of the model.
   2.277  
   2.278   Note a few implications: first, many activities internal to the runtime are part of a unit's life-line, and take place when only the meta-unit exists, before or after the work of the actual unit; second, communication that is internal to the runtime is part of the unit life-line, such as state updates; third, creation may be implied, such as in pthreads, or triggered such as in dataflow, or be by explicit command such as in StarSs. Once created, a meta-unit may languish before the unit it represents is free to be animated.
   2.279  
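As an illustration of this bookkeeping, the sketch below gives one hypothetical form of a meta-unit, with the life-line steps as states. Every runtime has its own concrete variant of this state; the names here are ours.

\begin{verbatim}
/* Sketch: hypothetical meta-unit bookkeeping, following the life-line. */
typedef enum {
    ML_CREATED,     /* meta-unit exists; unit not yet free to run     */
    ML_READY,       /* constraint updates freed it for animation      */
    ML_ASSIGNED,    /* decision made; meta-unit + data being moved    */
    ML_ANIMATING,   /* the unit's work is executing                   */
    ML_COMPLETED    /* end communicated; constraint updates triggered */
} LifeLineStage;

typedef struct MetaUnit {
    LifeLineStage     stage;
    void            (*workFn)(void *);  /* the work the unit performs    */
    void             *workData;
    struct MetaUnit **waitingOn;        /* constraints not yet satisfied */
    int               numWaitingOn;
} MetaUnit;
\end{verbatim}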
   2.280 @@ -502,6 +496,7 @@
   2.281  
   2.282  What information needs to be collected for SCG and UCC and how it is done depends on the construct:
   2.283  \begin{itemize}
   2.284 +\setlength{\itemsep}{0em}
    2.285 \item create\_VP: We place code into the create\_VP handler, which records the calling VP and unit, along with the newly created unit and the VP it is assigned to (see the sketch after this list). Both the SCG and the UCC draw arcs between the creating unit and the created one.
    2.286 \item send\_from\_to and receive\_from\_to: Code is placed into both handlers at the point that checks whether both rendezvous requests are present. When true, it records the two units and VPs that connected. The UCC and SCG both represent this by two crossing dependencies.
    2.287 \item send\_of\_type\_to and receive\_of\_type\_to: The same code is inserted to record the unit and VPs that connected. This is enough for the SCG. But for the UCC, we want to capture all sending and receiving permutations available, so we add code that collects the group of senders and the corresponding group of receivers.
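
For the create\_VP case, a sketch of the recording code follows; the helper names and record format are our illustration, and only the placement inside the construct handler reflects the approach described above.

\begin{verbatim}
/* Sketch: recording inside the create_VP construct handler. */
void handle_create_VP(VirtProcr *callingVP, CreateRequest *req) {
    VirtProcr *newVP = runtime_create_VP(req->workFn, req->workData);
    /* record calling VP + unit, the newly created unit, and its VP;
       both the SCG and the UCC draw an arc from creator to created */
    record_creation(callingVP, current_unit_of(callingVP),
                    first_unit_of(newVP), newVP);
    /* ...remainder of the normal handler work... */
}
\end{verbatim}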
   2.288 @@ -529,6 +524,7 @@
    2.289 To cover each of those segments of a unit's life-line, code to read the performance counters is inserted at the following points (a sketch of the pattern follows the list):
   2.290  
   2.291  \begin{itemize}
   2.292 +\setlength{\itemsep}{0em}
    2.293 \item Construct handler: To measure 2 and 8, reading is done before and after VMS calls the language-supplied construct handler function.
    2.294 \item Assigner: To measure 3 and 4, reading is done before and after VMS calls the language-supplied assigner function.
    2.295 \item Work: To measure 5 and 6, reading is done inside the VMS switch-to-unit operation and the switch-to-runtime operation.
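
The pattern at each of these points is the same; the sketch below illustrates it for the construct handler. The helper names are hypothetical stand-ins for the actual counter-reading code.

\begin{verbatim}
/* Sketch: bracketing a language-supplied function with counter reads. */
CounterSample before, after;
read_perf_counters(&before);          /* cycles, instructions, misses */
constructHandler(metaUnit, request);  /* language-supplied handler    */
read_perf_counters(&after);
attribute_overhead(metaUnit, &before, &after);  /* segments 2 and 8   */
\end{verbatim}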
   2.296 @@ -565,7 +561,7 @@
   2.297  \section{Conclusion}
   2.298  \label{sec:conclusion}
   2.299  
   2.300 -We have shown how to apply a new, and general, model of parallel computation to build  performance visualizations that simplify identifying instances of performance loss and linking them to details of application code responsible. They rely only on information collected through instrumenting the language runtime, with no modification to the application.
   2.301 +We have shown how to apply a new, general model of parallel computation to build  performance visualizations that simplify identifying instances of performance loss and linking them to details of the application code responsible. They rely only on information collected through instrumenting the language runtime, with no modification to the application.
   2.302  
    2.303  By integrating visual display of constraints due to application structure, language runtime implementation, and hardware features, all relevant causes of performance loss are covered. The semantic information collected allows filtering for the relevant types of constraints, to avoid overloading the display.
   2.304  
   2.305 @@ -691,6 +687,7 @@
   2.306  
   2.307  Additional constraints can be placed on the tasks depending on the construct:
   2.308  \begin{itemize}
   2.309 +\setlength{\itemsep}{0em}
   2.310  \item Create VP: The creation of a new VP creates a simple dependency: the first task in the new VP may only execute after the creating task has finished.
   2.311  \item Simple send and receive: Send to and receive from a specific VP is rendezvous based, so that the units following the communication in both VPs can only execute after the units preceding the rendezvous point in both VPs have finished. This can easily be represented by two crossing dependencies. These are deterministic, so the record is the same for the UCC or the SCG.
   2.312  \item Typed send and receive: Typed send/receive is also rendezvous based, but contrary to simple send/receive, the pairing of sender and receiver is not deterministic. For the SCG, which represents a specific run, the actual communications observed can be recorded in the same way as simple send/receive, but for the UCC, we want to capture all sending and receiving permutations available. In this case, since the construct specifies no further constraints beyond the type of the message, we simply record for each type a group of senders and a group of receivers.
   2.313 @@ -705,6 +702,7 @@
   2.314  
   2.315  We use hardware performance counters to measure for each unit the work (=time spent in application code) and the overhead (=time spent in runtime) in cycles and in instructions, as well as the number of last level cache misses during the work. The counter values are retrieved in the runtime before and after several sections in the lifetime of a unit: 
   2.316  \begin{itemize}
   2.317 +\setlength{\itemsep}{0em}
   2.318  \item Assign: Deciding which unit to execute next
   2.319  \item Work: Doing the work of the unit
   2.320  \item Invocation: Switching back from the work to the runtime