changeset 98:6b354d9aefb5

PPerf Tune -- candidate for final 2nd version
author Sean Halle <seanhalle@yahoo.com>
date Mon, 31 Dec 2012 16:07:54 -0800
parents 3f338effbfd9
children 9b9a2497757d
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 0__Papers/Holistic_Model/Perf_Tune/latex/bib_for_papers_12_Jy_15.bib
diffstat 2 files changed, 204 insertions(+), 122 deletions(-)
line diff
     1.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Mon Dec 10 05:51:06 2012 -0800
     1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Mon Dec 31 16:07:54 2012 -0800
     1.3 @@ -49,7 +49,7 @@
     1.4  
     1.5  
     1.6  %MOIRAI: MOdel for Integrated Runtime Analysis through Instrumentation
     1.7 -\title{An Approach to Augment Performance Tuning via Visualization of Scheduling Decisions}
     1.8 +\title{An Approach to Augment Performance Tuning\\ via Visualization of Causality Chains}
     1.9  
    1.10  %\authorinfo{Nina Engelhardt}
    1.11  %           {TU Berlin}
    1.12 @@ -71,8 +71,8 @@
    1.13  
    1.14  
    1.15  \begin{abstract}
    1.16 -Performance tuning is an important aspect of parallel programming that involves understanding both sequential behavior inside work-units and scheduling behavior that maps those units onto hardware. Many good tools exist for identifying hot-spots in code, where the code inside scheduled units needs to be adjusted.  These tools help to tune data layout, and to find  constructs that spend a long time blocked. However, they provide less help in understanding  why  constructs block for so long, and why no other work fills in.   The answer      often involves complex chain-reactions of scheduling decisions, and finding these  in turn requires seeing the patterns of constraints the application has placed on scheduling and how the runtime uses them.  We propose supplementing existing tools with an additional tool to identify these chain reactions and tie the behaviors back to specific spots in the code. To make such a tool, we apply a newly discovered model of the structure of parallel computation, as a guide to gathering both the constraints on scheduling, and the scheduling decisions made by the runtime. The visualizations of these are superficially similar to existing tools, but include details to tie idleness to chains of causal interactions within the runtime and hardware, and then to the responsible parallelism constructs within the code.
    1.17 -We show how to instrument a parallel language or programming model,   with no need to modify the application.  To simplify illustration, we instrumented the runtime of our own  pi-calculus inspired programming model, called SSR, and we walk through a tuning session on a large multi-core machine, which demonstrates the improvements in  generating hypotheses for the causes of idle cores, and how to reduce the idleness. 
    1.18 +Performance tuning is an important aspect of parallel programming that involves understanding both sequential behavior inside work-units and scheduling-related behavior arising from how those units are assigned to hardware resources. Many good tools exist for identifying hot-spots in code and tuning the sequential aspects there, as well as the data layout. They also identify constructs that spend a long time blocked. However, they provide less help in understanding why constructs block for so long, and why no other work fills in. The answer often arises from chains of causal interactions that involve the scheduling choices made, in combination with runtime implementation details and hardware behaviors. Identifying such a chain requires visualizing each source of causality. We propose supplementing existing tools with an additional tool that identifies these chains and ties the behaviors back to specific spots in the code. To make such a tool, we apply a newly discovered model of the structure of parallel computation as a guide to gathering each step in the scheduling and execution process. The resulting visualizations are superficially similar to those of existing tools, but include additional features to identify unexpected behaviors within a chain and tie them to the related parallelism constructs within the code.
    1.19 +We show how to instrument a parallel language or programming model, with no need to modify the application. To simplify illustration, we instrumented the runtime of our own pi-calculus inspired programming model, called SSR, and we walk through a tuning session on a large multi-core machine, which demonstrates the improvements in generating hypotheses for the causes of idle cores and for ways to reduce the idleness. 
    1.20  \end{abstract}
    1.21  
    1.22  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    1.23 @@ -84,29 +84,38 @@
    1.24  
    1.25  
    1.26  
    1.27 -However, once such optimization has made progress, the main area available for performance improvement undergoes a shift towards the scheduling  choices made by the runtime. Existing  tools tend to provide less help here, in understanding the complicated interactions among runtime behavior, scheduling decisions, and consequent hardware behavior. The interactions often give rise to chain reactions that leave cores idle. These involve the constraints stated by the application, the way the work is divided, and the choices made during the run for which unit of work to assign to which hardware resource at which point in time. To tune such behaviors, the chains need to be identified and understood.   
    1.28 +However, once such optimization has made progress, the main area available for performance improvement undergoes a shift towards the scheduling  choices made by the runtime. Existing  tools tend to provide less help here, in understanding the complicated interactions among runtime behavior, scheduling decisions, and consequent hardware behavior. The interactions often give rise to unexpected chain reactions that leave cores idle. These involve the constraints stated by the application, the way the work is divided, and the choices made during the run for which unit of work to assign to which hardware resource at which point in time. To tune such behaviors, the chains need to be identified and understood.   
    1.29  
    1.30  %The level of complexity is reflected by the fact that finding the optimal set of scheduling choices is known to be an  NP hard problem [].   
    1.31  
    1.32 -Although current tools do a good job of identifying which constructs block, and which cores sit idle, they don't help much in deciphering such complex chain reactions. This needs a visualization of the possibilities open to the scheduler and which choices it made.
    1.33 +For example, a unit of work completes, which sends a signal to the runtime to update the state of the unit and the state of the hardware resource it occupied. This in turn causes the runtime to choose a different unit to own that hardware and to send the meta-information for that unit to the hardware. This in turn triggers communication, to move the data consumed by the unit to the hardware. Then the work of the new unit takes place there.
    1.34 +
    1.35 +Any one of these interactions could be individually abnormal, and an unexpected source of performance loss. For example, there may be congestion in the network that causes the data movement to be especially slow for that particular new unit. 
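
To make the chain concrete, the sketch below shows one way such a chain could be recorded as a sequence of time-stamped steps. It is a minimal sketch in C; every name in it is an illustrative assumption, not the API of SSR or any other actual runtime.

\begin{verbatim}
/* Hypothetical step kinds, one per causal interaction in the
   example chain above.  Names are illustrative assumptions. */
enum step_kind {
    STEP_UNIT_END,      /* a unit completes and signals the runtime    */
    STEP_STATE_UPDATE,  /* runtime updates unit and core state         */
    STEP_SCHED_CHOICE,  /* runtime chooses the next unit for the core  */
    STEP_META_SEND,     /* meta-information for that unit sent to core */
    STEP_DATA_MOVE,     /* data consumed by the unit moved to the core */
    STEP_WORK_START     /* work of the new unit begins                 */
};

struct causal_step {
    enum step_kind kind;
    int  unit_id;       /* which work unit                */
    int  core;          /* which hardware resource        */
    long start, end;    /* cycle stamps bounding the step */
};
\end{verbatim}

A tool that records one such step per interaction can later reassemble the chain behind any idle period, including an abnormally long \texttt{STEP\_DATA\_MOVE} caused by network congestion as in the example above.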
    1.36 +
    1.37 +Although current tools do a good job of identifying which constructs block, and which cores sit idle, they don't help much in deciphering such complex chains of causal interactions. Doing so requires a visualization of each step in the causal chain.
    1.38 +
    1.39 +
    1.40  
    1.41      
    1.42  
    1.43  
    1.44 -To help with this particular aspect of tuning, we propose an additional tool that visualizes the constraints, the runtime choices, and interactions among them. Our approach  is based on  a newly discovered model of the structure of parallel computation. It can capture the parallel aspects of a program written in a parallel langauge through instrumentation of its runtime. It augments existing tools, adding illumination of the chains  of scheduling decisions that lead to lost performance and tying the causes back to particular code features. 
    1.45 +To help with this particular aspect of tuning, we propose an additional tool. It collects the causal interactions inside the runtime and hardware, and then visualizes them so that each link in a chain can be traced. In addition, it helps to generate ideas for ways to reduce idleness, by showing the scheduling-related structure of the application.
    1.46  
    1.47 -The theoretical model adds value by establishing a mental framework that simplifies hypothesis generation, by indicating particular quantities to measure that connect the scheduling measurements to each other and to specific  application features.
    1.48 +  
    1.49  
    1.50 -The information used during tuning is collected by the runtime as an application executes, which makes it practical and backwards compatible.  An existing runtime is instrumented, at key points that correspond to features of the computation model. The instrumentation detects the application-defined units of work and constraints on scheduling them, as well as recording the choices made by the runtime's scheduler.    
    1.51 +Our approach  is based on  a newly discovered model of the structure of parallel computation, which identifies each causal step. 
    1.52 +The theoretical model adds value   by indicating particular quantities to measure at specific points in the runtime system and by establishing a mental framework that simplifies hypothesis generation.
    1.53  
    1.54 -The information is used to visually link each unit to the units upon which its execution is conditional. This includes causes due to application constraints, runtime constraints, and hardware constraints. The visual links establish the chain-reactions of scheduling decisions that lead to lost processing opportunity. Semantics attached to the visual features  enable quickly generating the correct hypotheses for the causes of  lost computation opportunities, and quickly narrowing down what can be done, in the application code, to improve performance. In effect, the visualization serves as a map showing idle cores linked back to the section of application code responsible.
    1.55 +The information used during tuning is collected by the runtime as the application executes, which makes it practical and backwards compatible. No modification of applications takes place.
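
As a sketch of what such runtime-side instrumentation might look like, the hypothetical C declarations below mark the kinds of points at which an existing runtime would call into the measurement library; the names are assumptions, and the actual hook points we used for SSR are described in \S \ref{sec:Implementation}.

\begin{verbatim}
/* Hypothetical instrumentation hooks, called from inside an existing
   runtime at points that correspond to features of the computation
   model.  Applications are left untouched; only the runtime calls
   these. */
void instr_unit_created(int unit_id, int creating_unit);
void instr_constraint_stated(int unit_id, int constraint_kind);
void instr_constraint_satisfied(int unit_id, int satisfying_unit);
void instr_unit_assigned(int unit_id, int core, long cycle_stamp);
void instr_unit_ended(int unit_id, long cycle_stamp);
\end{verbatim}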
    1.56 +
    1.57 +The information is used to visually link each unit to the units upon which its execution is conditional, via a chain of causal steps. Semantics attached to the visual features enable quickly generating the correct hypotheses for the causes of lost computation opportunities, and quickly narrowing down what can be done, in the application code, to improve performance. In effect, the visualization serves as a map showing idle cores linked back to the related sections of application code.
    1.58  
    1.59  
    1.60  To simplify seeing the value of the approach,  we walk through a session of tuning scheduling choices, in which we point out the semantics of features in the visualization, what each visual feature implies, and how these implications lead to hypotheses and fixes.   Although the views look similar to current tools, they differ in the semantics of the visual features.
    1.61  
    1.62    
    1.63  
    1.64 -The  contribution of this paper is the way the parallel model has been applied, rather than any aspects of the model itself.  The distinction is important because the space is used to convey the value gained by the  process of applying the model. Only enough of the model is stated to understand  where the value comes from and how to  instrument a runtime to gain it.
    1.65 +We intend the contribution of this paper to be the way the parallel model has been applied, rather than any aspects of the model itself. The distinction is important because the paper's space is used to convey the value gained by the process of applying the model. Only enough of the model is stated, in \S \ref{sec:theory} and \S \ref{sec:Implementation}, to understand where the value comes from and how to instrument a runtime to gain it.
    1.66  
    1.67  This paper is organized as follows:  we introduce the semantics of features of our visualization  by walking through a tuning session in Section \ref{sec:casestudy}, and then expand on the theoretical model behind it in Section \ref{sec:theory}. Section \ref{sec:Implementation} gives  details of instrumenting a runtime to collect data according to the model. We relate the approach to other tuning tools in Section \ref{sec:related} and draw conclusions in Section \ref{sec:conclusion}.
    1.68  
    1.69 @@ -128,9 +137,9 @@
    1.70  
    1.71  We run our example on a machine with 4 sockets by 10 cores each, for a total of 40 physical cores. They are Intel WestmereEx cores running at 3.0GHz, with TurboBoost turned off for reproducibility. 
    1.72  
    1.73 -We wish to tune scheduling on a standard program that the reader  knows well. We chose matrix multiply, with which the reader should be familiar, which allows concentration on the tool without distraction about the application. 
    1.74 +We wish to tune scheduling on a standard program that the reader knows well, so we chose matrix multiply. This allows concentration on the tool without distraction about the application. 
    1.75  
    1.76 -The application is structured as follows: it  creates one entity to divide the work into a number of pieces and  creates another entity for each piece of work. How many pieces is set by the combination of a tuning parameter in the code with the number of cores. The work is distributed across the cores  in a round-robin fashion unless otherwise specified.
    1.77 +The application is structured as follows: it creates one entity to divide the work into a number of pieces and creates another entity for each piece of work. The number of pieces is determined by the combination of a tuning parameter in the code with the number of cores. The work is distributed across the cores in a round-robin fashion, unless otherwise specified.
    1.78  
    1.79  The application also creates an entity that manages the partial results. Each piece of work sends its contribution, which is accumulated into the overall result. The entity that divides waits for the entity that accumulates to signal completion, and then the language runtime shuts down.
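
A minimal C-style sketch of this structure follows. Of the names in it, only \texttt{send\_of\_type\_to} is an actual SSR construct mentioned in this paper; every other name, including the form of \texttt{num\_pieces}, is an assumption made for illustration.

\begin{verbatim}
/* Sketch of the application's structure, under assumed names. */
extern int   num_cores;
extern int   num_pieces(void);   /* combines the tuning parameter
                                    with the core count (assumed) */
extern void *piece_args(int p);
extern void  create_vp(void (*fn)(void *), void *arg, int core);
extern void  send_of_type_to(int type, void *msg, int dest_vp);
extern void  wait_for_completion(void);
extern void  multiply_sub_block(void *piece);

enum { RESULT_TYPE = 1, ACCUM_VP = 0 };

void work_vp(void *piece) {
    multiply_sub_block(piece);                     /* the actual work */
    send_of_type_to(RESULT_TYPE, piece, ACCUM_VP); /* contribute part */
}

void divider_vp(void *arg) {
    int pieces = num_pieces();
    for (int p = 0; p < pieces; p++)               /* round-robin */
        create_vp(work_vp, piece_args(p), p % num_cores);
    wait_for_completion();  /* accumulator signals; runtime shuts down */
}
\end{verbatim}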
    1.80  
    1.81 @@ -149,7 +158,7 @@
    1.82  
    1.83  \subsection{The Visualizations}
    1.84  \label{subsec:visualization_def}
    1.85 - The approach has two kinds of visualization, each corresponds to an aspect of the model. One focuses on just the application, conveying   its scheduling related structure. Its main value is understanding what's possible. The other focuses on how this structure interacts with the runtime and hardware. Its value is displaying the chain reactions and linking each step in the chain back to application code. 
    1.86 + The approach has two kinds of visualization, each of which corresponds to an aspect of the model. One focuses on just the application, conveying its scheduling-related structure. Its main value is understanding what's possible. The other focuses on how this structure interacts with the runtime and hardware. Its value is displaying the causal chains and linking each step in a chain back to application code. 
    1.87  
    1.88   We refer to the first visualization as a Unit \& Constraint Collection, or UCC, which is explained in \S \ref{sec:UCCExpl}. Its elements are the application's units of work and the constraints on scheduling them. Constraints can be explicitly stated in the code, or implied by language constructs. They limit the choices available to the runtime. We use the general term ``constraint'' instead of the specific term ``dependency'' because dependency covers only one pattern: this unit before that one. Constraints are more general, such as a mutual exclusion on scheduling a group of units: any order, but only one at a time.
    1.89  
    1.90 @@ -168,14 +177,14 @@
    1.91    \label{fig:UCC_expl}
    1.92  \end{figure}
    1.93  
    1.94 -Fig. \ref{fig:UCC_expl} shows two alternate versions of a UCC, both stylized for purposes of explanation.  One treats units as all the same size, the other adds a notion of amount of work in a unit. They start  at the top and  progress downwards. Each  unit is placed above the units it constrains, and is connected to them by a line representing the kind of constraint. A unit is placed into the row just below the last of  the units constraining it, in the first available column. However, the rows and columns are just visual conveniences; the only information in the UCC is the units and the constraints among them. In practice, each unit is also labeled with a unique unit ID, from which the code of the unit and the parameters passed to it can be looked up. 
    1.95 +Fig. \ref{fig:UCC_expl} shows two alternate versions of a UCC, both stylized for purposes of explanation.  One treats units as all the same size, the other adds a notion of amount of work in a unit, which helps to see the usable parallelism available. They start  at the top and  progress downwards. Each  unit is placed above the units it constrains, and is connected to them by a line representing the kind of constraint. A unit is placed into the row just below the last of  the units constraining it, in the first available column. However, the rows and columns are just visual conveniences; the only information in the UCC is the units and the constraints among them. In practice, each unit is also labeled with a unique unit ID, from which the code of the unit and the parameters passed to it can be looked up. 
    1.96  
    1.97  What distinguishes the two types of UCC is the size of the units. The left one shows only the logical graph-like structure of the application, while the right weights the height of a unit by the number of instructions executed in it. When information about unit size is available, it can visually indicate the \textit{usable} parallelism available.
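
The placement rule just described is simple enough to state in code. The sketch below is our paraphrase in C; the struct and helper names are assumptions, and the tool's real layout code is not shown in this paper.

\begin{verbatim}
/* Sketch of the UCC layout rule: a unit goes into the row just below
   the last of its constraining units, in the first free column. */
struct unit {
    int row, col;
    int num_constrainers;
    struct unit **constrainers;
};

extern int first_free_column(int row);    /* assumed helper */

void place_unit(struct unit *u) {
    int row = 0;                          /* unconstrained units: row 0 */
    for (int i = 0; i < u->num_constrainers; i++) {
        int below = u->constrainers[i]->row + 1;
        if (below > row)
            row = below;                  /* just below the last one */
    }
    u->row = row;
    u->col = first_free_column(row);      /* first available column */
}
\end{verbatim}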
    1.98  
    1.99  \subsubsection{SCG visualization}  \label{sec:SCGExpl}
   1.100 -The Scheduling Consequence Graph is the main visualization used to detect chains of causal interactions between elements of the system.  For example, a unit may end, which sends a signal to the runtime to update the state of the unit and the state of the hardware resource it occupied. This in turn causes the runtime to choose a different unit to own that hardware and sends the meta-information for that unit to the hardware. This in turn triggers communication of the data consumed by the unit to be sent to the hardware. Then the work of the new unit takes place there.
   1.101 +The Scheduling Consequence Graph is the main visualization used to detect chains of causal interactions between elements of the system. As the example from the introduction indicated, a unit may end, which sends a signal to the runtime to update the state of the unit and the state of the hardware resources it occupied. This causes the runtime to choose a different unit to own that hardware and to send the meta-information for that unit to the hardware. This in turn triggers communication, to send the data consumed by the unit to the hardware. Then the work of the new unit takes place there.
   1.102  
   1.103 -Any one of these interactions could be individually abnormal, and an unexpected source of performance loss. For example, there may be congestion in the network that causes the data movement to be abnormally slow for that particular new unit. The SCG allows quickly seeing where idleness takes place, then turning on visualization of each kind of causal interaction, until the culprit for the unexpected slowdown is seen.  
   1.104 +Any one of these interactions could be individually abnormal, and an unexpected source of performance loss. The SCG allows selectively turning on visualization of each kind of causal interaction, until the culprit for the unexpected slowdown is seen.  
   1.105  \begin{figure}[ht]
   1.106    \centering
   1.107    \includegraphics[width = 2in, height = 1.8in]{../figures/SCG_stylized_for_expl.pdf}
   1.108 @@ -190,17 +199,17 @@
   1.109  
   1.110  The kinds of activities within a block are defined by the computation model that underlies the visualization. The first kind of activity is the actual work, plus  waiting for  cache misses. It is represented by a blue-to-red region where the color indicates  intensity of cache misses, with pure red representing at or above the maximum misses per instruction, and pure blue the minimum (the max and min are set in the tool that generates the visualization).
   1.111  
   1.112 -  The second kind of activity is runtime overhead, represented by a gray region. This is the overhead spent on that particular work-unit. When desired by the user, it is further broken into pieces representing activities inside the runtime. The options include time spent on: constraints, when determining readiness of the work-unit; deciding which ready unit to assign to which hardware; and time spent switching from virtual processor, to the runtime, and back. In this paper, we show all runtime overhead lumped together, however in other circumstances a breakdown can be key to seeing where unexpected slowdown is taking place. 
   1.113 +  The second kind of activity is runtime overhead, represented by a gray region. This is the overhead spent on that particular work-unit. When desired by the user, it is further broken into pieces representing activities inside the runtime. The options include time spent on: constraints, when determining readiness of a work-unit; deciding which ready unit to assign to which hardware; and switching from the virtual processor to the runtime and back. In this paper, we show all runtime overhead lumped together; however, in other circumstances a breakdown can be key to seeing where unexpected slowdown is taking place. 
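
The blue-to-red mapping for work regions can likewise be stated compactly. The sketch below assumes a simple linear interpolation between the tool's min and max settings; the actual tool's mapping may differ.

\begin{verbatim}
/* Sketch of work-region coloring: pure blue at or below the minimum
   misses per instruction, pure red at or above the maximum.  Linear
   interpolation in between is an assumption. */
struct rgb { unsigned char r, g, b; };

struct rgb work_region_color(double mpi, double min_mpi, double max_mpi) {
    double t = (mpi - min_mpi) / (max_mpi - min_mpi);
    if (t < 0.0) t = 0.0;                 /* clamp to pure blue */
    if (t > 1.0) t = 1.0;                 /* clamp to pure red  */
    struct rgb c;
    c.r = (unsigned char)(255.0 * t);
    c.g = 0;
    c.b = (unsigned char)(255.0 * (1.0 - t));
    return c;
}
\end{verbatim}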
   1.114  
   1.115  The other type of visual feature seen in Fig. \ref{fig:SCG_expl} is lines. Each represents a construct that influenced scheduling, where the color indicates what kind of construct.   A line represents two things: a constraint, whose satisfaction made the lower unit ready, and a decision by the runtime to start the lower unit on that core. 
   1.116  
   1.117 -In general, lines may also be drawn that represent other kinds of interactions, which affect core usage. For example,  our runtime implementation only allows one core at a time to access shared  scheduling state. Visualization of this can be turned on, as additional lines linking the gray runtime regions of blocks (visualization of such interactions is turned off in this paper). 
   1.118 +In general, lines may also be drawn that represent other kinds of interactions, which affect core usage. For example,  our runtime implementation only allows one core at a time to access shared  scheduling state. Visualization of this can be turned on, as additional lines linking the gray runtime regions of blocks (visualization of such interactions is turned off in this paper for simplicity). 
   1.119  
   1.120 -Two work-unit blocks that appear in sequence and have no lines drawn between them often have a causal dependency, due to the semantics of the base language. Visualization of these dependencies is also turned off, but can be inferred via the link to the code.
   1.121 +Two work-unit blocks that appear in sequence and have no lines drawn between them often have a causal dependency, due to the semantics of the base language (visualization of these causalities is also turned off, but can be inferred via the link to the code).
   1.122  
   1.123 -Note that  many different orderings can be validly chosen. Which scheduler choices are valid is determined by three kinds of constraints: the application code constraints, hardware constraints, and runtime implementation imposed constraints.
   1.124 +Note that many different orderings can be validly chosen by the runtime. Which scheduler choices are valid is determined by three kinds of constraints: application code constraints, hardware constraints, and constraints imposed by the runtime implementation.
   1.125  
   1.126 -The visual features allow the user to see at a glance the total  execution time (height), idle cores during the run (empty columns), cache behavior (color of work regions), degree of overhead (size of gray regions), and which units constrained which other units (lines). All consequence graphs in this paper are at the same scale, so they can be compared directly.
   1.127 +The visual features allow the user to see at a glance the total  execution time (height), idle cores during the run (empty space), cache behavior (color of work regions), degree of overhead (size of gray regions), and which units constrained which other units (lines). All consequence graphs in this paper are at the same scale, so they can be compared directly.
   1.128  \begin{figure*}[t!]
   1.129    \begin{minipage}[b]{0.25\textwidth}
   1.130          \hfill\subfloat[35.8 Gcycles\\Original]
   1.131 @@ -211,10 +220,10 @@
   1.132    % \subfloat[]
   1.133    %  {\includegraphics[scale=0.015]{../figures/194.pdf} 
   1.134    %  }\quad
   1.135 -   \subfloat[13.0 Gcycles\\ after fixing the load balancer]
   1.136 +   \subfloat[13.0 Gcycles after fixing the load balancer]
   1.137      {\includegraphics[scale=0.015]{../figures/2.pdf} 
   1.138      \label{story:b}}\quad
   1.139 -   \subfloat[11.0 Gcycles\\ after changing so as to put work on core 1 first (solution 1)]
   1.140 +   \subfloat[11.0 Gcycles after changing so as to put work on core 1 first (solution 1)]
   1.141      {\includegraphics[scale=0.015]{../figures/5.pdf} 
   1.142      \label{story:c}}\quad
   1.143    % \subfloat[S1+divide factor 0.2]
   1.144 @@ -223,19 +232,19 @@
   1.145    % \subfloat[S1+divide factor 0.3]
   1.146    %  {\includegraphics[scale=0.015]{../figures/209.pdf} 
   1.147    %  }\quad
   1.148 -   \subfloat[10.9 Gcycles\\ after also changing the divide factor from 0.6 to  0.5]
   1.149 +   \subfloat[10.9 Gcycles after also changing the divide factor from 0.6 to  0.5]
   1.150      {\includegraphics[scale=0.015]{../figures/6.pdf} 
   1.151      \label{story:d}}\quad
   1.152 -   \subfloat[15.6 Gcycles\\ after further changing the divide factor to 0.4]
   1.153 +   \subfloat[15.6 Gcycles after further changing the divide factor to 0.4]
   1.154      {\includegraphics[scale=0.015]{../figures/7.pdf} 
   1.155      \label{story:e}}\quad\\
   1.156 -   \subfloat[10.4 Gcycles\\ after going back to put divider VP onto its own core (Solution~2)]
   1.157 +   \subfloat[10.4 Gcycles after going back to put divider VP onto its own core (Solution~2)]
   1.158      {\includegraphics[scale=0.015]{../figures/12.pdf} 
   1.159      \label{story:f}}\quad
   1.160 -   \subfloat[10.3 Gcycles\\ after then moving the receive VP to the same core as the divider VP]
   1.161 +   \subfloat[10.3 Gcycles after then moving the receive VP to the same core as the divider VP]
   1.162      {\includegraphics[scale=0.015]{../figures/10.pdf} 
   1.163      \label{story:g}}\quad
   1.164 -   \subfloat[9.7 Gcycles\\ after additionally changing the divide factor to 0.4]
   1.165 +   \subfloat[9.7 Gcycles after additionally changing the divide factor to 0.4]
   1.166      {\includegraphics[scale=0.015]{../figures/15.pdf} 
   1.167      \label{story:h}}\quad
   1.168    % \subfloat[S2+divide factor 0.3]
   1.169 @@ -245,54 +254,62 @@
   1.170    %  {\includegraphics[scale=0.015]{../figures/194.pdf} 
   1.171    %  }\quad
   1.172    \end{minipage}
   1.173 -\caption{Performance tuning with Scheduling Consequence Graphs (all figures drawn to the same scale)  }
   1.174 +\caption{Performance tuning with Scheduling Consequence Graphs (all figures drawn to the same scale). Gcycles means billions of cycles.  }
   1.175  \label{story}
   1.176  \end{figure*}
   1.177  
   1.178  \subsection{Walk-through}
   1.179  \label{subsec:walk-through}
   1.180  
   1.181 -The scheduling consequence graphs that display the measurements collected on various runs during tuning are displayed in Fig. \ref{story}. All use the same scale, for direct comparison. They have 40 columns, one for each core, and relative height indicates relative execution time. The lines in red, orange, and green represent application-code constructs. Red is creation of a virtual processor, green is the many-to-one \texttt{send\_of\_type\_to}, and orange is the singleton construct. For better visibility, only constraints that cross cores are shown.
   1.182 +We wish to show the visualizations in a simple way, to enhance understanding. Hence, this walk-through uses a slightly contrived example in which the application explicitly controls where each unit of work is assigned. As a result, the causal interactions of interest are all constraints stated by language constructs. We don't show causalities internal to the runtime system or hardware, although the tool is capable of turning on their display; we keep them off to simplify this introduction to the use of the visualizations.  
   1.183  
   1.184 -After functional debugging, the first tuning run produces the consequence graph seen in Fig. \ref{fig:story:a}. The first thing to notice is that it is slimmer than expected: of the 40 available cores, only 13 were being used. Because the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection revealed that a closing curly brace in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed despite using this application as a test program for development of the language runtime (while using traditional thread-based performance tools).
   1.185 +Fig. \ref{story} displays all of the scheduling consequence graphs generated during our tuning session. They all use the same scale, for direct comparison. All have 40 columns, one for each core, and relative height indicates relative execution time. The lines in red, orange, and green represent application-code constructs. Red is creation of a virtual processor, green is the many-to-one \texttt{send\_of\_type\_to}, and orange is the singleton construct. For better visibility, only constraints that cross cores are turned on.
   1.186 +
   1.187 +The first tuning run produced the consequence graph seen in Fig. \ref{fig:story:a}. The first thing to notice is that it is slimmer than expected: of the 40 available cores, only 13 were being used. 
   1.188 +
   1.189 +Checking the UCC showed that the usable parallelism was limited. The application divides the work itself, so this was a bug in the dividing code. A cursory inspection revealed that a closing curly brace in the distribution loop had been misplaced. A standard tuning tool would also have shown the idle cores; however, the UCC made the culprit clear at a glance, with high confidence.
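
The application's real distribution code does not appear in this paper; the hypothetical C fragment below merely illustrates how a misplaced closing brace can silently shrink the set of cores that receive work, which is the flavor of bug found here.

\begin{verbatim}
/* Illustration only; all names are assumptions. */
extern int   total_pieces, num_cores;
extern void *piece_args(int p);
extern void  create_vp(void (*fn)(void *), void *arg, int core);
extern void  work_vp(void *piece);

void distribute_intended(void) {
    int core = 0;
    for (int p = 0; p < total_pieces; p++) {
        create_vp(work_vp, piece_args(p), core);
        core = (core + 1) % num_cores;  /* advance inside the loop */
    }
}

void distribute_buggy(void) {
    int core = 0;
    for (int p = 0; p < total_pieces; p++) {
        create_vp(work_vp, piece_args(p), core); }
    core = (core + 1) % num_cores;      /* brace moved: the advance is
                                           now outside the loop, so all
                                           pieces land on one core     */
}
\end{verbatim}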
   1.190  
   1.191  \subsubsection{Second Run}
   1.192  After fixing this, the next run (Fig. \ref{story:b}) corresponded much more to the expected execution behaviour. However, there remained a noticeable section at the beginning where only 3 cores had work and the other 37 remained idle.
   1.193  
   1.194 -Zooming in on those cores, we see that creation code starts running on core 0, within the creation VP, and then the next block on the core is work. Creation stops, starving the other cores. Looking at the creation code, we see that the creation VP assigns the first work VP to its own core, so that work is now waiting in the queue to execute there. When it creates the second work VP, that creation call switches core 0 to the runtime. When done with creation, the runtime takes the next VP from the queue, which is that waiting work VP. Hence core 0 does the work next instead of continuing with creation (the merits of work stealing or other scheduling strategies are independent from this illustration of how to use this approach to performance tune).
   1.195 +The first way we used the SCG was to zoom in on those cores and check what code was running in each spot. We saw that creation code starts running on core 0, and that the next block on that core is work, so creation stops, starving the other cores. This points us to the creation code, where we see that the explicit assignment commands in it put the first work created onto the creator's own core! Hence core 0 does the work next instead of continuing with creation.
   1.196  
   1.197 -The hypothesis was generated by looking at the code linked to each block and noting the visual pattern that creation code stopped running on core 0. Work code started running instead, and only after it finished did creation code start again. Hence, visual cues led directly to the hypothesis. 
   1.198 +This example may be just as easily solved with standard tuning tools, which also show the idle cores. Some  can additionally link a point in time on a core back to the function executing. However, later examples will show the value of the proposed approach. 
   1.199  
   1.200 -Two solutions come to mind: assign work to the other cores first, so that they would be busy when the creator VP's core switches to work, or else dedicate a core to the creator VP. The first solution has the advantage of preserving performance of the application even when run on a machine with a single-digit number of cores, so we tried it first. 
   1.201 +As for solutions, we exclude automatic load balancing such as work stealing, in order to contrive simple examples that demonstrate the use of the visualization features. Two solutions come to mind: have the creator assign work to its own core last, or dedicate a core to the creator. We applied both solutions, in order. 
   1.202 +%The first solution has the advantage of preserving performance of the application even when run on a machine with a single-digit number of cores, so we tried it first. 
   1.203  
   1.204 -\subsubsection{Third run} Assigning work to the other cores first gave us Fig. \ref{story:c}. The section that was at the top, with idle cores, has disappeared. However, an idle period can still be observed between the first and the second set of work units. The work units have roughly the same length and the work on core 0 starts last, so finishing the core 0 work holds up creation until after the other cores have finished their work (note that work on some cores takes slightly longer because that core performs the copy-transpose singleton, and also variations are caused by cache misses).
   1.205 +\subsubsection{Third run} We got Fig. \ref{story:c} when we changed the code to assign work to the other cores first. The SCG shows that the idle section that was at the top has disappeared. However, an idle period can still be observed between the first and the second set of work units. 
   1.206  
   1.207 -It is also noticeable that in the second set of work units, not enough work pieces remained to fill all cores. 16 out of 40 remained idle at the bottom.
   1.208 +We focused here on the fact that, for the second set of work units, not enough work pieces remained to fill all cores: 16 out of 40 remained idle at the bottom. This pointed to the division code. (Again, other tools would also reveal this conclusion.)
   1.209  
   1.210 -\subsubsection{Fourth and fifth runs} To try to fill the empty columns at the end, we modified the size of the work units using the divide factor that defines work unit size. However, as figures \ref{story:d} and \ref{story:e} show, this did not help. The blank areas between ``bands'' of work can be seen by the red lines to be due to  creation. The increased number of units causes creation to be the bottleneck again, and the time lost between sets grows larger than the time that previously was lost.
   1.211 +\subsubsection{Fourth and fifth runs} To try to fill the empty columns at the end, we modified the size of the work units using the divide factor that defines work-unit size. However, as figures \ref{story:d} and \ref{story:e} show, this did not help, because of the blank areas between ``bands'' of work. This is where the SCG's features began to differentiate themselves. The presence of the red lines told us definitively that creation was the culprit. The increased number of units causes more creation work, and creation is performed in a sequential loop on one core. Further, the SCG reminds us that creation on core 0 begins only after work on core 0 completes. The two factors combined cause the gaps.
   1.212  
   1.213 -\subsubsection{Sixth run} At this point we wanted to try the road not chosen, dedicating a core to the creation VP.
   1.214 +\subsubsection{Sixth and seventh runs} At this point we wanted to try the road not chosen, by dedicating a core to the creation VP.
   1.215  Going back to version b of the code and implementing this solution, instead, leads to Fig. \ref{story:f}. The blank area between the two sets has disappeared, showing a 4\% shorter execution time.
   1.216 +The SCG shows that core 0 was now idle after the creation phase at the beginning, so we moved the receive VP there (Fig. \ref{story:g}). This gave a minimal improvement, from overlapping the result collection with other work.
   1.217 +\subsubsection{Eighth run} The SCG shows there are still empty columns caused by too few work units. However, core 0 no longer does work, so we can now increase the number of work units without causing the gaps between blocks that existed in Fig. \ref{story:e}. We also avoid the large penalty when collecting results, seen at the bottom, because collection has been assigned to the same core as creation and that core is free at the bottom. Fig. \ref{story:h} shows the result of modifying the division factor to increase the number of work pieces, which gives the shortest execution time. One additional feature of the SCG has value here: the coloring that indicates cache behavior. It is noticeable that as work units become smaller, execution time becomes more irregular, and the variation in coloring tells us the reason is poorer cache behavior. This degradation places an upper bound on the useful number of pieces.   
   1.218  
   1.219 -\subsubsection{Seventh and eighth runs}Core 0 was now idle after the creation phase at the beginning, so we moved the receive VP there (Fig. \ref{story:g}). This only minimally improved the time, at this size of work unit, but allows overlapping the result collection with other work, which is an advantage when cutting the work into more pieces, requiring longer collection as in Fig. \ref{story:h}. Overall it is also noticeable that as work units become smaller, execution time becomes more irregular.   
   1.220 +\subsubsection{Holes in the core usage}\label{subsec:holes}
   1.221  
   1.222 -\subsubsection{Holes in the core usage}
   1.223 +In Fig. \ref{story:e}, the true value of the SCG starts to appear. In it, ``holes'' are noticeable. Inspecting these holes more closely, we can see that the stalled blocks are at the ends of orange lines. This tells us definitively that they were waiting upon the completion of a singleton. Other tools may have indicated that singleton constructs spend a modest amount of time blocked, but they wouldn't show how a single blocked singleton had a chain-reaction effect, holding up creation, and in turn causing many cores to sit idle. 
   1.224  
   1.225 -In Fig \ref{story:d}, ``holes'' are noticeable. Inspecting these holes closer, we can see that the stalled blocks are at the end of orange lines. This indicates they were waiting upon the completion of a singleton. The pattern of blocks shows that usually  the singleton unit runs before the work unit, but in these cases the singleton code was delayed until after the work on that core. This is a runtime implementation fluke. The only thing an application programmer can do is change the work size to minimize the impact. (For those curious, the first VP to reach the singleton is granted control, but a ready work VP lands in the queue during the granting activity, so when the runtime finishes granting, the work VP is next, and the VP that now owns the singleton sits and waits for the work to end. All work VPs on other cores that pass through the same singleton also wait.)
   1.226 + Zooming in on the singletons and tracing out the state inside the runtime shows that the empty space is a runtime implementation issue. This analysis required all the forms of information provided by the SCG, such as the knowledge of precisely which units came before and after the one blocked by the singleton, in combination with the fact that it is a singleton construct doing the blocking.  
   1.227  
   1.228  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   1.229  \section{The Model Behind the Visualization}
   1.230  \label{sec:theory}
   1.231 -Now that the usage has been seen, we expand on the model behind the visualizations. The model ties the information together, and understanding it helps in generating  hypotheses from the visualization features. 
   1.232 +The value of the visualizations comes from the causality linkages. Being able to trace causes from idle cores back through hardware, runtime implementation, and scheduling choices provides the added insight that augments the other tools. This linkage tracing is in turn made possible by the model of the structure of parallel computation. In effect, the tool directly visualizes the model.
   1.233  
   1.234 -As seen, the model has two parts, a \emph{Unit \&\ Constraint Collection (UCC)}, and a \emph{Scheduling Consequence Graph} (SCG or just consequence graph).  The UCC indicates the scheduling  choices  the application allows, and so shows what the programmer has control over. The consequence graph says which of those were actually taken during the run and the consequences of that set of choices.
   1.235 +As seen, the model has two parts, a \emph{Unit \&\ Constraint Collection (UCC)}, and a \emph{Scheduling Consequence Graph} (SCG or just consequence graph). The UCC depicts the scheduling choices the application allows, and so shows what the programmer has control over. The consequence graph shows which of those were actually taken during the run, and the causal linkage of consequences within the hardware, the runtime, and succeeding choices.
   1.236  
   1.237 -We give a more precise description of UCC, then consequence graph, in turn.
   1.238 -However, this paper focuses on their application to performance tuning, so we abbreviate here and focus on a formal definition of the full model in a different paper.
   1.239 +In this section, we give a more precise description of UCC, then consequence graph.
   1.240 +However, this paper focuses on their application to performance tuning, so we abbreviate here and give a formal definition of the full model in a different paper.
   1.241  \subsection{Unit \& Constraint Collection}
   1.242 -The UCC contains all the units of work that get scheduled during a run, and all constraints the application places on scheduling those units. This is the simple definition, but unfortunately, this information is not always easy to obtain. The complication is that different classes of application exist, with two degrees of freedom that determine how much of the UCC is actually defined in the application vs the input data, or even in the runtime.
   1.243 +\label{sec:UCC}
   1.244 +A fully specified UCC contains the units of work that get scheduled during a run, and the constraints placed on scheduling those units.  In the simple case, all units and all constraints are fully specified in the application. However, many classes of application exist, and two degrees of freedom  determine how much of the UCC is actually defined in the application vs the input data, or even in the runtime.
   1.245  
   1.246 -Some applications have everything determined in the code, with all units fixed, and all constraints fixed. An example is matrix multiply with fixed size matrices.  But for others, the shape of the UCC is only partially defined by the application code.  Take matrix multiply used in Section \ref{sec:casestudy}, where an input parameter determines the number of units created. Here, the UCC is different for each parameter value. An extreme example is a parallel search through a decision tree with pruning and redistribution of search space, for which the units are a function of both the input data \emph{and} decisions made by the runtime.
   1.247 +An example of a simple application that has all units and all constraints fixed in the source is matrix multiply with fixed size matrices. For other applications, the shape of the UCC is only partially defined by the application code. Take the matrix multiply used in Section \ref{sec:casestudy}, where an input parameter determined the number of units created. There, the UCC was different for each parameter value. At the extreme would be parallel search with load-driven pruning. Pruning the search space means that the units themselves are a function of both the input data \emph{and} the previous pruning decisions made by the runtime.
   1.248  
   1.249  \begin{figure}[ht]
   1.250    \centering
   1.251 @@ -301,7 +318,14 @@
   1.252    \label{fig:UCC_example}
   1.253  \end{figure}
   1.254  
   1.255 -    We call a fully specified UCC a \emph{concrete} UCC.  Every run of an application eventually winds up defining a concrete UCC, such as produced for the performance tuning, as seen  in Fig. \ref{fig:UCC_example}. In it, every unit that was scheduled in the SCG appears,  along with  the application-defined constraints on scheduling them.  For this application, parameters determined the work  division, and so determined the units. Hence, the application alone does not specify the concrete UCC, because the units remain unknown until parameter values are given. In general, the amount of UCC made concrete by the application alone falls into a two-dimensional grid. One dimension covers the units, the other the constraints.
   1.256 +    We call a fully specified UCC a \emph{concrete} UCC.  Every run of an application eventually winds up defining a concrete UCC. The example in   Fig. \ref{fig:UCC_example} was produced for the performance tuning. Every unit that was scheduled in the SCG appears in it,  along with  the application-defined constraints on scheduling them.  The division parameter  determined the units. Hence, the application alone does not specify the concrete UCC, because the units remain unknown until parameter values are given. 
   1.257 +
   1.258 +In general, the amount of UCC made concrete by the application  falls into a two-dimensional grid. One dimension covers the units, the other the constraints, as shown in 
   1.259 +Fig. \ref{fig:UCC_Concreteness}. Each of the two axes marks four kinds of information that may have to be added in order to determine the units and constraints in the final concrete UCC. 
   1.260 +
   1.261 +The UCC may change at multiple points in an application's lifecycle. The position at which a UCC lands on the grid indicates how far it is from being fully concrete. The horizontal position indicates what inputs are still needed to determine the units, and the vertical position the constraints. 0 indicates that the units (constraints) are fully determined by the application code alone; 1 means parameter values also must be known; 2 means input data values also play a role; and 3 means runtime decisions play a role in determining the units (constraints).
   1.262 +
   1.263 +The concept of a UCC and its degree of concreteness can provide value in classifying applications and algorithms, and in predicting what types of scheduling approach and hardware they will perform best on. The concepts also help in understanding what optimizations do to code, and what changes may improve performance on given hardware plus runtime. 
   1.264  
   1.265  \begin{figure}[ht]
   1.266    \centering
   1.267 @@ -310,64 +334,64 @@
   1.268    \label{fig:UCC_Concreteness}
   1.269  \end{figure}
   1.270  
   1.271 -Fig. \ref{fig:UCC_Concreteness} shows the two axes and the four sets of additional information that have to be added to the UCC in order to determine the units and constraints in the final fully concrete UCC. The position an application-derived UCC, or later in the life-line UCC, lands on the grid indicates how far it is from being fully concrete.  The horizontal   position indicates what inputs are still needed to determine the units, and vertical the constraints.  0 indicates that the units (constraints) are fully determined by the application code alone; 1 means parameter values also must be known; 2 means input data values also play a role, and 3 means the units (constraints) can only become known  after runtime scheduling decisions have been made. 
   1.272 +In the concreteness grid, the closer an application-derived UCC is to the origin, the less additional information is needed to obtain a  concrete UCC descendant of it. For example, the UCC labeled A in the figure is fully concrete just from the source code alone. It represents, for example, matrix multiply with fixed size matrices and fixed division. The UCC labeled B requires the input data plus parameters to be specified before its units are concrete, but just parameters to make its constraints fully concrete. Ray-tracing with bounce depth specified as a parameter may be like this. The UCC labeled C only has variability in its constraints, which require input data. An example would be H.264 motion vectors.
   1.273 +But even the least concrete UCC, D, at the end of the diagonal, generates a concrete descendant UCC while a run of the application unfolds.
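
The examples above can be summarized as coordinates on the grid. The sketch below is merely our encoding of Fig. \ref{fig:UCC_Concreteness} in C, with assumed names.

\begin{verbatim}
/* The concreteness grid as (units, constraints) coordinates;
   0..3 follow the axis definitions given earlier. */
enum needed { CODE_ONLY = 0, PLUS_PARAMETERS = 1,
              PLUS_INPUT_DATA = 2, PLUS_RUNTIME_DECISIONS = 3 };

struct concreteness { enum needed units, constraints; };

static const struct concreteness
  fixed_matrix_multiply = { CODE_ONLY, CODE_ONLY },                   /* A */
  ray_tracing   = { PLUS_INPUT_DATA, PLUS_PARAMETERS },               /* B */
  h264_motion   = { CODE_ONLY, PLUS_INPUT_DATA },                     /* C */
  pruned_search = { PLUS_RUNTIME_DECISIONS, PLUS_RUNTIME_DECISIONS }; /* D */
\end{verbatim}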
   1.274 + 
   1.275 +Bear in mind that even a fully concrete UCC still has degrees of freedom when deciding which units to run on which hardware and in what order. Those decisions determine interactions within the hardware, yielding the communication patterns and consequent performance during the run, which the SCG visualizes. 
   1.276  
   1.277 -The closer an application-derived UCC is to the origin, the less additional information is needed to obtain a  concrete UCC descendant of it. The UCC labeled A in the figure is fully concrete just from the source code alone (representing for example, matrix multiply with fixed size matrices and fixed division). The UCC labeled B requires the input data and parameters to be specified before its units are concrete, but just parameters to make its constraints fully concrete (as per ray-tracing, with bounce depth specified as a parameter). The UCC labeled C only has variability in its constraints, which require input data (for example, H.264 motion vectors).
   1.278 -But even the least concrete UCC, out at the end of the diagonal (D in the figure), generates a concrete descendant during a run of the application.
   1.279 - 
   1.280 -Notice, though, that even a fully concrete UCC still has degrees of freedom, in deciding which units to run on which hardware and in what order of execution. These decisions fix interactions within the hardware, to yield the communication patterns and consequent performance seen during the run. 
   1.281  
   1.282 -An added twist is that an application has a life-line, spanning from code all the way through the run, and its representation  may change at the different stages of life. It starts as pristine source, then moves  into specialization where code is translated into different representations than the original, and finally the specialized code is run. A different version of UCC may be generated at each of these points in  the life-line.
   1.283  
   1.284 -For example, specialization may perform a static scheduling, which fixes the units, moving the UCC towards the origin. Alternatively, the toolchain may inject manipulator code for the runtime to use, which lets it divide units during the run when it needs more. The injection of manipulator code makes the UCC less concrete, moving it further from the origin.
   1.285 +As noted, an application has a lifecycle, spanning from editing code all the way through  the run, and its representation  may change at the different stages of life, with corresponding changes to the UCC.
   1.286 + For example, specialization may perform a static scheduling, which fixes the units, moving the UCC towards the origin. Alternatively, the toolchain may inject manipulator code for the runtime to use, which lets it divide units during the run when it needs more units. The injection of manipulator code makes the UCC less concrete, moving it further from the origin.
   1.287  
   1.288 -The progression of UCCs generated during the life-line indicates what entity has the control over which units and constraints appear in the final concrete UCC. Viewing the progression gives insight into what is inside the application programmer's control vs under control of each tool in the toolchain or  the runtime. For example, the original application-derived UCC  indicates what can be done statically: the further out on the diagonal that UCC is, the less scheduling can be done statically in the toolchain.
   1.289 +The progression of UCCs has value in performance tuning  because it indicates what is inside the application programmer's control vs under control of each tool in the toolchain or  the runtime. For example, the original application-derived UCC  shows what can be done statically: the further out on the diagonal that UCC is, the less scheduling can be done statically in the toolchain.
   1.290  
   1.291  In this paper, we do not suggest how to represent UCCs far out on the diagonal. One of those actually indicates a multi-verse of concrete-UCCs. Which of them materializes  depends on the data and what the scheduler decides. We only represent the concrete UCC that materializes during a run and leave the question of representing less concrete ones to future work.
   1.292  
   1.293  \subsection{Scheduling Consequence Graph}
   1.294 -Whereas the UCC concentrates on application-derived information, the second part of the model adds-in effects of runtime  details and hardware. We called it the Scheduling Consequence Graph because it links scheduling decisions to their performance consequences. But it also indicates the role, in the decision, of application, runtime and hardware details.
   1.295 -As a result it identifies instances of lost performance, and links them to the cause of the loss, as seen in Section \ref{sec:casestudy}.
   1.296 +Whereas the UCC has value in understanding the structure of the application,  the second part of the model shows how that structure  was made use of during a run. We called it the Scheduling Consequence Graph because it links scheduling decisions to their performance consequences. It does so by indicating distinct activities in the fundamental logical structures, and linking them causally to each other. As a result it provides a link from each instance of lost performance, back through each element that contributed to causing the loss, as illustrated in Section \ref{subsec:holes}.
   1.297  
   1.298 -Contrary to the UCC, the consequence graph shows the behavior resulting from scheduling decisions actually \emph{made},   from among those \emph{possible.} The UCC shows just the possibilities. Hence, a consequence graph shows \emph{one} of the possible choice-sets allowed by the UCC.
   1.299 +Contrary to the UCC, the consequence graph shows the behavior resulting from scheduling decisions actually \emph{made},   from among those \emph{possible.} The UCC shows just the possibilities. Hence, a consequence graph shows \emph{one} of the possible choice-sets allowed by a concrete UCC.
   1.300  
   1.301 -A consequence graph accounts for each bit of core time. It has boxes and arcs, with the boxes divided into regions. The boxes  each represent all core time assigned to one work unit, with each region inside representing a segment of time that the core was engaged in a specific type of activity.  An arc links  regions (or boxes)  and represents a causality of some kind.
   1.302 +A consequence graph accounts for each bit of core time, arranged into boxes and arcs, with boxes further divided into regions. A box  represents all core time assigned to one work unit. Each region inside represents a segment of time that the core was engaged in a specific type of activity.  An arc links  regions (or boxes)  and represents a causality of some kind.
   1.303  
   1.304 -There is one kind of region for each reason that the core is being used (or being forced idle), and several kinds of arcs, one for each type of causality between regions.
   1.305 +There is one kind of region for each kind of activity that can source or sink a causal link, and a region may source or sink multiple causalities. Each type of causal link is represented by a corresponding type of arc.
   1.306  
   1.307 -The core activities of the region types are: application work, idling while waiting for communication of work data, managing constraints, choosing assignment of work onto cores, and runtime internals. The runtime internals in turn have sub-categories, which we will not detail here, but some of which will be mentioned in Section \ref{sec:Implementation}.
    1.308 +The core activities, and hence the region types, are: doing application work, waiting idly for communication of work data, managing constraints, deciding the assignment of work onto cores, and performing internal runtime activities. The runtime region has sub-activities, which we do not detail here due to space constraints; however, some are mentioned in Section \ref{sec:Implementation}, where we discuss instrumenting a runtime.
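          +
          +In an implementation, these region types might be recorded as a simple tag on each measured interval of core time. A minimal sketch of such an encoding (hypothetical names, for illustration only):
          +\begin{verbatim}
          +typedef enum {         /* hypothetical region-type tags    */
          +    REGION_WORK,       /* doing application work           */
          +    REGION_WAIT_COMM,  /* idle, awaiting work-data comm    */
          +    REGION_CONSTRAINT, /* managing constraints             */
          +    REGION_ASSIGN,     /* deciding work-to-core assignment */
          +    REGION_RUNTIME     /* internal runtime activities      */
          +} RegionType;
          +\end{verbatim}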
   1.309  
   1.310 -There is an arc-type for each category of causality. The arc-types (causality categories) represent: sequential dependency in the base language being satisfied, parallel constraint being satisfied (i.e., one unit did something to satisfy a constraint on the other, causing it to be free to be scheduled), runtime internal causality such as a global lock (runtime on one core releases the lock, causing the other to acquire it), and arcs that represent hardware causal relationships (one work-unit finishing on a core causes another work-unit to start there, given the choice by the runtime).
   1.311 +Arcs are gathered into groups according to the nature of the causality they represent.  The kinds of causal links are: satisfaction of a sequential dependency in the base language; satisfaction of a parallel constraint  (i.e., one unit did something to satisfy a constraint on the other, causing it to be free to be scheduled); a causal link internal to the runtime (for example, the runtime on one core releasing a shared lock, causing the other core to acquire it); and causal links in the hardware (for example, one work-unit finishes on a core, causing another work-unit to start there, modulo a choice by the runtime).
   1.312  
   1.313 -We will now look at each source of causal relationship.
   1.314 +We will now expand on each of those kinds of causal link.
   1.315  
   1.316 -\paragraph{Constraint  causality} There is a constraint causality when  action by one unit causes (or contributes to) satisfaction of the constraint blocking the other unit. This includes sequential dependencies from the base language.
    1.317 +\paragraph{Constraint causal link} Two entire boxes (units) are linked this way when an action by one unit contributes to satisfying a constraint blocking the other unit. This includes sequential dependencies from the base language (which are noted in the tool but normally not displayed).
   1.318  
    1.319  Sequential dependencies may add superfluous constraints that eliminate some otherwise allowed choices in the UCC. An example would be a \texttt{for} loop that creates work-units -- no parallelism construct requires the creations to be done in sequence, but the base C language sequentializes them nonetheless, as the sketch below illustrates.
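          +
          +A minimal sketch, using \texttt{pthread\_create} merely as a stand-in for whatever work-unit creation construct the language provides:
          +\begin{verbatim}
          +/* No parallelism construct orders these creations,
          +   yet base-C loop semantics run the create calls one
          +   after another -- a superfluous sequential constraint. */
          +for (int i = 0; i < N; i++)
          +    pthread_create(&tid[i], NULL, do_work, &args[i]);
          +\end{verbatim}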
   1.320  
   1.321 -\paragraph{Runtime internal causality} Runtime implementation details may introduce ``extra" causalities between units. For example, the runtime we instrumented for this paper runs separately on each core and relies upon a global lock for accessing shared runtime information. This lock introduces a causal relationship  when the runtime on one core is attempting to process one unit, but must wait for the runtime on a different core to finish with its unit.
   1.322 +\paragraph{Runtime internal causal link} Runtime implementation details may introduce ``extra" causalities between units. For example, the runtime we instrumented for this paper runs separately on each core and relies upon a global lock for accessing shared runtime information. This lock introduces a causal relationship  when the runtime on one core is attempting to process one unit, but must wait for the runtime on a different core to finish with its unit.
   1.323  
   1.324   Normally, these are not displayed explicitly, due to clutter, but can be turned on when needed, for instance to determine the cause of a particular pattern of core usage.
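          +
          +A hedged sketch of the lock pattern just described, assuming a pthreads-style global lock (the surrounding function name is hypothetical):
          +\begin{verbatim}
          +pthread_mutex_lock(&runtime_lock);   /* may block on the
          +                                        runtime of another
          +                                        core: the causal link */
          +update_shared_sched_state(unit);     /* hypothetical */
          +pthread_mutex_unlock(&runtime_lock); /* release causes the
          +                                        waiting core's runtime
          +                                        to proceed */
          +\end{verbatim}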
   1.325  
   1.326 -\paragraph{Hardware causality} The physical fact that a given resource can only be used by one work-unit at a time introduces hardware causalities. When multiple units are free to execute, but all cores are busy, then completion of a unit  on one core causes (in part) the next ready unit to run on that core. 
    1.327 +\paragraph{Hardware causal link} The physical fact that a given resource can only be used by one work-unit at a time introduces hardware causalities. When multiple units are free to execute but all cores are busy, completion of a unit on one core causes (in part) the next ready unit to run on that core. 
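          +
          +In a work-queue style runtime, this causal link is visible in the worker loop; a sketch with hypothetical names:
          +\begin{verbatim}
          +Unit *u;
          +while ((u = dequeue_ready_unit(coreId)) != NULL) {
          +    execute(u);   /* core occupied until u completes;      */
          +}                 /* completion (partially) causes the     */
          +                  /* next ready unit to start on this core */
          +\end{verbatim}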
   1.328  
    1.329  These are also not normally displayed, due to clutter, and not all hardware dependencies are directly measured. Future work will focus on using performance counters and other instrumentation to add more information about the communication paths taken as a consequence of the scheduling decisions made. It will start with the current linkage of application code to runtime decisions, and add the consequent usage of communication hardware. This gives an end-to-end linkage between runtime choices and the caused behavior on the hardware. 
   1.330  
   1.331 -Consequence graph features each tie back to features in the UCC and thence to specific segments of code or constructs.
   1.332  
   1.333 -\subsection{Levels of UCC and Consequence Graph}
   1.334 -There is one last twist to the story of UCCs and consequence graphs, which is that there are levels of them that correspond to the levels of scheduler in a hierarchical machine. We use an example involving a server machine with a hierarchy of runtimes to illustrate both, concentrating first on just the UCCs, then adding the consequence graph. 
   1.335 +\subsection{Breadth of Applicability of UCC and SCG}
   1.336  
   1.337 -\subsubsection{Levels of UCC}
   1.338 -As an example, consider a server with one rack, having a back-plane that boards plug into. A board has its own memory with four sockets, each having a chip with four cores. So there is a back-plane network connecting the boards, a bus on each board that connects the sockets to the DRAM, and inside the chip in each socket is a cache hierarchy that connects the cores.
    1.339 +The question arises of how widely applicable the model is, especially for heterogeneous machines, such as ones that have multiple levels of runtime, which implies multiple levels of scheduler. What do the UCC and SCG look like in this case?
   1.340  
   1.341 -The hardware is given a set of runtimes to match the hierarchy. Each network or bus has a runtime that schedules work onto the things connected below it. So the top runtime divides work among the boards, while a board's runtime divides work among the sockets, and a socket's  runtime divides work among the cores.  
   1.342  
   1.343 -To a runtime high up, each runtime below it looks like a complete machine. It schedules work-units to those machines, without knowing the internal details of how that machine is implemented. So the runtime at the top handles very large work-units that it schedules onto the runtimes on the boards. A board-level runtime  divides up the work-unit it gets into smaller work-units, then schedules one onto each socket's runtime, and so on.
   1.344  
   1.345 -The application in this example has been written in a language that allows work to be divided. The toolchain inserted a manipulator that allows each runtime to divide up the work it is given into smaller work-units, such as via the DKU pattern \cite{DKUTechRep_09}. This pushed the UCC of the application all the way to the right on the unit axis.  
   1.346  
   1.347 -So what does the concrete UCC produced during a run look like? Well, a unit is defined as the work resulting from one scheduling decision. Each runtime has its own scheduler, which means units are defined for each runtime.  That in turn means that each runtime has its own concrete UCC.
    1.348 +To answer this, let us explore the specific case of a cluster in one rack. The rack has a back-plane that boards plug into, while a board has its own memory and four sockets, each holding a chip with four cores. Correspondingly, the machine has separate physical networks: a back-plane network connecting the boards; an inter-socket network on each board; and a cache hierarchy connecting the cores in a socket.
   1.349 +
   1.350 +A runtime is provided for each network, to distribute work among the networked components. The top runtime divides work among the boards, then a board's runtime divides work among the sockets, and a socket's  runtime divides work among the cores.  
   1.351 +
    1.352 +To a given runtime, each runtime below it appears as an indivisible machine. So the backplane runtime treats each board as a complete machine, creates work for each board, and schedules it. To send work to a board, it communicates with the runtime that manages that board, but without knowing the internal details of that runtime; it just sees an opaque ``machine'' that it gives work to. Hence the top runtime schedules onto board-level runtimes, which each schedule onto socket-level runtimes, and so on. This multi-level nature should be reflected in the UCC and SCG. 
   1.353 +
    1.354 +To facilitate this approach, the application has been written in a language that provides a construct for work-unit division. For each construct instance, the toolchain inserts a corresponding manipulator, which enables each runtime to divide up the work it is given into smaller work-units (a sketch appears below). Examples include the DKU pattern \cite{DKUTechRep_09}, TBB's loop constructs \cite{LangTBBRef}, HPF's matrix division facilities \cite{LangHPF}, and so on. The effect of inserting the manipulator on the \textit{application} UCC is to push it all the way to the right on the unit axis.  
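          +
          +A sketch of such a manipulator in the DKU style (all names hypothetical; the real interfaces differ per language):
          +\begin{verbatim}
          +/* Each runtime level divides the unit it was given into
          +   sub-units sized for the "machines" one level below. */
          +void schedule_here(Unit *u, Runtime *rt) {
          +    Unit *sub[MAX_SUB];          /* MAX_SUB: assumed bound */
          +    int n = rt->numChildren;     /* boards, sockets, cores */
          +    divide(u, n, sub);           /* toolchain-inserted     */
          +    for (int i = 0; i < n; i++)
          +        give_work(rt->child[i], sub[i]);
          +}
          +\end{verbatim}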
   1.355 +
    1.356 +\subsubsection{Levels of UCC} So what does the \textit{concrete} UCC produced during a run on this machine look like? Well, scheduling decisions bound a work-unit, and those decisions take place inside a runtime. Hence units are defined relative to a runtime. Each runtime, then, has its own concrete UCC for the units it schedules.
   1.357  \begin{figure}[ht]
   1.358    \centering
   1.359    \includegraphics[width = 3.2in, height = 1.8in]{../figures/UCC_Levels_2.pdf}
   1.360 @@ -375,27 +399,32 @@
   1.361    \label{fig:UCC_Levels}
   1.362  \end{figure}
   1.363  
   1.364 -Fig. \ref{fig:UCC_Levels} shows that these UCCs are related to each other in the same hierarchy as the runtimes. A unit scheduled by one runtime is in turn broken into sub-units that exist at a lower level. Each of those units is  then separately scheduled by the lower-level runtime, according to a UCC just for them. So, as the figure shows,  a unit in one UCC has an entire lower-level UCC inside it.
    1.365 +As seen in Fig. \ref{fig:UCC_Levels}, the concrete UCCs are related to each other in the same hierarchy as the runtimes. A single unit scheduled by one runtime contains an entire concrete UCC at the level below: the unit is broken into sub-units, each separately scheduled by the lower-level runtime, and those sub-units, together with the constraints amongst them, define a new concrete UCC. As the figure shows, each unit in one UCC thus has an entire lower-level UCC inside it, in a hierarchy matching the runtimes.
   1.366  
   1.367  \subsubsection{Levels of Consequence Graph}
   1.368 -With multiple levels of scheduling activities, a scheduling consequence graph can be constructed for each level. 
   1.369 +What does this multi-level UCC mean for the SCG? Well, each level of scheduling activity has an associated scheduling consequence graph, hence each concrete UCC has a corresponding SCG. 
   1.370  
   1.371 -For a given level, a lower-level runtime is seen as a single processor entity. So, what is the consequence of deciding to schedule a given work unit onto one of those  processor entities? When the goal is  improving performance, the consequence of interest is the time taken until completion of the work unit.
    1.372 +But in the SCGs we have seen, each column represents time on a core, so what does a column mean in a higher-level SCG, where it represents a runtime instead of a core? This comes down to discovering the consequence of scheduling a given work-unit onto a particular runtime. When the goal is improving performance, the consequence of interest is the time taken until completion of the work-unit.
   1.373  
   1.374 -Hence, the question becomes, when can the upper-level work unit be considered completed? When all the lower-level units of which it is comprised are completed, plus any additional work spent dividing up the work and combining results. Once the lower-level consequence graph is established, this  time can be easily determined: it is the critical path across the lower-level consequence graph plus the additional, non-overlapped, time spent creating the lower-level units and combining results into the format of the higher-level unit. The divide and recombine are the creation of the lower level, and so not part of it.     
    1.375 +Hence, the question becomes: when can the upper-level work-unit be considered completed? When all the lower-level units that comprise it are completed, plus any additional work spent dividing up the work beforehand and combining results afterward. Once the lower-level consequence graph is established, this time is easily determined: it is the critical path across the lower-level consequence graph, plus the additional, non-overlapped time spent creating the lower-level units and combining results into the format of the higher-level unit. The divide and recombine steps are not part of the lower-level UCC and SCG, because they are what create them.
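          +
          +Stated compactly, in notation introduced here only for illustration (with $CP(\cdot)$ denoting the critical path across the lower-level SCG, and the superscript $\mathrm{no}$ marking non-overlapped time):
          +\[
          +T_{\mathrm{upper}} \;=\; CP(\mathrm{SCG}_{\mathrm{lower}}) \;+\; T^{\mathrm{no}}_{\mathrm{divide}} \;+\; T^{\mathrm{no}}_{\mathrm{recombine}}
          +\]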
   1.376  
   1.377 -Consider the concrete example of the SSR matrix multiply application from Section \ref{sec:casestudy}. To start we go up a level, above the process created to execute the matrix multiply. Here, a unit is the execution of an entire application, invoked via OS commands. The  work-time of that unit is the execution time of the application, from the first instruction of the application code until the last. This time equals  the critical path through the matrix multiply work, plus creation of the various VPs, and collection of the results by the results VP. 
   1.378 +\subsubsection{Applying the multi-level UCC and SCG}     
   1.379  
   1.380 -If the case study had been performed in hierarchical fashion, then the SCG seen there would be divided between levels, with creation and results accumulation  moved to the higher application-level SCG. The lower-level would only contain the work of the multiply. 
    1.381 +Let us apply this multi-level definition to the matrix multiply from Section \ref{sec:casestudy}. We start at a level above the process that was created to execute the application. Here, a unit is the execution of an entire application, invoked via OS commands. The UCC at that level treats each application execution as a unit. The SCG at that level treats the ``runtime'' region as the time spent by the OS creating a new process and running the loader and linker. The work time of a unit is the execution time of the application, from its first instruction until it exits.
   1.382  
   1.383 -Note also that the divide and recombine are themselves composed of smaller work-units.  However, they are not visible to units in the work portion, although there are causalities between them. The divide units end with creation of work-level units, while the end of work-level units causes creation of recombine units.
   1.384 -Hence, divide and recombine are inside of the higher-level unit while not part of the lower-level SCG. They can be conveniently displayed in a single visualization because of the clean connection between levels provided by the create and end causalities.   
    1.385 +The application then contains code that breaks up the work, code that creates virtual processors to perform it, and code that collects the results. Breaking up the work consists of the calculations that determine the dimensions of each sub-matrix and pair up the sub-matrices. Each pair makes one work-unit and has a virtual processor created to perform the multiply. These work-units comprise the second-level UCC, which is the UCC shown in \S\ref{sec:UCC}. The corresponding second-level SCGs are the ones shown during the walk-through.   
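          +
          +A hedged sketch of that divide phase (API names hypothetical, not SSR's actual calls):
          +\begin{verbatim}
          +for (int i = 0; i < numRowBlocks; i++)
          +    for (int j = 0; j < numColBlocks; j++) {
          +        Pair *p = pair_up(subA[i], subB[j]); /* one unit  */
          +        create_VP(multiply_pair, p);  /* VP does multiply */
          +    }
          +\end{verbatim}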
   1.386  
   1.387 -To complete the discussion, we consider going down another level, into the physical processor.  In modern super-scalar multi-cores, each core is a parallel processor composed of functional blocks such as arithmetic or floating-point blocks, along with logic that issues instructions to those blocks. We first break down a single SSR work unit of the kind seen in Section \ref{sec:casestudy} into smaller work units, namely individual instructions. Then we break instructions down into pipeline stages.
   1.388  
   1.389 -The shape of the UCC for the instructions inside a single SSR level work-unit is a straight line of sequential dependencies. However, the pipeline stages inside a core modify the UCC. They analyse the register use dependencies between instructions and generate a new UCC consisting of  functions constrained by the data dependencies. The dependencies are enforced by the issue logic, which  goes on to assign ready functions to the functional blocks.  The choices made by the issue logic  determine the shape of the SCG for this level, and the critical path of that SCG is the work time of the higher, SSR level, work-unit.
   1.390  
   1.391 -The stages spent performing fetch, decode, analysis, issue, and retirement correspond to functions in a runtime. Fetch corresponds to creation of the instruction-level work-unit. Decode and analysis logic generate the new UCC, as part of the runtime. Issue logic performs the scheduling function of a runtime. Retirement is also part of enforcing the dependencies, and so is also part of the runtime's scheduler. Most of the overhead of these stages is overlapped by the pipeline effect, but can contribute to the critical path during pipeline disruptions like mis-predicted branches. It is counted as non-overlapped runtime overhead.   
   1.392 +%The divide and recombine, in this case, are themselves composed of smaller work-units.  However, they are not visible to units in the work portion, although there are causalities between them. The divide units end with creation of work-level units, while the end of work-level units causes creation of recombine units.
   1.393 +
   1.394 +%Hence, divide and recombine are inside of the higher-level unit while not part of the lower-level SCG. They can be conveniently displayed in a single visualization because of the clean connection between levels provided by the create and end causalities.   
   1.395 +
   1.396 +%To complete the discussion, we consider going down another level, into the physical processor.  In modern super-scalar multi-cores, each core is a parallel processor composed of functional blocks such as arithmetic or floating-point blocks, along with logic that issues instructions to those blocks. We first break down a single SSR work unit of the kind seen in Section \ref{sec:casestudy} into smaller work units, namely individual instructions. Then we break instructions down into pipeline stages.
   1.397 +
   1.398 +%The shape of the UCC for the instructions inside a single SSR level work-unit is a straight line of sequential dependencies. However, the pipeline stages inside a core modify the UCC. They analyse the register use dependencies between instructions and generate a new UCC consisting of  functions constrained by the data dependencies. The dependencies are enforced by the issue logic, which  goes on to assign ready functions to the functional blocks.  The choices made by the issue logic  determine the shape of the SCG for this level, and the critical path of that SCG is the work time of the higher, SSR level, work-unit.
   1.399 +
   1.400 +%The stages spent performing fetch, decode, analysis, issue, and retirement correspond to functions in a runtime. Fetch corresponds to creation of the instruction-level work-unit. Decode and analysis logic generate the new UCC, as part of the runtime. Issue logic performs the scheduling function of a runtime. Retirement is also part of enforcing the dependencies, and so is also part of the runtime's scheduler. Most of the overhead of these stages is overlapped by the pipeline effect, but can contribute to the critical path during pipeline disruptions like mis-predicted branches. It is counted as non-overlapped runtime overhead.   
   1.401  
   1.402  %To give insight into going from the SSR work-unit level up to the application-level, let us now go from the instruction level up to the SSR unit level, `looking up' so to speak. Considering overhead, from inside the core,  the SSR level overhead looks just like more instructions. 
   1.403  
   1.404 @@ -411,8 +440,8 @@
   1.405  
   1.406  %The last question is how to handle communication consequences, which result from decisions made in all levels? The decisions in higher-level runtimes set the context for decisions in lower-level ones, which links a higher-level choice  to the consequences resulting from the lower-level choices. But the value of a consequence graph comes from linking the size of boxes in it to the decisions made by the scheduler. It's not clear how to divide-up  the time  cores spend waiting for non-overlapped communication, to assign portions to different levels. We have no good answer at the moment and leave it for future work.
   1.407  
   1.408 -In practice, the question of levels does not come up very often for multi-core machines, as there is generally only one level of interest being investigated. However, the perspective is occasionally important here, e.g. when identifying the different components of the model in a concrete runtime so as to insert instrumentation code in the right places.
   1.409 -
    1.410 +In practice, the perspective of levels should become more relevant as machine complexity increases, for example when considering heterogeneous machines, large servers, and clusters.
   1.411 +Viewing the system in this way may lead to insights about runtime design, which can benefit from a hierarchical architecture \cite{HWSupportStanford10}. Incidentally, the multi-level view extends down into the compilation process, and below to the instruction level and the hardware itself.
   1.412  
   1.413  
   1.414  
   1.415 @@ -548,48 +577,67 @@
   1.416  
   1.417  
   1.418  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   1.419 -\section{Background and Related Work}
   1.420 +\section{Related Work}
   1.421  \label{sec:related}
   1.422  
   1.423 -When choosing a performance tuning tool, three primary factors to consider are:
   1.424 -\vspace{-3pt}
   1.425 -\begin{itemize} 
   1.426 -\setlength{\itemsep}{0em}
   1.427 -\item the ease of recognizing performance loss
   1.428 -\item the ease of forming a hypothesis to explain the loss
   1.429 -\item the ease of linking the hypothesis to changes to make in the application code
   1.430 -\end{itemize}
   1.431 -\vspace{-3pt}
   1.432 -The mental model that underlies the tool plays a key role in these three usefulness factors. This can be seen  by examining the  activity of performance tuning.  It is an iterative activity. To start, the programmer takes measurements during execution and compares them to the desired outcome. A hypothesis is generated for what caused any differences, by using a mental model, of causal relationships within  execution. The mental model is then also used to link the hypothesis to code, to suggest a change to make to improve performance. After modification, the code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central during performance tuning.
    1.433 +Many good tools exist for identifying hot-spots in code and for tuning the sequential aspects there, as well as the data layout. Some also do a good job of identifying specific constructs in the code that spend a long time blocked, and of identifying cores that sit idle. However, when it comes to determining why those cores sit idle at those spots, it would be desirable to have an additional tool with the following characteristics:
   1.434  
   1.435 -As a result of its role when using  a tuning tool, the mental model underlying the tool is   a major focus when comparing  tools.  The mental model of our approach is based on a new model of parallel computation and is the main way  our approach differs from other approaches. The new model makes it  easier to form a hypothesis of the cause of performance losses, and easier to link that back to changes to make in application code.
   1.436  
   1.437 - Historically, the dominant mental model for parallel computation has been the software-thread abstraction. The model is very close to the hardware, without much higher-level semantics. This makes it  a problem when analyzing parallel performance. Yet as the earliest and most widespread parallel programming model, and one that serves as the basis on top of which many other models are implemented, it must be supported.
   1.438  
   1.439 -A popular version is  pthreads, which introduces randomness through OS time-slicing, and whose synchronization constructs  imply only indirect connections. For instance, very little conclusion can be drawn about the relationship between the computations in two separate threads when they access the same lock consecutively. Consequently, tools such as Paradyn \cite{PerfToolParadyn} or VTune \cite{PerfToolVTune} that rely on this model  have a hard time connecting measurements to the application code details. They collect a wealth of statistics, but the application is to the tool a foreign process, where things ``just happen'' without causal links between events and code. Such tools generate conclusions such as ``application bottleneck is synchronization'', and  ``too much time is spent in spinlocks'', without help as to what in the code causes this. For example, if there is more than one use of spinlocks in the application, it is not clear what point in the code is the problem.
   1.440 -
   1.441 -One fix to these issues is to allow users to introduce measuring points into their own code, at the cost of increased programmer effort.  Because instrumentation code is written in the source language, it has access to application concepts. This advantage can be kept with automated instrumentation, by providing an instrumenting compiler, like the Tau \cite{PerfToolTau} project does.
   1.442 -
   1.443 -However, as long as the underlying execution model is still threads, there is no meaningful structure common to all applications to use to generate expressive measurement quantities. Function boundaries and the call graph are not sufficient when parallel execution is performed. The sequence and frequency of function calls tell little about parallel performance impacts because they have no bearing on synchronization events.  Unfortunately, pthreads does not capture even hints as to \emph{why} a given function call ends up blocking or not blocking, and what the effects on other threads are.
   1.444 -
   1.445 -Higher level parallel abstractions help alleviate that particular problem. For instance, in an application with MPI message passing \cite{MPI}, the information ``thread 2 spends little time waiting for messages from thread 0 but a lot of time waiting for messages from thread 1'' can be recorded, where in pthreads only ``thread 2 spends a lot of time waiting for a signal'' would be visible. It is much easier to reach the conclusion that the bottleneck is the slow rate at which thread 1 produces data.
   1.446 -
   1.447 -Leveraging the MPI library is a well-trodden path, with many tools available \cite{PerfToolParaver} \cite{PerfToolVampir} \cite{PerfToolParadyn} \cite{PerfToolTau}. But a frequent complaint of users of these tools is that in complex programs, the large number of messages sent makes it difficult to see problems. As there is little that distinguishes messages besides size, it is difficult to implement good filtering. Once again, we find that the limitations come from a parallel model that captures insufficient information.
   1.448 -
   1.449 -Fortunately, parallel languages are evolving in the same direction, and it is widely believed that semantically rich parallel constructs, when they match the structure of the application well, are key to improved productivity. With languages such as CnC \cite{CnCInHotPar} and StarSs \cite{StarSs}, the information available to the language runtime offers a much clearer picture of the constraints  placed on the execution of tasks and where they come from. Consequently, the application structure can be well reflected in visualizations, as the Temanejo debugger  \cite{PerfToolStarSs} for StarSs demonstrates.
   1.450 -
   1.451 -In the following section, we will demonstrate how our approach overcomes the various challenges of these other approaches through a combination of the following features:
   1.452 +    
   1.453  \vspace{-3pt}
   1.454  \begin{itemize}
   1.455  \setlength{\itemsep}{0em}
   1.456 -\item  it visualizes all the factors influencing scheduling decisions, given sufficient language and runtime pre-requisites
   1.457 -\item it has a filtering mechanism that shows only the most relevant groups of constraints, as desired
   1.458 -\item it has flexibility to capture and represent any type of constraints
   1.459 -\item it works for any parallel programming library or language
    1.460 +\item  it visualizes the essential factors influencing scheduling decisions and the resulting causal chains, given sufficient language and runtime prerequisites
   1.461 +\item it has a filtering mechanism that shows only the most relevant groups of constraints and causal interactions, as desired
   1.462 +\item it has flexibility to capture and represent any type of constraint
   1.463 +
   1.464  \item it instruments only the runtime, once,  for all applications
   1.465  \end{itemize}
   1.466  \vspace{-5pt}
   1.467  
    1.468 + The tool we propose here has these characteristics and would be used to augment the existing tools when the sequence of causal interactions leading to idle cores is not readily apparent. 
   1.469 +
    1.470 +Of the existing tools, thread-oriented ones such as Paradyn \cite{PerfToolParadyn} and VTune \cite{PerfToolVTune} collect profiling information. This gives powerful insight into which functions dominate execution, and aids in tuning aspects related to the sequential performance of those functions, including data layout. They clearly display periods during which cores sit idle, but help less in determining the chain of interactions that causes the idleness.
   1.471 +
    1.472 +For message-passing based software, many tools are available, such as Paraver \cite{PerfToolParaver}, Vampir \cite{PerfToolVampir}, Paradyn \cite{PerfToolParadyn}, and Tau \cite{PerfToolTau}. These give insight into which tasks communicate with which other tasks, which adds more causality information than is available in the thread-oriented tools.  
   1.473 +
    1.474 +But a frequent complaint of users of these tools is that the cause of idleness is difficult to deduce. One reason is that semantic information about the messages is missing: little distinguishes them besides size, which also makes it difficult to filter out uninteresting messages that cause clutter. Although the tool we propose measures the same message interactions, plus additional sources of causal interaction, it can declutter based on the kind of causality, such as runtime-implementation-internal causality, and can also filter based on the semantic purpose of a given message. The potential also exists to pick a particular idle period and display a trace of the causal chain that contributed to it.
   1.475 +
    1.476 +HPCToolkit \cite{PerfBlameShifting12} uses blame-shifting to identify which constructs spend a long time blocked. However, once a construct is identified, the user could still use help in understanding why it blocks for so long, and why no other work fills in.
   1.477 +
    1.478 +With languages such as CnC \cite{CnCInHotPar} and StarSs \cite{StarSs}, the information available to the language runtime offers a much clearer picture of the constraints placed on the execution of tasks and where they come from. Consequently, the application structure can be well reflected in visualizations, as the Temanejo debugger \cite{PerfToolStarSs} for StarSs demonstrates. However, that tool provides limited insight into causes: it assumes that the only option available is changing task size, and so displays only which tasks are too big and which too small, based on measured runtime overhead.
   1.479 +
   1.480 +
   1.481 +%When choosing a performance tuning tool, three primary factors to consider are:
   1.482 +%\vspace{-3pt}
   1.483 +%\begin{itemize} 
   1.484 +%\setlength{\itemsep}{0em}
   1.485 +%\item the ease of recognizing performance loss
   1.486 +%\item the ease of forming a hypothesis to explain the loss
   1.487 +%\item the ease of linking the hypothesis to changes to make in the application code
   1.488 +%\end{itemize}
   1.489 +%\vspace{-3pt}
   1.490 +%The mental model that underlies the tool plays a key role in these three usefulness factors. This can be seen  by examining the  activity of performance tuning.  It is an iterative activity. To start, the programmer takes measurements during execution and compares them to the desired outcome. A hypothesis is generated for what caused any differences, by using a mental model, of causal relationships within  execution. The mental model is then also used to link the hypothesis to code, to suggest a change to make to improve performance. After modification, the code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central during performance tuning.
   1.491 +
   1.492 +%As a result of its role when using  a tuning tool, the mental model underlying the tool is   a major focus when comparing  tools.  The mental model of our approach is based on a new model of parallel computation and is the main way  our approach differs from other approaches. The new model makes it  easier to form a hypothesis of the cause of performance losses, and easier to link that back to changes to make in application code.
   1.493 +
   1.494 +% Historically, the dominant mental model for parallel computation has been the software-thread abstraction. The model is very close to the hardware, without much higher-level semantics. This makes it  a problem when analyzing parallel performance. Yet as the earliest and most widespread parallel programming model, and one that serves as the basis on top of which many other models are implemented, it must be supported.
   1.495 +
   1.496 +%A popular version is  pthreads, which introduces randomness through OS time-slicing, and whose synchronization constructs  imply only indirect connections. For instance, very little conclusion can be drawn about the relationship between the computations in two separate threads when they access the same lock consecutively. Consequently, tools such as Paradyn \cite{PerfToolParadyn} or VTune \cite{PerfToolVTune} that rely on this model  have a hard time connecting measurements to the application code details. They collect a wealth of statistics, but the application is to the tool a foreign process, where things ``just happen'' without causal links between events and code. Such tools generate conclusions such as ``application bottleneck is synchronization'', and  ``too much time is spent in spinlocks'', without help as to what in the code causes this. For example, if there is more than one use of spinlocks in the application, it is not clear what point in the code is the problem.
   1.497 +
   1.498 +%One fix to these issues is to allow users to introduce measuring points into their own code, at the cost of increased programmer effort.  Because instrumentation code is written in the source language, it has access to application concepts. This advantage can be kept with automated instrumentation, by providing an instrumenting compiler, like the Tau \cite{PerfToolTau} project does.
   1.499 +
   1.500 +%However, as long as the underlying execution model is still threads, there is no meaningful structure common to all applications to use to generate expressive measurement quantities. Function boundaries and the call graph are not sufficient when parallel execution is performed. The sequence and frequency of function calls tell little about parallel performance impacts because they have no bearing on synchronization events.  Unfortunately, pthreads does not capture even hints as to \emph{why} a given function call ends up blocking or not blocking, and what the effects on other threads are.
   1.501 +
   1.502 +%Higher level parallel abstractions help alleviate that particular problem. For instance, in an application with MPI message passing \cite{MPI}, the information ``thread 2 spends little time waiting for messages from thread 0 but a lot of time waiting for messages from thread 1'' can be recorded, where in pthreads only ``thread 2 spends a lot of time waiting for a signal'' would be visible. It is much easier to reach the conclusion that the bottleneck is the slow rate at which thread 1 produces data.
   1.503 +
   1.504 +%Leveraging the MPI library is a well-trodden path, with many tools available \cite{PerfToolParaver} \cite{PerfToolVampir} \cite{PerfToolParadyn} \cite{PerfToolTau}. But a frequent complaint of users of these tools is that in complex programs, the large number of messages sent makes it difficult to see problems. As there is little that distinguishes messages besides size, it is difficult to implement good filtering. Once again, we find that the limitations come from a parallel model that captures insufficient information.
   1.505 +
   1.506 +%Fortunately, parallel languages are evolving in the same direction, and it is widely believed that semantically rich parallel constructs, when they match the structure of the application well, are key to improved productivity. With languages such as CnC \cite{CnCInHotPar} and StarSs \cite{StarSs}, the information available to the language runtime offers a much clearer picture of the constraints  placed on the execution of tasks and where they come from. Consequently, the application structure can be well reflected in visualizations, as the Temanejo debugger  \cite{PerfToolStarSs} for StarSs demonstrates.
   1.507 +
   1.508 +%In the following section, we will demonstrate how our approach overcomes the various challenges of these other approaches through a combination of the following features:
   1.509 +
   1.510  
   1.511  
   1.512  
   1.513 @@ -598,11 +646,11 @@
   1.514  \section{Conclusion}
   1.515  \label{sec:conclusion}
   1.516  
   1.517 -We have shown how to apply a new, general model of parallel computation to build  performance visualizations that simplify identifying instances of performance loss and linking them to details of the application code responsible. They rely only on information collected through instrumenting the language runtime, with no modification to the application.
   1.518 +We have shown how to apply a recently discovered  model of the structure of parallel computation to generate  visualizations of the sequence of interactions that cause  cores to sit idle during a run. Visualization features help identify suspect causal interactions and link them back to related application code. The approach relies only on information collected through instrumenting the language runtime, with no modification to the application.
   1.519  
   1.520 -By integrating visual display of constraints due to application structure, language runtime implementation, and hardware features, all relevant causes for performance loss are covered. The semantic information collected allows filtering   for the relevant types of constraints, to avoid overcharging the display.
    1.521 +The essential causes of idleness are collected and available for display, including constraints due to application structure, language runtime implementation, and hardware features. The semantic information collected allows filtering for the relevant types of constraints, to avoid overloading the display.
   1.522  
   1.523 -We demonstrated, with a case study, how this  improves usability and eliminates  guesswork, by providing a direct path to details in the application code where changes should be made. These benefits derive from the computation model, which focuses on the aspects of parallelism relevant to performance in a way that makes generation of the correct hypothesis for performance loss straight forward.  
    1.524 +We demonstrated, with a case study, how to use the approach, which can provide a valuable addition to other performance tuning tools once tuning has shifted to scheduling-related phenomena. The approach can reduce guesswork, providing a path to details in the application code where changes should be made.  
   1.525  
   1.526  %I'd like to avoid weaknesses of our approach, in the conclusion.. and this wasn't discussed much in the body.
   1.527  %As the approach relies on information available to the runtime, we expect that even better results will be observed for ``high-level'' parallel languages that more closely match application concepts instead of hardware concepts.
     2.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/bib_for_papers_12_Jy_15.bib	Mon Dec 10 05:51:06 2012 -0800
     2.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/bib_for_papers_12_Jy_15.bib	Mon Dec 31 16:07:54 2012 -0800
     2.3 @@ -1,3 +1,37 @@
     2.4 +@book{LangTBBRef,
     2.5 +  title={Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism},
     2.6 +  author={Reinders, J.},
     2.7 +  year={2007},
     2.8 +  publisher={O'Reilly Media, Incorporated}
     2.9 +}
    2.10 +@article{LangHPF,
    2.11 +  title={High Performance Fortran},
    2.12 +  author={Loveman, D.B.},
    2.13 +  journal={Parallel \& Distributed Technology: Systems \& Applications, IEEE},
    2.14 +  volume={1},
    2.15 +  number={1},
    2.16 +  pages={25--42},
    2.17 +  year={1993},
    2.18 +}
    2.19 +@article{PerfBlameShifting12,
    2.20 +  title={Using Sampling to Understand Parallel Program Performance},
    2.21 +  author={Tallent, N.R. and Mellor-Crummey, J.},
    2.22 +  journal={Tools for High Performance Computing 2011},
    2.23 +  pages={13--25},
    2.24 +  year={2012},
    2.25 +  publisher={Springer}
    2.26 +}
    2.27 +@inproceedings{HWSupportStanford10,
    2.28 + author = {Sanchez, Daniel and Yoo, Richard M. and Kozyrakis, Christos},
    2.29 + title = {Flexible architectural support for fine-grain scheduling},
    2.30 + booktitle = {Proceedings of ASPLOS on Architectural Support for Programming Languages and Operating Systems},
    2.31 + series = {ASPLOS '10},
    2.32 + year = {2010},
    2.33 + isbn = {978-1-60558-839-1},
    2.34 + location = {Pittsburgh, Pennsylvania, USA},
    2.35 + pages = {311--322},
    2.36 + numpages = {12},
    2.37 +} 
    2.38  @misc{PerfToolVTune,
    2.39      author = {Intel},
    2.40      howpublished = {http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/},