### changeset 85:972bdaea83b7

perf tune: update conclusion
author Nina Engelhardt Tue, 14 Aug 2012 19:53:43 +0200 d5f65e311f64 6ddc44f95303 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 2 files changed, 47 insertions(+), 60 deletions(-) [+]
line diff
     1.1 Binary file 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf has changed

     2.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Tue Aug 14 05:53:06 2012 -0700
2.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Tue Aug 14 19:53:43 2012 +0200
2.3 @@ -56,7 +56,7 @@
2.4             {nengel@mailbox.tu-berlin.de}
2.5  \authorinfo{Sean Halle}
2.6             {Open Source Research Institute}
2.7 -           {seanhalle@OpenSourceResearchInstitute.org}
2.8 +           {seanhalle@opensourceresearchinstitute.org}
2.10             {TU Berlin}
2.12 @@ -82,36 +82,30 @@
2.13  \section{Introduction}
2.14  \label{sec:intro}
2.15
2.16 -Performance visualizations and tuning tools for parallel programs are critical to achieving good performance, and yet the experience of using current solutions leaves something to be desired. All too often, performance tuning consists of staring at abstract views of usage or statistics, trying to \emph{guess} the cause of performance loss. Despite many different views of performance measurements, such as frequency by line of code, message sends and receives, and core usage timelines, the user doesn't know why a function runs in a particular spot on a particular core, nor whether that is desired behavior or erroneous behavior.
2.17 +Performance visualization and tuning tools for parallel programs are critical to achieving good performance, and yet the experience of using current solutions leaves something to be desired. Often, performance tuning requires interpreting abstract views of usage or statistics, trying to \emph{guess} the cause of performance loss. Despite many different views of performance measurements, such as frequency by line of code, message sends and receives, and core usage timelines, the user doesn't know why a function runs in a particular spot on a particular core, nor whether that is desired behavior or erroneous behavior.
2.18
2.19 -Examining these myriad views can feel like being advised by the fabled six blind men examining an elephant by touching the part nearest to each\cite{PerfToolPoem}: one, reaching the side, says it is like a wall; the second, feeling the tusk, compares it to a spear; the next, touching the trunk, is reminded of a snake, and so on. All of these views are indeed correct and supply important information, but they don't connect to each other, to provide a view of how the parts interact to form a whole.
2.20 +Examining these myriad views can feel like being advised by the fabled six blind men examining an elephant by touching the part nearest to each\cite{PerfToolPoem}: one, reaching the side, compares it to a wall; the second, feeling the tusk, likens it to a spear; the next, touching the trunk, is reminded of a snake, and so on. All of these views are indeed correct and supply important information, but they don't connect to each other, to provide a view of how the parts interact to form a whole.
2.21
2.22 -Likewise, current tools may cover all the parts of the application code, but fail to adequately connect their observations to the runtime behavior, scheduling decisions, and consequent hardware behavior. The decision about which task or virtual processor is assigned to which core at what point in time is at the heart of parallel performance, so these choices need to be highlighted and connected to the application features influencing them.
2.23 +Likewise, current tools may cover all the parts of the application code, but they do not always adequately connect their observations to the runtime behavior, scheduling decisions, and consequent hardware behavior. The decision about which task or thread is assigned to which core at what point in time is at the heart of parallel performance, so these choices need to be highlighted and connected to the application features influencing them.
2.24
2.25 -In this paper we describe a model and framework that makes these connections, to make identifying the presence of less-than-optimal performance straight-forward, make it easier to generate a hypothesis for losses, and easier to link the hypothesis  to specific points in application code.
2.26 +Our approach provides a more complete picture by following a model of parallel computation that suggests what information to gather and provide to the user. According to the model, the relevant application features are work units and the execution constraints, and it is crucial to record and present the scheduling decisions taken by the runtime.
2.27
2.28 -Our approach provides a more complete picture by following a model of parallel computation that suggests what information to gather and provide to the user. The model suggests that the relevant application features are units of work and the constraints governing their execution, and that it is crucial to record and present  the decisions taken by the runtime on scheduling those units onto hardware. The model provides a structure within which to interpret these measurements, and thereby simplifies hypothesis generation by relating the measurements to each other and to the application.
2.29 -
2.30 +Our instrumentation detects the schedulable units in the application code, and visually links them to the units upon which its execution is conditional. The visual arrangement makes identifying idle cores immediate, enables quickly generating the correct hypotheses for the causes of the performance losses, and pinpointing where in the application code changes can be made to improve performance.
2.31
2.32 -Our instrumentation detects the schedulable units in the application code, and presents them visually in a way that indicates how causal activities related to the units result in the observed performance.  The effect is like having a map showing idle cores linked back to the spots in the application code responsible.  The visual arrangement makes identifying idle cores immediate, and  enables  quickly generating the correct hypotheses for the causes of the performance losses, and pinpointing where in the application code changes can be made to improve performance.
2.33 +The model of computation and our visual approach to using it is illustrated with a case study of performance tuning a standard parallel application on a large multi-core system.
2.34
2.35 -The model of computation and our visual approach to using it is illustrated with a story line of performance tuning a standard parallel application on a large multi-core system.
2.36 -
2.37 -We start with an overview of previous approaches in section \ref{sec:related}. We show usage of our visualizations through a case study in section \ref{sec:casestudy}, and then expand on the model behind it in section \ref{sec:theory}. Section \ref{sec:Implementation} will tie the model to implementation details. Finally, we conclude in section \ref{sec:conclusion}.
2.38 +We start with an overview of previous approaches in Section \ref{sec:related}. We show usage of our visualizations through a case study in Section \ref{sec:casestudy}, and then expand on the model behind it in Section \ref{sec:theory}. Section \ref{sec:Implementation} will relate the model to implementation details. Finally, we conclude in Section \ref{sec:conclusion}.
2.39
2.40  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2.41  \section{Background and Related Work}
2.42  \label{sec:related}
2.43
2.44 - When choosing a performance tuning tool, three primary factors to consider are:
2.45 -
2.46 +When choosing a performance tuning tool, three primary factors to consider are:
2.47  \begin{itemize}
2.48  \item the ease of recognizing performance loss
2.49  \item the ease of forming a hypothesis to explain the loss
2.50  \item the ease of linking the hypothesis to changes to make in application code
2.51  \end{itemize}
2.52 -
2.53 -
2.54
2.55  The mental model that underlies the tool plays a key role in these three usefulness factors. This can be seen  by examining the  activity of performance tuning.  It is an iterative activity. To start, the programmer takes measurements during execution and compares them to the desired outcome. A hypothesis is generated for what caused any differences, by using a mental model, of causal relationships within  execution. The mental model is then also used to link the hypothesis to code, to suggest a change to make to improve performance. After modification, the code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central during performance tuning.
2.56
2.57 @@ -138,18 +132,17 @@
2.58  \item it has flexibility to capture and represent any type of constraints
2.59  \item it works for any parallel programming library or language
2.60  \item it instruments only the runtime, once,  for all applications
2.61 -
2.62  \end{itemize}
2.63
2.64  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2.65 -\section{Illustrative Story of Performance Tuning}
2.66 +\section{Case Study: Analyzing an SSR Application}
2.67  \label{sec:casestudy}
2.68
2.69  To concretely illustrate the benefits of connecting measurements to an effective computation model, in this section we walk through a typical performance tuning session using our approach. This shows the features  in action, and indicates how having those features provide an advantage to the user over methods that lack them.
2.70
2.71 -We first describe the program and language used, and then the features of our visualization. After this preparation, we show a sequence of the visualizations. In each,  we point out how the  performance loss is identified, and which visual features suggest the hypothesis of the cause of the loss.  We show how the visual features direct the user to the specific sections of code that need to be changed, and how the model helps suggest what changes to try.
2.72 +We first describe the program and language used, and then the features of our visualization. After this preparation, we show a sequence of the visualizations. In each,  we point out how the  performance loss is identified, and which visual features suggest the hypothesis of the cause of the loss.  We show how the visual features direct the user to the specific sections of code that need to be changed, and how the model suggests what changes to try.
2.73
2.74 -\subsection{The Application, and Target Hardware}
2.75 +\subsection{Application and Target Hardware}
2.76
2.77  In our session, we wish to tune a standard program that the reader  knows well. The best example is likely matrix multiply, with which the reader should be familiar, allowing concentration on the tool without distraction about the application.
2.78
2.79 @@ -161,8 +154,8 @@
2.80
2.81  It then creates a results VP that receives a partial-result from each piece and accumulates the results. The  divider VP  then waits for the results VP to indicate completion, after which the language runtime shuts down.
2.82
2.83 -\subsection{The Language}
2.84 -The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors (VPs), which are equivalent to software threads'. The example application uses its commands for creating and destroying VPs, and  two kinds  of  send-receive paired operations.
2.85 +\subsection{Programming Model}
2.86 +The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors (VPs), which are equivalent to software threads. The example application uses commands for creating and destroying VPs, and  two kinds  of  send-receive paired operations.
2.87
2.88  The first, \emph{send\_from\_to} specifies both sender and receiver VPs. The results VP uses this to tell the divider VP that the work is complete. The second, \emph{send\_of\_type\_to}, specifies only a specific receiver, leaving the sender  anonymous. It has increased flexibility while maintaining some control over scope. The worker VPs use this  to send their partial-result to the results VP.
2.89
2.90 @@ -194,7 +187,7 @@
2.91  \end{figure}
2.92
2.93
2.94 -Fig \ref{fig:UCC_expl} shows two versions of a UCC, both stylized for purposes of explanation.  These are used less often than the SCG,  mainly as a helper when brainstorming ways to get more usable parallelism out of an application.
2.95 +Fig. \ref{fig:UCC_expl} shows two versions of a UCC, both stylized for purposes of explanation.  These are used less often than the SCG,  mainly as a helper when brainstorming ways to get more usable parallelism out of an application.
2.96
2.97  Both versions of the UCC depict constraints on units. Starting  at the top, they  progress downwards,  placing units that constrain  above those constrained by them, and connecting them by a line representing the constraint. A unit is placed into the row just below the last of  the units constraining it, in the first available column. However, the rows and columns are just visual conveniences; the only information in the UCC is the lines between units, which represent constraints among the units. In practice, each unit is also labeled with a unique unit ID, from which the code of the unit and the parameters passed to it can be looked up.
2.98
2.99 @@ -212,7 +205,7 @@
2.100
2.101
2.102
2.103 -Fig \ref{fig:SCG_expl} shows a consequence graph, stylized for purposes of explanation. This is the main visualization used during performance tuning, because it shows the consequent uses of the cores in the machine, and links those to the constraints and code in the application.
2.104 +Fig. \ref{fig:SCG_expl} shows a consequence graph, stylized for purposes of explanation. This is the main visualization used during performance tuning, because it shows the consequent uses of the cores in the machine, and links those to the constraints and code in the application.
2.105
2.106  It is composed of a number of columns, one for each core. A column represents time on the core,  increasing as one goes down, measured in clock cycles. It is broken into blocks, each representing the time accounted to one work-unit. Each block is further divided into regions, each a different color, which indicates the kind of activity the core was engaged in during that  region's time-span.
2.107
2.108 @@ -223,7 +216,7 @@
2.109    The second kind of activity is runtime overhead, represented by a gray region. This is the overhead spent on that particular work-unit. When desired by the user, it is further broken into pieces representing activities inside the runtime. The options include time spent on: constraints, when determining readiness of the work-unit; deciding which ready unit to assign to which hardware; and time spent switching from virtual processor, to the runtime, and back. In this paper, we show all runtime overhead lumped together, however in other circumstances a breakdown can be key to understanding interaction between runtime and application.
2.110
2.111
2.112 -The other type of visual feature seen in Fig \ref{fig:SCG_expl} is lines. Each represents a construct that influenced scheduling, where the color indicates what kind of construct.   A line represents two things: a constraint, whose satisfaction made the lower unit ready, and a decision by the runtime to start the lower unit on that core.
2.113 +The other type of visual feature seen in Fig. \ref{fig:SCG_expl} is lines. Each represents a construct that influenced scheduling, where the color indicates what kind of construct.   A line represents two things: a constraint, whose satisfaction made the lower unit ready, and a decision by the runtime to start the lower unit on that core.
2.114
2.115  In general, lines may also be drawn that represent other kinds of interactions, which affect core usage. For example,  our runtime implementation only allows one core at a time to access shared  scheduling state. Visualization of this can be turned on, as additional lines linking the gray runtime regions of blocks (visualization of such interactions is turned off in this paper).
2.116
2.117 @@ -235,14 +228,11 @@
2.118  Note that  many different orderings can be validly chosen. Which scheduler choices are valid is determined by three kinds of constraints: the application code constraints, hardware constraints, and runtime implementation imposed constraints.
2.119
2.120  The visual features allow the user to see at a glance the total  execution time (height), idle cores during the run (empty columns), cache behavior (color of work regions), degree of overhead (size of gray regions), and which units constrained which other units (lines). All consequence graphs in this paper are at the same scale, so they can be compared directly.
2.121 -
2.122 -
2.123 -
2.124  \begin{figure*}[t!]
2.125 -  \begin{minipage}[b]{0.2\textwidth}
2.126 -        \subfloat[35.8 Gcycles\\Original]
2.129 +  \begin{minipage}[b]{0.25\textwidth}
2.130 +        \hfill\subfloat[35.8 Gcycles\\Original]
2.133    \end{minipage}
2.134    \begin{minipage}[b]{0.7\textwidth}
2.135    % \subfloat[]
2.136 @@ -289,32 +279,30 @@
2.137  \subsection{Walk-through}
2.138  \label{subsec:walk-through}
2.139
2.140 +The scheduling consequence graphs that display the measurements collected on various runs during tuning are displayed in Fig. \ref{story}. All use the same scale, for direct comparison. They have 40 columns, one for each core, and relative height indicates relative execution time. The lines in red, orange, and green represent application-code constructs. Red is creation of a virtual processor, green is the many-to-one \texttt{send\_of\_type\_to}, and orange is the singleton construct. For better visibility, only constraints that cross cores are shown.
2.141
2.142
2.143 -The scheduling consequence graphs that display the measurements collected on various runs during tuning are displayed in Fig \ref{story}. All use the same scale, for direct comparison. They have 40 columns, one for each core, and relative height indicates relative execution time. The lines in red, orange, and green represent application-code constructs. Red is creation of a virtual processor, green is the many-to-one \texttt{send\_of\_type\_to}, and orange is the singleton construct. For better visibility, only constraints that cross cores are shown.
2.144 -
2.145 -
2.146 -After functional debugging, the first tuning run produces the consequence graph seen in Figure \ref{fig:story:a}. The first thing to notice is that it is slimmer than expected: of the 40 available cores, only 13 are being used. Because the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection reveals that a closing curly brace in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed despite using this application as a test program for development of the language runtime, including performance, for several months.
2.147 +After functional debugging, the first tuning run produces the consequence graph seen in Fig. \ref{fig:story:a}. The first thing to notice is that it is slimmer than expected: of the 40 available cores, only 13 are being used. Because the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection reveals that a closing curly brace in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed despite using this application as a test program for development of the language runtime, including performance, for several months.
2.148
2.149  \subsubsection{Second Run}
2.150 - After fixing this, the next run (Fig \ref{story:b}) corresponds much more to the expected execution behaviour. However, there remains a noticeable section at the beginning where only 3 cores have work and the other 37 remain idle.
2.151 + After fixing this, the next run (Fig. \ref{story:b}) corresponds much more to the expected execution behaviour. However, there remains a noticeable section at the beginning where only 3 cores have work and the other 37 remain idle.
2.152
2.153 -Zooming in on those cores, we see that creation code starts running on core 0, within the creation VP, and then the next block on the core is work. Creation stops, starving the other cores. Looking at the creation code, we see that the creation VP assigns the first work VP to its own core, so that work is now waiting in the queue to execute there. When it creates the second work VP, that creation call switches core 0 to the runtime. When done with creation, the runtime takes the next VP from the queue, which is that waiting work VP. Hence core 0 does the work next instead of continuing with creation  (the merits of work stealing or other scheduling strategies are independent from this illustration of how to use this approach to performance tune).
2.154 +Zooming in on those cores, we see that creation code starts running on core 0, within the creation VP, and then the next block on the core is work. Creation stops, starving the other cores. Looking at the creation code, we see that the creation VP assigns the first work VP to its own core, so that work is now waiting in the queue to execute there. When it creates the second work VP, that creation call switches core 0 to the runtime. When done with creation, the runtime takes the next VP from the queue, which is that waiting work VP. Hence core 0 does the work next instead of continuing with creation (the merits of work stealing or other scheduling strategies are independent from this illustration of how to use this approach to performance tune).
2.155
2.156  The hypothesis was generated by looking at the code linked to each block and noting the visual pattern that creation code stopped running on core 0. Work code started running instead, and only after it finished did creation code start again. Hence, visual cues led directly to the hypothesis.
2.157
2.158  Two solutions come to mind: assign work to the other cores first, so that they would be busy when the creator VP's core switches to work, or else dedicate a core to the creator VP. The first solution has the advantage of preserving performance of the application even when run on a machine with a single-digit number of cores, so we tried it first.
2.159
2.160 -\subsubsection{Third run} Assigning work to the other cores first gives us Fig \ref{story:c}. The section that was at the top, with idle cores, has disappeared. However, an idle period can still be observed between the first and the second set of work units. The work units have roughly the same length and  the work on core 0 starts last, so finishing the core 0 work holds up creation until after the other cores have finished their work (note that work on some cores takes slightly longer because that core performs the copy-transpose singleton, and also variations are caused by cache misses).
2.161 +\subsubsection{Third run} Assigning work to the other cores first gives us Fig. \ref{story:c}. The section that was at the top, with idle cores, has disappeared. However, an idle period can still be observed between the first and the second set of work units. The work units have roughly the same length and  the work on core 0 starts last, so finishing the core 0 work holds up creation until after the other cores have finished their work (note that work on some cores takes slightly longer because that core performs the copy-transpose singleton, and also variations are caused by cache misses).
2.162
2.163    It is also noticeable that in the second set of work units, not enough work pieces remain to fill all cores. 16 out of 40 remain idle at the bottom.
2.164
2.165  \subsubsection{Fourth and fifth runs} To try to fill the empty columns at the end, we modified the size of the work units. However, as figures \ref{story:d} and \ref{story:e} show, this did not help. The blank areas between bands'' of work can be seen by the red lines to be due to  creation. The increased number of units causes creation to be the bottleneck again, and the time lost between sets grows larger than the time that previously was lost.
2.166
2.167  \subsubsection{Sixth run} At this point we wanted to try the road not chosen, dedicating a core to the creation VP.
2.168 -Going back to version b of the code and implementing this solution, instead, leads to fig. \ref{story:f}. The blank area between the two sets has disappeared, showing a 4\% shorter execution time.
2.169 +Going back to version b of the code and implementing this solution, instead, leads to Fig. \ref{story:f}. The blank area between the two sets has disappeared, showing a 4\% shorter execution time.
2.170
2.171 -\subsubsection{Seventh and eighth runs}Core 0 is now empty after the creation phase at the beginning, so we  moved the receive VP there (fig. \ref{story:g}). This  only minimally improved the time, at this size of work unit, but allows overlapping the result collection with other work, which is an advantage when cutting  the work into more pieces, requiring longer collection as in Fig. \ref{story:h}.
2.172 +\subsubsection{Seventh and eighth runs}Core 0 is now empty after the creation phase at the beginning, so we  moved the receive VP there (Fig. \ref{story:g}). This  only minimally improved the time, at this size of work unit, but allows overlapping the result collection with other work, which is an advantage when cutting  the work into more pieces, requiring longer collection as in Fig. \ref{story:h}.
2.173
2.174  Overall it is also noticeable that as work units become smaller, execution time becomes more irregular. Variability in work length correlates with the color, indicating cache behavior has worsened with smaller work size.
2.175
2.176 @@ -350,7 +338,7 @@
2.177    \label{fig:UCC_example}
2.178  \end{figure}
2.179
2.180 -    We call a fully specified UCC a \emph{concrete} UCC.  Every run of an application eventually winds up defining a concrete UCC, such as produced for the performance tuning, seen  in Fig \ref{fig:UCC_example}. In it, every unit scheduled in the SCG appears,  along with  the application-defined constraints on scheduling them.  For this application, parameters determine how the work is divided, and so determine the units. Hence, the application alone does not specify the concrete UCC, because the parameter values must be known before the units can be determined. In general, the amount of UCC made concrete by the application alone falls into a two-dimensional grid. One dimension covers the units, the other the constraints.
2.181 +    We call a fully specified UCC a \emph{concrete} UCC.  Every run of an application eventually winds up defining a concrete UCC, such as produced for the performance tuning, seen  in Fig. \ref{fig:UCC_example}. In it, every unit scheduled in the SCG appears,  along with  the application-defined constraints on scheduling them.  For this application, parameters determine how the work is divided, and so determine the units. Hence, the application alone does not specify the concrete UCC, because the parameter values must be known before the units can be determined. In general, the amount of UCC made concrete by the application alone falls into a two-dimensional grid. One dimension covers the units, the other the constraints.
2.182
2.183
2.184  \begin{figure}[ht]
2.185 @@ -360,7 +348,7 @@
2.186    \label{fig:UCC_Concreteness}
2.187  \end{figure}
2.188
2.189 -Figure \ref{fig:UCC_Concreteness} shows the two axes and the four sets of additional information that have to be added to the UCC in order to determine the units and constraints in the final fully concrete UCC. The position an application-derived UCC, or later in the life-line UCC, lands on the grid indicates how far it is from being fully concrete.  The horizontal   indicates what inputs are still needed to determine the units, and vertical the constraints.  0 indicates that the units (constraints) are fully determined by the application code alone; 1 means parameter values also must be known; 2 means input data values also play a role, and 3 means the units (constraints) can only become known  after runtime scheduling decisions have been made.
2.190 +Fig. \ref{fig:UCC_Concreteness} shows the two axes and the four sets of additional information that have to be added to the UCC in order to determine the units and constraints in the final fully concrete UCC. The position an application-derived UCC, or later in the life-line UCC, lands on the grid indicates how far it is from being fully concrete.  The horizontal   indicates what inputs are still needed to determine the units, and vertical the constraints.  0 indicates that the units (constraints) are fully determined by the application code alone; 1 means parameter values also must be known; 2 means input data values also play a role, and 3 means the units (constraints) can only become known  after runtime scheduling decisions have been made.
2.191
2.192  The closer the application-derived UCC is to the origin, the less additional information is needed to obtain a  concrete UCC descendant of it. The UCC labeled A in the figure is fully concrete just from the source code alone (representing for example, matrix multiply with fixed size matrices and fixed division). The UCC labeled B requires the input data and parameters to be specified before its units are concrete, but just parameters to make its constraints fully concrete (as per ray-tracing, with bounce depth specified as a parameter). The UCC labeled C only has variability in its constraints, which require input data (for example, H.264 motion vectors).
2.193  But even the least concrete UCC, out at the end of the diagonal (D in the figure), generates a concrete descendant during a run of the application.
2.194 @@ -386,7 +374,7 @@
2.195
2.196  There is one kind of region for each reason that the core is being used (or being forced idle), and several kinds of arcs, one for each type of causality between regions.
2.197
2.198 -The core activities of the region types are: application work, idling while waiting for communication of work data, managing constraints, choosing assignment of work onto cores, and runtime internals. The runtime internals in turn have sub-categories, which we will not detail here, but some of which will be mentioned in section \ref{sec:Implementation}.
2.199 +The core activities of the region types are: application work, idling while waiting for communication of work data, managing constraints, choosing assignment of work onto cores, and runtime internals. The runtime internals in turn have sub-categories, which we will not detail here, but some of which will be mentioned in Section \ref{sec:Implementation}.
2.200
2.201  There is an arc-type for each category of causality. The arc-types (causality categories) represent: sequential dependency in the base language being satisfied, parallel constraint being satisfied (i.e., one unit did something to satisfy a constraint on the other, causing it to be free to be scheduled), runtime internal causality such as a global lock (runtime on one core releases the lock, causing the other to acquire it), and arcs that represent hardware causal relationships (one work-unit finishing on a core causes another work-unit to start there, given the choice by the runtime). The formal details are given in the longer format companion paper.
2.202
2.203 @@ -420,7 +408,7 @@
2.204
2.205  The application in this example has been written in a language that allows work to be divided. The toolchain inserted a manipulator that allows each runtime to divide up the work it is given into smaller work-units, such as via the DKU pattern \cite{DKUTechRep_09}. This pushed the UCC of the application all the way to the right on the unit axis.
2.206
2.207 -So what does the concrete UCC produced during a run look like? Well, a unit is defined as the work resulting from one scheduling decision. Each runtime has its own scheduler, which means units are defined for each runtime.  That in turn means that each runtime has its own concrete UCC!
2.208 +So what does the concrete UCC produced during a run look like? Well, a unit is defined as the work resulting from one scheduling decision. Each runtime has its own scheduler, which means units are defined for each runtime.  That in turn means that each runtime has its own concrete UCC.
2.209
2.210
2.211  \begin{figure}[ht]
2.212 @@ -430,7 +418,7 @@
2.213    \label{fig:UCC_Levels}
2.214  \end{figure}
2.215
2.216 -Figure \ref{fig:UCC_Levels} shows that these UCCs are related to each other in the same hierarchy as the runtimes. A unit scheduled in one runtime is broken into smaller units in the one below it. Each of those units is  then separately scheduled, making a separate UCC just for them. So, as the figure shows, a unit in one UCC has an entire UCC inside it.
2.217 +Fig. \ref{fig:UCC_Levels} shows that these UCCs are related to each other in the same hierarchy as the runtimes. A unit scheduled in one runtime is broken into smaller units in the one below it. Each of those units is  then separately scheduled, making a separate UCC just for them. So, as the figure shows, a unit contains another entire UCC.
2.218
2.219   %Great, that makes sense, now what about the consequence graphs?
2.220
2.221 @@ -443,9 +431,9 @@
2.222  Hence, the question becomes,  when can the upper-level work unit be considered completed? The answer is, when all the lower-level units of which it is comprised are completed, plus any additional work spent dividing up the work and combining results. Once the lower-level consequence graph is established, this  time can be easily determined: it is the critical path across the lower-level consequence graph plus the additional, non-overlapped, time spent creating the lower-level units and combining results into the format of the higher-level unit. The divide and recombine are the creation of    the lower level, and so not part of it.
2.223
2.224
2.225 -Consider the concrete example of the SSR matrix multiply application, from the story in section \ref{sec:casestudy}. To start we go up a level, above the process created to execute the matrix multiply. Here, a unit is the execution of an entire application, invoked via OS commands. The  work-time of that unit is the execution time of the application, from the first instruction of the application code until the last. This time equals  the critical path through the matrix multiply work, plus creation of the various VPs, and collection of the results by the results VP.
2.226 +Consider the concrete example of the SSR matrix multiply application from Section \ref{sec:casestudy}. To start we go up a level, above the process created to execute the matrix multiply. Here, a unit is the execution of an entire application, invoked via OS commands. The  work-time of that unit is the execution time of the application, from the first instruction of the application code until the last. This time equals  the critical path through the matrix multiply work, plus creation of the various VPs, and collection of the results by the results VP.
2.227
2.228 -If the story in section \ref{sec:casestudy} were done in hierarchical fashion, the SCG seen there would be divided between levels, with creation and results accumulation  moved to the higher application-level SCG. The lower-level would only contain the work of the multiply.
2.229 +If the example was analyzed in hierarchical fashion, the SCG seen there would be divided between levels, with creation and results accumulation  moved to the higher application-level SCG. The lower-level would only contain the work of the multiply.
2.230
2.231  One twist is that the divide and recombine are themselves composed of  smaller work-units.  However, they are not visible to units in the work portion, although there are causalities between them. The divide units end with creation of work-level units, while the end of work-level units causes creation of recombine units.
2.232  Hence, divide and recombine are  inside of  the higher-level unit while not part of the lower-level SCG. They can be conveniently displayed in a single visualization because of the clean connection between levels provided by the create and end causalities.
2.233 @@ -454,7 +442,7 @@
2.234
2.235
2.236
2.237 -To complete the discussion, we consider going down another level, into the physical processor.  In modern super-scalar multi-cores, each core is a parallel processor composed of functional blocks such as arithmetic or floating-point blocks, along with logic that issues instructions to those blocks. We first break down a single  SSR work unit of the kind seen in section \ref{sec:casestudy}, into smaller work units, namely individual instructions. Then we break instructions down into pipeline stages.
2.238 +To complete the discussion, we consider going down another level, into the physical processor.  In modern super-scalar multi-cores, each core is a parallel processor composed of functional blocks such as arithmetic or floating-point blocks, along with logic that issues instructions to those blocks. We first break down a single  SSR work unit of the kind seen in Section \ref{sec:casestudy}, into smaller work units, namely individual instructions. Then we break instructions down into pipeline stages.
2.239
2.240  The shape of the UCC for the instructions  inside a single SSR level work-unit is a straight line of sequential dependencies. However, the pipeline stages inside a core modify the UCC. They analyse the register use dependencies between instructions and generate a new UCC consisting of  functions constrained by the data dependencies. The dependencies are enforced by the issue logic, which  goes on to assign ready functions to the functional blocks.  The choices made by the issue logic  determine the shape of the SCG for this level, and the critical path of that SCG is the work time of the higher, SSR level, work-unit.
2.241
2.242 @@ -551,7 +539,7 @@
2.243  \subsubsection{Recording time, instructions, and cache misses }
2.244   Just recording the units and connections between them is not enough for the SCG. Because it represents core usage, the SCG also needs  the cycles spent on each activity to be collected, including internal runtime activities. The size of each interval of core usage is recorded and assigned to a  segment of some  unit's life-line.
2.245
2.246 -The UCC also makes use  of the number of instructions in a unit, as an estimate of size of work in the unit, as illustrated by Fig \ref{fig:UCC_expl}. Without knowing the relative size of the units, it is hard to estimate the amount of parallelism \emph{usefully} available in the application.
2.247 +The UCC also makes use  of the number of instructions in a unit, as an estimate of size of work in the unit, as illustrated by Fig. \ref{fig:UCC_expl}. Without knowing the relative size of the units, it is hard to estimate the amount of parallelism \emph{usefully} available in the application.
2.248
2.249  To measure the instructions, cycles, and communication (cache misses), we use hardware performance counters. Readings are inserted into the runtime code to capture core time spent on each segment of the life-line of a unit:
2.250  \begin{enumerate}
2.251 @@ -588,8 +576,7 @@
2.252  \subsubsection{UCC}
2.253  For the UCC, units can be either unweighted or weighted. Weighted units appear as rectangles with height proportional to the weight, unweighted units appear as circles. We weight the units with the number of instructions in the work. This is not entirely execution-independent, but removes some of the influence of scheduling and data, such as cache misses.
2.254
2.255 -A critical path algorithm calculates vertical position of a unit, as its position within the critical path. The nodes are spread horizontally such that none overlap.
2.256 -
2.257 +A critical path algorithm calculates vertical position of a unit, as its position within the critical path. The nodes are spread horizontally such that none overlap.
2.258  Simple constraints (dependencies) are painted as arcs. Complicated constraints are for now displayed as an additional node bearing information on the constraint, with incoming arcs from all units whose execution status affects the constraint and outgoing arcs to the constrained units.
2.259
2.260
2.261 @@ -601,19 +588,19 @@
2.262
2.263  All this information is taken purely from the runtime, leading to a rich, configurable visualization without needing to add anything to the application.
2.264
2.265 -
2.266 -
2.267 -
2.268 -
2.269 -
2.270  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2.271  \section{Conclusion}
2.272  \label{sec:conclusion}
2.273 -We have shown how to apply a computation model to instrument a language runtime for collecting measurements that connect: each measurement to others, to application structure, to scheduling decisions, and to hardware. A simple visualization of the data has features that indicate lost performance, and features that visually link the lost performance to the cause, no matter if the cause is application structure, language runtime implementation, or hardware feature.  It is this linkage, due to the computation model, that sets this approach apart from others.
2.274 +We have examined the application of a generalized model of parallel computation to build adaptable performance visualizations, relying only on information collected through instrumenting the language runtime, with no modification to the application.
2.275 +The approach is demonstrated through the case study of instrumenting the SSR message-passing language runtime and using it to tune a parallel application.
2.276 +
2.277 +The resulting visualizations show that the focus on the parallelism-relevant concepts of work units and constraints on their execution allows a clearer view of parallelism-specific issues.
2.278 +By integrating visual display of constraints stemming from application structure, language runtime implementation, or hardware features, the various possible causes for performance loss are covered. A flexible filtering system for different types of constraints avoids overcharging the display.
2.279 +
2.280 +Any approach relying only on information available to the runtime is intrinsically limited by the expressiveness of the parallel programming model. We expect that even better results will be observed for high-level'' parallel languages that more closely match application concepts instead of hardware concepts.
2.281
2.282  \bibliography{bib_for_papers_12_Jy_15}
2.283
2.284 -
2.285  \end{document}
2.286
2.287  ===============
2.288 @@ -732,7 +719,7 @@
2.289
2.290  In addition to identifying the units and the relations between them, we need to capture how much work a unit represents and how many resources (mainly core-time) it occupies.
2.291
2.292 -For performance tuning it is obviously critical to know the actual performance, but even for the UCC these measurements are important. Indeed, as evidenced by Fig [fig:UCC_expl], without knowing the relative size of the units, it is hard so estimate the amount of parallelism in the application.
2.293 +For performance tuning it is obviously critical to know the actual performance, but even for the UCC these measurements are important. Indeed, as evidenced by Fig. [fig:UCC_expl], without knowing the relative size of the units, it is hard so estimate the amount of parallelism in the application.
2.294
2.295  We use hardware performance counters to measure for each unit the work (=time spent in application code) and the overhead (=time spent in runtime) in cycles and in instructions, as well as the number of last level cache misses during the work. The counter values are retrieved in the runtime before and after several sections in the lifetime of a unit:
2.296  \begin{itemize}
2.297 @@ -947,7 +934,7 @@
2.298  Hence, each step of performance debugging involves several aspects, including mental model of computation, application code, runtime implementation, scheduling choices, and hardware. In order to be effective, a tool used during performance tuning must be part of a complete model of computation that ties all aspects of the debugging/tuning steps together. Current tools fall short, both because they lack an encompassing model of computation, and because the tools are isolated from each other. Without integration, the user gets an incomplete picture of the computation and must resort to guesses either of where the problem lies or of what to do to fix it.
2.299
2.300
2.301 -We introduce in this paper a model of computation that ties all aspects of performance together, along with instrumentation and visualization that is guided by the model and that links all relevant performance tuning information together. The model and visualization tools are illustrated with a story line, which shows how they are used to performance tune the standard matrix-multiply application on two multi-core systems.
2.302 +We introduce in this paper a model of computation that ties all aspects of performance together, along with instrumentation and visualization that is guided by the model and that links all relevant performance tuning information together. The model and visualization tools are illustrated with a case study, which shows how they are used to performance tune the standard matrix-multiply application on two multi-core systems.
2.303
2.304  Although we use standard visualization techniques [cite], our approach differs from previous work in both theoretical and practical aspects. The theory we  use is The Holistic Model of Parallel Computation, which ties together parallelism construct semantics with scheduling choices made during a run, and specific measurements made on the cores.  When put into practice, new kinds of measurements are taken, which complete the picture presented to the user, and each measurement is tied to a specific segment of code. The resulting combination not only identifies each source of performance loss, but ties it back to specific causes and suggests precise fixes, all of which is illustrated in our story line.
2.305
`