### changeset 84:d5f65e311f64

perf tuning -- bit more massaging, dropped to 10 pages
author Sean Halle Tue, 14 Aug 2012 05:53:06 -0700 78a1ee9b06f1 972bdaea83b7 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 1 files changed, 26 insertions(+), 26 deletions(-) [+]
line diff
     1.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Tue Aug 14 03:28:16 2012 -0700
1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Tue Aug 14 05:53:06 2012 -0700
1.3 @@ -90,7 +90,7 @@
1.4
1.5  In this paper we describe a model and framework that makes these connections, making it straightforward to identify the presence of less-than-optimal performance, easier to generate a hypothesis for losses, and easier to link the hypothesis to specific points in application code.
1.6
1.7 -Our approach provides a more complete picture by following a model of parallel computation that suggests what information to gather and provide to the user. The model suggests that the relevant application features are units of work and the constraints governing their execution, and that it is crucial to record and present  the decisions taken by the runtime on scheduling those units onto hardware. The model provides a structure to which to attach these measurements captured during execution, that simplifies hypothesis generation by relating the measurements to each other and to the application.
1.8 +Our approach provides a more complete picture by following a model of parallel computation that suggests what information to gather and provide to the user. The model suggests that the relevant application features are units of work and the constraints governing their execution, and that it is crucial to record and present  the decisions taken by the runtime on scheduling those units onto hardware. The model provides a structure within which to interpret these measurements, and thereby simplifies hypothesis generation by relating the measurements to each other and to the application.
1.9
1.10
1.11  Our instrumentation detects the schedulable units in the application code, and presents them visually in a way that indicates how causal activities related to the units result in the observed performance.  The effect is like having a map showing idle cores linked back to the spots in the application code responsible.  The visual arrangement makes identifying idle cores immediate, and  enables  quickly generating the correct hypotheses for the causes of the performance losses, and pinpointing where in the application code changes can be made to improve performance.
1.12 @@ -113,13 +113,13 @@
1.13
1.14
1.15
1.16 -The mental model that underlies the tool plays a key role in these three usefulness factors. This can be seen  by examining the  activity of performance tuning.  It is an iterative process. The programmer takes measurements during execution and compares them to the desired outcome. A hypothesis is generated for what caused any differences, by using a mental model, of causal relationships within  execution. The mental model is then also used to link the hypothesis to code, to suggest a change to make to improve performance. After modification, the code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central to performance tuning.
1.17 +The mental model that underlies the tool plays a key role in these three usefulness factors. This can be seen by examining the activity of performance tuning. It is an iterative activity. To start, the programmer takes measurements during execution and compares them to the desired outcome. A hypothesis is generated for what caused any differences, using a mental model of causal relationships within the execution. The mental model is then also used to link the hypothesis to code, to suggest a change to make to improve performance. After modification, the code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central during performance tuning.
1.18
1.19 -As a result of its role when using  a tuning tool, the mental model underlying the tool is   a major focus when comparing  tools.  The main way  our approach differs  is its mental model, which is based on a new model of parallel computation. The new model makes it  easier to form a hypothesis of the cause of performance losses, and easier to link that back to changes to make in application code.
1.20 +As a result of its role when using  a tuning tool, the mental model underlying the tool is   a major focus when comparing  tools.  The mental model of our approach is based on a new model of parallel computation and is the main way  our approach differs. The new model makes it  easier to form a hypothesis of the cause of performance losses, and easier to link that back to changes to make in application code.
1.21
1.22 - Historically, a dominant mental model for parallel computation has been the software-thread abstraction. It is very close to the hardware, without much higher-level semantics. This makes it  a problem when analyzing parallel performance. Yet as the earliest and most widespread parallel programming model, and one that serves as the basis on top of which many other models are implemented, it must be supported.
1.23 + Historically, the dominant mental model for parallel computation has been the software-thread abstraction. The model is very close to the hardware, without much higher-level semantics. This makes it  a problem when analyzing parallel performance. Yet as the earliest and most widespread parallel programming model, and one that serves as the basis on top of which many other models are implemented, it must be supported.
1.24
1.25 -A popular version is  pthreads, which introduces randomness through OS time-slicing, and whose synchronization constructs  imply only indirect connections. For instance, very little conclusion can be drawn about the relationship between the computations in two separate threads from their accessing the same lock consecutively. Tools such as Paradyn \cite{PerfToolParadyn} or VTune \cite{PerfToolVTune} that rely on this model consequently have a hard time connecting measurements to the application code details. They collect a wealth of statistics, but the application is to the tool a foreign process, where things ``just happen'' without causal links between events and code. Such tools generate conclusions such as ``application bottleneck is synchronisation'', and  ``too much time is spent in spinlocks'', without help as to what in the code causes this. If there is more than one use of spinlocks in the application, it is not clear what point in the code is the problem.
1.26 +A popular version is pthreads, which introduces randomness through OS time-slicing, and whose synchronization constructs imply only indirect connections. For instance, little can be concluded about the relationship between the computations in two separate threads from the fact that they access the same lock consecutively. Consequently, tools such as Paradyn \cite{PerfToolParadyn} or VTune \cite{PerfToolVTune} that rely on this model have a hard time connecting measurements to the application code details. They collect a wealth of statistics, but the application is to the tool a foreign process, where things ``just happen'' without causal links between events and code. Such tools generate conclusions such as ``application bottleneck is synchronisation'' and ``too much time is spent in spinlocks'', without help as to what in the code causes this. For example, if there is more than one use of spinlocks in the application, it is not clear what point in the code is the problem.
1.27
1.28  One fix to these issues is to allow the users to introduce measuring points into their own code, at the cost of increased programmer effort.  Because instrumentation code is written in the source language, it has access to application concepts. This advantage can be kept with automated instrumentation, by providing an instrumenting compiler, like the Tau \cite{PerfToolTau} project does.
1.29
1.30 @@ -127,11 +127,11 @@
1.31
1.32  Higher level parallel abstractions help alleviate that particular problem. For instance, in an application with MPI message passing \cite{MPI}, the information ``thread 2 spends little time waiting for messages from thread 0 but a lot of time waiting for messages from thread 1'' can be recorded, where in pthreads only ``thread 2 spends a lot of time waiting for a signal'' would be visible. It is much easier to reach the conclusion that the bottleneck is the slow rate at which thread 1 produces data.
1.33
1.34 -Leveraging the MPI library is a well-trodden path, with many tools available \cite{PerfToolParaver} \cite{PerfToolVampir} \cite{PerfToolParadyn} \cite{PerfToolTau}. A frequent complaint of users of these tools is that in complex programs, the large number of messages sent makes it difficult to see problems. As there is little that distinguishes messages besides size, it is difficult to implement good filtering. Once again, we find that the limitations come from a parallel model that captures insufficient information.
1.35 +Leveraging the MPI library is a well-trodden path, with many tools available \cite{PerfToolParaver} \cite{PerfToolVampir} \cite{PerfToolParadyn} \cite{PerfToolTau}. But a frequent complaint of users of these tools is that in complex programs, the large number of messages sent makes it difficult to see problems. As there is little that distinguishes messages besides size, it is difficult to implement good filtering. Once again, we find that the limitations come from a parallel model that captures insufficient information.
1.36
1.37  Fortunately, parallel languages are evolving in the same direction, and it is widely believed that semantically rich parallel constructs, when they match the structure of the application well, are key to improved productivity. With languages such as CnC \cite{CnCInHotPar} and StarSs \cite{StarSs}, the information available to the language runtime offers a much clearer picture of the constraints  placed on the execution of tasks and where they come from. Consequently, the application structure can be well reflected in visualizations, as the Temanejo debugger  \cite{PerfToolStarSs} for StarSs demonstrates.
1.38
1.39 -In the following section, we will demonstrate how our approach overcomes the various challenges of these approaches through a combination of the following features:
1.40 +In the following section, we will demonstrate how our approach overcomes the various challenges of these other approaches through a combination of the following features:
1.41  \begin{itemize}
1.42  \item  it visualizes all the factors influencing scheduling decisions, given sufficient language and runtime prerequisites
1.43  \item it has a filtering mechanism that shows only the most relevant groups of constraints, as desired
1.44 @@ -145,7 +145,7 @@
1.45  \section{Illustrative Story of Performance Tuning}
1.46  \label{sec:casestudy}
1.47
1.48 -In this section, we illustrate the benefits of connecting measurements to an effective computation model by walking through a typical performance tuning session using our approach. It shows the features  in action, and indicates how having those features provide an advantage to the user over methods that lack them.
1.49 +To concretely illustrate the benefits of connecting measurements to an effective computation model, in this section we walk through a typical performance tuning session using our approach. This shows the features in action, and indicates how having those features provides an advantage to the user over methods that lack them.
1.50
1.51  We first describe the program and language used, and then the features of our visualization. After this preparation, we show a sequence of the visualizations. In each,  we point out how the  performance loss is identified, and which visual features suggest the hypothesis of the cause of the loss.  We show how the visual features direct the user to the specific sections of code that need to be changed, and how the model helps suggest what changes to try.
1.52
1.53 @@ -153,7 +153,7 @@
1.54
1.55  In our session, we wish to tune a standard program that the reader knows well. The best example is likely matrix multiply, with which the reader should be familiar, allowing the reader to concentrate on the tool without being distracted by the application.
1.56
1.57 -We run it on a machine with 4 sockets by 10 cores each, for a total of 40 physical cores. They are Intel WestmereEx cores running at 3.0GHz, with TurboBoost turned off.
1.58 +We run it on a machine with 4 sockets of 10 cores each, for a total of 40 physical cores. They are Intel Westmere-EX cores running at 3.0GHz, with TurboBoost turned off for reproducibility.
1.59
1.60  The application code uses language features to create virtual processors (VP). The first VP created divides the work into a number of pieces and creates a new VP for each piece.
1.61
1.62 @@ -162,11 +162,11 @@
1.63  It then creates a results VP that receives a partial-result from each piece and accumulates the results. The  divider VP  then waits for the results VP to indicate completion, after which the language runtime shuts down.
1.64
1.65  \subsection{The Language}
1.66 -The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors (VPs), which are more commonly known as threads'. It has commands for creating and destroying VPs, and three kinds of send-receive paired operations.
1.67 +The language used is SSR, which is based on rendezvous-style send and receive operations made between virtual processors (VPs), which are equivalent to software ``threads''. The example application uses its commands for creating and destroying VPs, and two kinds of send-receive paired operations.
1.68
1.69 -The first, \emph{send\_from\_to} specifies both sender and receiver VPs. It is used by the results VP to tell the divider VP that the work is complete. The second, \emph{send\_of\_type\_to}, specifies only a specific receiver, leaving the sender  anonymous, which increases flexibility while maintaining some control over scope. This  is used by the worker VPs doing the pieces to send their partial-result to the results processor. The third kind, \emph{send\_of\_type}, only specifies the type, and so acts as a global communication channel; this is not used in our application.
1.70 +The first, \emph{send\_from\_to}, specifies both sender and receiver VPs. The results VP uses this to tell the divider VP that the work is complete. The second, \emph{send\_of\_type\_to}, specifies only a specific receiver, leaving the sender anonymous. This increases flexibility while maintaining some control over scope. The worker VPs use this to send their partial-result to the results VP.
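The anonymous-sender pattern behind \emph{send\_of\_type\_to} can be pictured with ordinary thread primitives. The sketch below is our own Python analogy, not SSR's implementation: each worker sends its partial result to a queue owned by the results receiver, so the receiver names no particular sender.

```python
# Sketch of the send_of_type_to pattern described above: workers send
# partial results to a receiver-owned channel, staying anonymous.
# A stand-in analogy only; SSR's runtime differs.
import threading
import queue

def run_workers_and_collect(pieces):
    to_results = queue.Queue()   # channel owned by the results "VP"

    def worker(piece):
        partial = sum(piece)     # stand-in for the real per-piece work
        to_results.put(partial)  # sender identity is not part of the message

    threads = [threading.Thread(target=worker, args=(p,)) for p in pieces]
    for t in threads:
        t.start()
    # results "VP": accumulate exactly one partial result per worker
    total = sum(to_results.get() for _ in pieces)
    for t in threads:
        t.join()
    return total
```

A call such as `run_workers_and_collect([[1, 2], [3, 4]])` accumulates the two partial sums regardless of the order in which the workers finish.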
1.71
1.72 -The language also includes a \emph{singleton} construct that designates a piece of code as to be executed only once, which we use to  rearrange and copy data to get better cache behavior. A given copy is shared by several virtual processors on different cores, but the copy is only to be performed once.
1.73 +The language also includes a \emph{singleton} construct that designates a piece of code to be executed only once. It was used to shift work out of the divider, which is sequential, over to the worker-pieces, which spreads the work across many cores. The work is setup that many pieces share but that is to be performed only once.
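The singleton construct can be approximated with a once-only guard shared by the workers. The class below is a hypothetical Python analogy of the behavior described, not SSR's API: the first worker to arrive performs the shared setup, and every later arrival reuses the result.

```python
# Sketch: a run-once "singleton" shared by several workers, as an
# analogy for SSR's construct. Hypothetical names; not SSR code.
import threading

class Singleton:
    def __init__(self, setup):
        self._setup = setup          # the shared setup work, e.g. copy-transpose
        self._lock = threading.Lock()
        self._done = False
        self.result = None

    def ensure(self):
        # First caller performs the setup; all others just reuse the result.
        with self._lock:
            if not self._done:
                self.result = self._setup()
                self._done = True
        return self.result
```

In the application's terms, several worker VPs on different cores would each call `ensure()` before touching the shared copy, but the copy itself happens exactly once.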
1.74
1.75  Miscellaneous performance constructs are also available, such as one to force which core a virtual processor is assigned to. We use this in our example program to control scheduling.
1.76
1.77 @@ -189,14 +189,14 @@
1.78  \begin{figure}[ht]
1.79    \centering
1.80    \includegraphics[width = 3.2in, height = 1.8in]{../figures/UCC_stylized_for_expl.pdf}
1.81 -  \caption{Two versions of a stylized Unit and Constraint Collection (UCC). The left UCC displays only the logical connections among units, while the right UCC also shows the amount of work in a unit.}
1.82 +  \caption{Two versions of a stylized Unit and Constraint Collection (UCC). Both display the  logical connections among units, but the right UCC adds an indication of the amount of work in a unit.}
1.83    \label{fig:UCC_expl}
1.84  \end{figure}
1.85
1.86
1.87  Fig \ref{fig:UCC_expl} shows two versions of a UCC, both stylized for purposes of explanation.  These are used less often than the SCG,  mainly as a helper when brainstorming ways to get more usable parallelism out of an application.
1.88
1.89 -Both versions of the UCC depict constraints on units. Starting  at the top, they  progress downwards,  placing units that constrain being above those constrained by them, and connecting them by a line representing the constraint. A unit is placed into the row just below the last of  the units constraining it, in the first available column. However, the rows and columns are just visual conveniences; the only information in the UCC is the lines between units, which represent constraints among the units. In practice, each unit is also labeled with a unique unit ID, from which the code of the unit and the parameters passed to it can be looked up.
1.90 +Both versions of the UCC depict constraints on units. Starting  at the top, they  progress downwards,  placing units that constrain  above those constrained by them, and connecting them by a line representing the constraint. A unit is placed into the row just below the last of  the units constraining it, in the first available column. However, the rows and columns are just visual conveniences; the only information in the UCC is the lines between units, which represent constraints among the units. In practice, each unit is also labeled with a unique unit ID, from which the code of the unit and the parameters passed to it can be looked up.
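The placement rule just described can be sketched as a small routine. The graph representation and names below are illustrative, not taken from the tool's actual code; units are assumed to be listed in an order where every constraining unit precedes the units it constrains.

```python
# Illustrative sketch of the UCC layout rule described above: a unit
# goes in the row just below the lowest of its constraining units,
# in the first available column. Hypothetical names throughout.

def layout_ucc(units, constraints):
    """units: unit IDs, listed so predecessors come before successors.
    constraints: (above, below) pairs meaning `above` constrains `below`."""
    preds = {u: [] for u in units}
    for above, below in constraints:
        preds[below].append(above)

    row_of = {}
    occupied = {}   # row -> set of columns already used
    placement = {}
    for u in units:
        # one row below the last (lowest-placed) constraining unit
        row = 1 + max((row_of[p] for p in preds[u]), default=-1)
        cols = occupied.setdefault(row, set())
        col = 0
        while col in cols:   # first available column in that row
            col += 1
        cols.add(col)
        row_of[u] = row
        placement[u] = (row, col)
    return placement
```

For a diamond of four units where unit 1 constrains 2 and 3, and both constrain 4, this places 1 alone on top, 2 and 3 side by side, and 4 below them, matching the figure's layout convention.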
1.91
1.92  What distinguishes the two types of UCC is the size of the units. The left one shows only the logical graph-like structure of the application, while the right weights the height of a unit by the number of instructions executed in it. When information about unit size is available, it can visually indicate the \textit{useable} parallelism available.
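When unit sizes are known, the usable parallelism the weighted UCC makes visible can also be computed directly as total work divided by the critical-path length through the constraints. The sketch below assumes a DAG of instruction-weighted units with hypothetical names; it is a reading aid, not the tool's implementation.

```python
# Sketch: usable parallelism = total work / critical-path length,
# for units weighted by instruction count. Illustrative only.

def usable_parallelism(work, constraints):
    """work: dict unit -> instruction count, listed so every
    predecessor appears before its successors.
    constraints: (pred, succ) pairs from the UCC."""
    preds = {u: [] for u in work}
    for a, b in constraints:
        preds[b].append(a)

    finish = {}  # earliest finish time of each unit along the critical path
    for u in work:
        finish[u] = work[u] + max((finish[p] for p in preds[u]), default=0)

    span = max(finish.values())          # critical-path length
    return sum(work.values()) / span
```

For the diamond example with weights {a: 1, b: 4, c: 4, d: 1}, total work is 10 and the critical path is 1 + 4 + 1 = 6, so only about 1.7-way parallelism is usable despite two units being logically independent.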
1.93
1.94 @@ -248,10 +248,10 @@
1.95    % \subfloat[]
1.96    %  {\includegraphics[scale=0.015]{../figures/194.pdf}
1.98 -   \subfloat[13.0 Gcycles\\After fixing the load balancer]
1.99 +   \subfloat[13.0 Gcycles\\ after fixing the load balancer]
1.100      {\includegraphics[scale=0.015]{../figures/2.pdf}
1.102 -   \subfloat[11.0 Gcycles\\After changing so as to put work on core 1 first (solution 1)]
1.103 +   \subfloat[11.0 Gcycles\\ after changing so as to put work on core 1 first (solution 1)]
1.104      {\includegraphics[scale=0.015]{../figures/5.pdf}
1.106    % \subfloat[S1+divide factor 0.2]
1.107 @@ -260,19 +260,19 @@
1.108    % \subfloat[S1+divide factor 0.3]
1.109    %  {\includegraphics[scale=0.015]{../figures/209.pdf}
1.111 -   \subfloat[10.9 Gcycles\\plus changing the divide factor from 0.6 to  0.5]
1.112 +   \subfloat[10.9 Gcycles\\ after also changing the divide factor from 0.6 to  0.5]
1.113      {\includegraphics[scale=0.015]{../figures/6.pdf}
1.115 -   \subfloat[15.6 Gcycles\\further changing the divide factor to 0.4]
1.116 +   \subfloat[15.6 Gcycles\\ after further changing the divide factor to 0.4]
1.117      {\includegraphics[scale=0.015]{../figures/7.pdf}
1.119 -   \subfloat[10.4 Gcycles\\Going back to put divider VP onto its own core (Solution~2)]
1.120 +   \subfloat[10.4 Gcycles\\ after going back to put divider VP onto its own core (Solution~2)]
1.121      {\includegraphics[scale=0.015]{../figures/12.pdf}
1.123 -   \subfloat[10.3 Gcycles\\plus moving the receive VP to same core as divider VP]
1.124 +   \subfloat[10.3 Gcycles\\ after then moving the receive VP to the same core as the divider VP]
1.125      {\includegraphics[scale=0.015]{../figures/10.pdf}
1.127 -   \subfloat[9.7 Gcycles\\plus changing the divide factor to 0.4]
1.128 +   \subfloat[9.7 Gcycles\\ after additionally changing the divide factor to 0.4]
1.129      {\includegraphics[scale=0.015]{../figures/15.pdf}
1.131    % \subfloat[S2+divide factor 0.3]
1.132 @@ -291,7 +291,7 @@
1.133
1.134
1.135
1.136 -The scheduling consequence graphs which display the measurements collected on various runs during tuning are displayed in Fig \ref{story}, using the same scale, for direct comparison. They are 40 columns wide, one for each core, and relative height indicates relative execution time. The lines in red, orange, and green represent application-code constructs. Red is creation of a virtual processor, green is the many-to-one \texttt{send\_of\_type\_to}, and orange is the singleton construct. For better visibility, only constraints that cross cores are shown.
1.137 +The scheduling consequence graphs that display the measurements collected on various runs during tuning are displayed in Fig \ref{story}. All use the same scale, for direct comparison. They have 40 columns, one for each core, and relative height indicates relative execution time. The lines in red, orange, and green represent application-code constructs. Red is creation of a virtual processor, green is the many-to-one \texttt{send\_of\_type\_to}, and orange is the singleton construct. For better visibility, only constraints that cross cores are shown.
1.138
1.139
1.140  After functional debugging, the first tuning run produces the consequence graph seen in Figure \ref{fig:story:a}. The first thing to notice is that it is slimmer than expected: of the 40 available cores, only 13 are being used. Because the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection reveals that a closing curly brace in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed despite using this application as a test program for development of the language runtime, including performance, for several months.
1.141 @@ -303,18 +303,18 @@
1.142
1.143  The hypothesis was generated by looking at the code linked to each block and noting the visual pattern that creation code stopped running on core 0. Work code started running instead, and only after it finished did creation code start again. Hence, visual cues led directly to the hypothesis.
1.144
1.145 -Two solutions come to mind: assign work to the other cores first, so that they would be busy when the creator VP gets interrupted, or else dedicate a core to the creator VP. The first solution has the advantage of preserving performance of the application even when run on a machine with a single-digit number of cores, so we tried it first.
1.146 +Two solutions come to mind: assign work to the other cores first, so that they would be busy when the creator VP's core switches to work, or else dedicate a core to the creator VP. The first solution has the advantage of preserving performance of the application even when run on a machine with a single-digit number of cores, so we tried it first.
1.147
1.148 -\subsubsection{Third run} Assigning work to the other cores first gives us Fig \ref{story:c}. The section that was at the top, with idle cores, has disappeared. A small idle period can still be observed between the first and the second set of work tasks, because the work tasks have roughly the same length and  the work on core 0 starts last. It thus holds up creation, which re-starts after all the others have finished work (note that work on some cores takes slightly longer because that core performs the copy-transpose singleton, and also variations are caused by cache misses).
1.149 +\subsubsection{Third run} Assigning work to the other cores first gives us Fig \ref{story:c}. The section that was at the top, with idle cores, has disappeared. However, an idle period can still be observed between the first and the second set of work units. The work units have roughly the same length and  the work on core 0 starts last, so finishing the core 0 work holds up creation until after the other cores have finished their work (note that work on some cores takes slightly longer because that core performs the copy-transpose singleton, and also variations are caused by cache misses).
1.150
1.151 -  It is also noticeable that in the second set of work units to be distributed, not enough work pieces remain to fill all cores. 16 out of 40 remain idle at the bottom.
1.152 +  It is also noticeable that in the second set of work units, not enough work pieces remain to fill all cores. 16 out of 40 remain idle at the bottom.
1.153
1.154  \subsubsection{Fourth and fifth runs} To try to fill the empty columns at the end, we modified the size of the work units. However, as figures \ref{story:d} and \ref{story:e} show, this did not help. The red lines show that the blank areas between ``bands'' of work are due to creation. The increased number of units causes creation to be the bottleneck again, and the time lost between sets grows larger than the time that previously was lost.
1.155
1.156  \subsubsection{Sixth run} At this point we wanted to try the road not chosen, dedicating a core to the creation VP.
1.157  Going back to version b of the code and implementing this solution, instead, leads to fig. \ref{story:f}. The blank area between the two sets has disappeared, showing a 4\% shorter execution time.
1.158
1.159 -\subsubsection{Seventh and eighth runs}As core 0 is now empty after the creation phase at the beginning, we also moved the receive VP there (fig. \ref{story:g}). This added only a minimal improvement at this size of work unit, but allows overlapping the result collection with other work, which is an advantage when cutting  the work into more pieces, requiring longer collection (fig. \ref{story:h}).
1.160 +\subsubsection{Seventh and eighth runs} Core 0 is now empty after the creation phase at the beginning, so we moved the receive VP there (fig. \ref{story:g}). This only minimally improved the time at this size of work unit, but it allows overlapping the result collection with other work, an advantage when the work is cut into more pieces, requiring longer collection, as in Fig. \ref{story:h}.
1.161
1.162  Overall it is also noticeable that as work units become smaller, execution time becomes more irregular. Variability in work length correlates with the color, indicating that cache behavior has worsened with smaller work size.
1.163