## VMS/0__Writings/kshalle

### changeset 64:06073dc28f72

perf-tuning -- fixed up related work section
author Sean Halle Sat, 07 Jul 2012 21:43:23 -0700 832c3927511f 8bbaf1544726 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 1 files changed, 31 insertions(+), 31 deletions(-) [+]
line diff
     1.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Sat Jul 07 02:39:50 2012 -0700
1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Sat Jul 07 21:43:23 2012 -0700
1.3 @@ -104,47 +104,47 @@
1.4  \section{Background and Related Work}
1.5  \label{sec:related}
1.6
1.7 -A quick review of the process of performance tuning will provide much needed context for the ddisussion of other tools. %too negative
1.8 +A quick review of the process of performance tuning will provide much needed context for the discussion of other tools. %too negative
1.9
1.10 - Performance tuning is an iterative process that involves a mental model. The programmer takes measurements during execution that are then compared to the desired outcome. A mental model, constructed through experience and knowledge of the mechanics of execution, is used to generate a hypothesis explaining any discrepancies between the measurement and expectations. This hypothesis is then linked, again through a mental model, to things within the programmer's control, to suggest a change to make to the code. The modified code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program.
1.11 + Performance tuning is an iterative process that involves a mental model. The programmer takes measurements during execution that are then compared to the desired outcome. A mental model, constructed through experience and knowledge of the mechanics of execution, is used to generate a hypothesis explaining any discrepancies between the measurement and expectations. This hypothesis is then linked, again through a mental model, to things within the programmer's control, to suggest a change to make to the code. The modified code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central to performance tuning.
1.12
1.13 -For parallel performance tuning, the rest of the paper will support the position that the quantities best to measure are scheduling decisions and the consequent usage of communication hardware and cores. Hence, the mental model should have, as concepts,  units of scheduled work, and scheduling decisions made on them, then relate those to consequent hardware behavior.  It should also relate all of those to application code, such as code boundaries that identify units, and constraints (dependencies) the application places on scheduling decisions.
1.14 +The rest of the paper will support the position that the quantities best to measure when performance tuning are scheduling decisions and the consequent usage of communication hardware and cores. Hence, the mental model should have, as concepts,  units of scheduled work, and scheduling decisions made on them, then relate those to consequent hardware behavior.  The model should also relate all of those to application code, such as code boundaries that identify units, and constraints (dependencies) the application places on scheduling decisions.  This model is how the individual views become integrated into a view of the whole.
1.15
1.16 -With this in mind, we evaluate the advantages and shortcomings of five categories of models used for performance tuning. The first four approaches are found, sometimes in combinations, in most classic performance evaluation tools, while the fifth is starting to emerge, enabled by the growing adoption of task-based programming languages in recent years.
1.17 +With this in mind, we evaluate how well five categories of previous models used for performance tuning fit with our position, and the consequences of gaps in the fit. The first four approaches are found, sometimes in combinations, in most classic performance evaluation tools, while the fifth is starting to emerge, enabled by the growing adoption of task-based programming languages in recent years.
1.18
1.19 -The commonality among the four classic approaches is either the lack of a  model of parallel computation, or a poorly suited one.
1.20 +The commonality among the four classic approaches is models that are either hardware centric or closely related to the sequential execution model. As will be seen, the difficulty in using these approaches stems from their models, which are missing key concepts  of parallel execution.
1.21
1.22 -The early tools also suffered from applications that, effectively, implemented the runtime as part of the application code, which hid information from the tool. Both MPI and threads cause the runtime to be embedded into the application. In such a case, the units of work are implied in the code, and difficult for tools to recognize. Likewise, constraints on scheduling are enforced by the code, but never stated in any explicit form.
1.23 +These early tools were created according to the same models that pervaded parallel application development at the time. Such models hampered the tools because they caused application code to, effectively, implement portions of the runtime as part of the application code, which hid critical information about units and constraints from the tool. Both MPI and threads cause the units of work to be \textit{implied} by the code,  which makes them difficult for tools to recognize. Likewise, constraints on scheduling are enforced by the code, but never stated in any explicit form, and so are unavailable to the tools to use to connect code features to runtime behaviors.
1.24
1.25
1.27 -Most of the older more established tools come from the threads world, and conceive of the application as a collection of virtual processors that perform actions, but don't include the concept of application-defined tasks nor constraints on them. This makes them unable to directly connect statistics they gather to scheduling choices and application features.  The lack of connection forces the user to guess at what aspect of the code is responsible for observed performance.
1.28 +Most of the early, more established, tools come from the threads world, and conceive of the application as a collection of virtual processors (threads) that perform actions, but don't include the concept of application-defined tasks nor constraints on them. This makes the tools unable to directly connect statistics they gather to scheduling choices and application features.  That lack of connection forces the user to guess at what aspect of the code is responsible for observed performance.
1.29
1.30 -For example, Tau [] is  representative of the thread-centric approach, and a highly cited, older, system for performance tuning parallel applications. It integrates many data sources, and has rich displays. However it models cores and memories and thread contexts, with actions taken on or by each, with no well defined concept of unit of work. What it is missing is the concept:  tasks, constraints on them, and scheduling choices.
1.31 +Tau []   represents the thread-centric approach well. It integrates many data sources, and has rich displays. However it models cores and memories, and thread contexts, with actions taken on or by each. Because it had no well defined concept of unit of work, constraints on them, and scheduling choices, it was unable to help the user form hypotheses for the parallelism-specific causes of poor performance.
1.32
1.33  \subsection{Event-centric approach}
1.34
1.35 -Systems that model parallel computation as a collection of events are well represented by Paradyn[], another highly cited classic performance tuning tool.  Its model of computation is based on events, both the timing of events and counts of events.
1.36 +A second classic approach  models parallel computation as a collection of events.  Paradyn[] represents this category well.  Its model of computation is based on both the timing of events and counts of events.
1.37
1.38 - It has a system for user-supplied instrumentation to collect event information and it has a hypothesis mechanism that protects the user from having to write custom code to test their hypotheses. However, the hypotheses are in terms of the timing and counts of events, not the parallel computation relevant information of units of scheduled work and the scheduling decisions made on those.
1.39 + It has a system for user-supplied instrumentation to collect event information and it has a hypothesis mechanism that protects the user from having to write custom code to test their hypotheses. However, the hypotheses are in terms of the timing and counts of events, not the parallel computation relevant information of units of scheduled work and the scheduling decisions made on those. As such, it helps to test hypothesis but not to generate them.
1.40
1.41  \subsection{Message-centric approach}
1.42
1.43 -Paragraph also follows an event-based model, but represents the large collection of simpler tools that instrument the MPI or other message-passing library. It shows whether cores are busy, and indicates communication overhead, but lacks an underlying computation model to tie the communication pattern realized to application code features, which are what is under programmer control. It also fails to show runtime overhead, and which portions of idle time are caused by runtime internal constraints.
1.44 +Paragraph also follows an event-based model, but represents the large collection of simpler tools that instrument the MPI or other message-passing library. It shows whether cores are busy, and indicates communication overhead and flight time, but lacks an underlying parallel computation model. Without such a model, it is difficult to  tie the communication pattern realized to application code features, which are what is under programmer control. Hence such message-based approaches also provide no help in forming hypotheses of which code features are responsible for observed performance.
1.45
1.46  %Note that Paraver and Vampir are just painting tools that take event measurements and paint them on the screen.
1.47
1.48  \subsection{Performance-counter approaches}
1.49 -Performance-counter approaches, such as VTune, POPI, and so forth concentrate on identifying hot-spots and potential false-sharing.  These suffer from a lack of encompassing computation model, leaving the user to guess at what might be the cause of measured numbers.  They do a good job of saying that something might be wrong, but a poor job of pointing to what is causing the problem, and hence leave the user baffled as to what to change in their code to get better performance.
1.50 +Performance-counter approaches, such as VTune, POPI, and so on make up the fourth category. They concentrate on identifying hot-spots and potential false-sharing, using the hardware itself as the execution model. Thus, they have no concept of unit  nor scheduling event nor even runtime. As such,they do a good job of saying that something might be wrong, but don't help in pointing to what is causing the problem, and hence leave the user with no help determining what to change in their code to get better performance, with the exception of changing data structures to fix false sharing.
1.51
1.53
1.54
1.55 - These shortcomings are rectified by recent parallel languages such as CnC[] and StarSs[], which cleanly identify tasks and the constraints on them. The people who develop such a language usually also develop  tools to go with it.
1.56 + Recent parallel languages have begun exposing units and constraints on them, for example  CnC[] and StarSs[].  This provides the tools with the chance to link measurements to a model of parallel computation, such as the one we propose. The user can then link such measurements to the units defined in their code, and form hypothesis about what to change to improve performance.
1.57
1.58 -The StarSs tool[]\ prioritizes simplifying the view for the user. It doesn't give performance information directly, but instead identifies tasks and tells the user whether it thinks the task size is too small, just right, or    too big. Too small has too much runtime overhead, while too big has too few tasks to keep the cores busy.
1.59 +As an example, the performance tuning tool[] that comes with StarSs provides a  view of the code-defined units to the user. For a specific machine, it tells the user whether it thinks the task size is too small, just right, or too big. Too small has too much runtime overhead, while too big has too few tasks to keep the cores busy.  This view simplifies things for the user by providing the hypothesis of what's causing poor performance and directly linking it  to codde.
1.60
1.61 -This is a step in the right direction, but it doesn't include effects of code choices on communication, and gives the programmer only limited information. Subtle and complex interactions between code, runtime, and hardware are often to blame for performance loss, and the limited information supplied here is no help with such complexities.
1.62 +This is a step in the right direction, but it gives the programmer only limited information. Subtle and complex interactions between code, runtime, and hardware are often to blame for performance loss, and the limited information supplied here doesn't help with such complexities. Also, the tool is language specific, and so doesn't apply to applications in other languages.
1.63
1.64
1.65
1.66 @@ -154,9 +154,9 @@
1.67  \section{Illustrative Story of Performance Tuning}
1.68  \label{sec:casestudy}
1.69
1.70 -In this section, we illustrate the benefits of connecting measurements to an appropriate computation model by walking through a typical performance tuning session. It shows the features of our approach in action, and indicates how competing tools' lack of those features makes the work more difficult.
1.71 +In this section, we illustrate the benefits of connecting measurements to an effective computation model by walking through a typical performance tuning session. It shows the features of our approach in action, and indicates how having those features provide an advantage to the user over methods that lack them.
1.72
1.73 -We first describe the program and language used, and then the features of our visualization. After this preparation, we show a sequence of the visualizations. In each,  we point out how the  performance loss is identified, and which visual features suggest the hypothesis for the cause of the loss.
1.74 +We first describe the program and language used, and then the features of our visualization. After this preparation, we show a sequence of the visualizations. In each,  we point out how the  performance loss is identified, and which visual features suggest the hypothesis for the cause of the loss.  We show how this directs the user to the specific sections of code that need to be changed, and how the model helps suggest what changes to try.
1.75
1.76  \subsection{The Application, and Target Hardware}
1.77
1.78 @@ -166,9 +166,9 @@
1.79
1.80  The application code uses language features to create virtual processors (VP). The first VP created divides the work into a number of pieces and creates a new VP for each piece.
1.81
1.82 -How many pieces is set by a tuning parameter in the code, and the number of cores. The appliation uses language features to distribute the VPs  across the cores,  in a round-robin fashion (before tuning).
1.83 +How many pieces is set by a tuning parameter in the code, and the number of cores. The application uses language features to distribute the VPs  across the cores,  in a round-robin fashion (before tuning).
1.84
1.85 -It then creates   a results VP that receives a partial-result from each piece and accumulates the results. The original divider VP  then waits for the results VP to indicate completion, after which the language runtime shuts down.
1.86 +It then creates   a results VP that receives a partial-result from each piece and accumulates the results. The  divider VP  then waits for the results VP to indicate completion, after which the language runtime shuts down.
1.87
1.88  \subsection{The language}
1.89  The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors (VPs). It has commands for creating and destroying VPs, and three kinds of send-receive paired operations.
1.90 @@ -236,10 +236,10 @@
1.91    % \subfloat[]
1.92    %  {\includegraphics[scale=0.015]{../figures/194.pdf}
1.95 +   \subfloat[After fixing the load balancer]
1.96      {\includegraphics[scale=0.015]{../figures/2.pdf}
1.98 -   \subfloat[Solution~1 (portable)]
1.99 +   \subfloat[After changing so as to put  work on core 1 first (solution 1)]
1.100      {\includegraphics[scale=0.015]{../figures/5.pdf}
1.102    % \subfloat[S1+divide factor 0.2]
1.103 @@ -248,19 +248,19 @@
1.104    % \subfloat[S1+divide factor 0.3]
1.105    %  {\includegraphics[scale=0.015]{../figures/209.pdf}
1.107 -   \subfloat[S1+divide factor 0.5]
1.108 +   \subfloat[plus changing the divide factor from 0.6 to  0.5]
1.109      {\includegraphics[scale=0.015]{../figures/6.pdf}
1.111 -   \subfloat[S1+divide factor 0.4]
1.112 +   \subfloat[ further changing the divide factor to 0.4]
1.113      {\includegraphics[scale=0.015]{../figures/7.pdf}
1.115 -   \subfloat[Solution~2]
1.116 +   \subfloat[Going back to put divider VP onto its own core (Solution~2)]
1.117      {\includegraphics[scale=0.015]{../figures/12.pdf}
1.120 +   \subfloat[plus moving the receive VP to same core as divider VP]
1.121      {\includegraphics[scale=0.015]{../figures/10.pdf}
1.123 -   \subfloat[S2+divide factor 0.4]
1.124 +   \subfloat[plus changing the divide factor to 0.4]
1.125      {\includegraphics[scale=0.015]{../figures/15.pdf}
1.127    % \subfloat[S2+divide factor 0.3]
1.128 @@ -279,15 +279,15 @@
1.129
1.130
1.131
1.132 -In this  subsection,  many  SCGs are shown, within Fig \ref{story}, that display the measurements collected on various runs during tuning. They are all 40 columns wide, one for each core, and all are on the same scale, so relative height indicates relative execution time. They have lines in red, orange, and green, which represent application-code constructs. Red is creation of a virtual processor, green is the many-to-one \texttt{send\_of\_type\_to}, and orange is the singleton construct. For better visibility, only constraints that cross cores are shown.
1.133 +In this  subsection,  many  SCGs are shown, in Fig \ref{story}. which display the measurements collected on various runs during tuning, all on the same scale for direct comparison. They are  40 columns wide, one for each core, and relative height indicates relative execution time. They have lines in red, orange, and green, which represent application-code constructs. Red is creation of a virtual processor, green is the many-to-one \texttt{send\_of\_type\_to}, and orange is the singleton construct. For better visibility, only constraints that cross cores are shown.
1.134
1.135
1.136 -After functional debugging, the first tuning run produces the consequence graph seen in Figure \ref{fig:story:a}. The first thing to notice is that it is slimmer than expected: of the 40 available cores, only 13 are being used. As the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection reveals that a closing curly brace in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed despite using this application as test program for development of the language runtime, including performance, for several months.
1.137 +After functional debugging, the first tuning run produces the consequence graph seen in Figure \ref{fig:story:a}. The first thing to notice is that it is slimmer than expected: of the 40 available cores, only 13 are being used. Because the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection reveals that a closing curly brace in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed despite using this application as a test program for development of the language runtime, including performance, for several months.
1.138
1.139  \subsubsection{Second Run}
1.140   After fixing this, the next run (Fig \ref{story:b}) corresponds much more to the expected execution behaviour. However, there remains a noticeable section at the beginning where only 3 cores have work and the other 37 remain idle.
1.141
1.142 -Zooming in on those  cores, we see that creation code starts running on core 0, within the creation VP, and then the next block on the core is work! Creation stops, starving the other cores. Looking at the creation code, we see that the creation VP assigns the first work VP to its own core, so that work is now waiting in the queue to execute there. When it creates the second work VP, that creation call switches core 0 to the runtime. When done with creation, the runtime takes the next VP from the queue, witch is that waiting work VP. Hence core 0 does the work next instead of continuing with creation  (the merits of work stealing or other scheduling strategies are independent from this illustration of how to use this approach to performance tune).
1.143 +Zooming in on those  cores, we see that creation code starts running on core 0, within the creation VP, and then the next block on the core is work! Creation stops, starving the other cores. Looking at the creation code, we see that the creation VP assigns the first work VP to its own core, so that work is now waiting in the queue to execute there. When it creates the second work VP, that creation call switches core 0 to the runtime. When done with creation, the runtime takes the next VP from the queue, which is that waiting work VP. Hence core 0 does the work next instead of continuing with creation  (the merits of work stealing or other scheduling strategies are independent from this illustration of how to use this approach to performance tune).
1.144
1.145  The hypothesis was generated by looking at the code linked to each block and noting the visual pattern that creation code stopped running on core 0. Work code started running instead, and only after it finished did creation code start again. Hence, visual cues led directly to the hypothesis.
1.146
1.147 @@ -322,12 +322,12 @@
1.148
1.149  Now that the usage has been seen, we expand on the model behind the visualizations. The model ties the information together, and understanding it helps in generating  hypotheses from the visualization features.
1.150
1.151 -As seen, the model has two parts, a \emph{Unit \&\ Constraint Collection (UCC)}, and a \emph{Scheduling Consequence Graph} (just consequence graph or CG).  The UCC indicates the scheduling  choices  the application allows, encoding what the programmer has control over. The consequence graph says which of those were actually taken during the run and the consequences of that set of choices.
1.152 +As seen, the model has two parts, a \emph{Unit \&\ Constraint Collection (UCC)}, and a \emph{Scheduling Consequence Graph} (SCG or just consequence graph,  CG).  The UCC indicates the scheduling  choices  the application allows, and so shows what the programmer has control over. Whereas the consequence graph says which of those were actually taken during the run and the consequences of that set of choices.
1.153
1.154  We give a more precise description of UCC, then consequence graph, in turn.
1.155  However, space is too limited for a complete definition, which is given in a companion paper submitted to a longer format venue.
1.156  \subsection{Unit \& Constraint Collection}
1.157 -The UCC contains all the units of work that get scheduled during a run, and all constraints the application places on scheduling the units. That's a nice solid definition, but things aren't quite that simple. The complication is that different classes of application exist, with two degrees of freedom that determine how much of the UCC is actually defined in the application vs the input data, or even in the runtime.
1.158 +The UCC contains all the units of work that get scheduled during a run, and all constraints the application places on scheduling those units. That's a nice solid definition, but things aren't quite that simple. The complication is that different classes of application exist, with two degrees of freedom that determine how much of the UCC is actually defined in the application vs the input data, or even in the runtime.
1.159
1.160  Some applications have everything determined in the code, with all units fixed, and all constraints fixed. An example is matrix multiply with fixed size matrices.  But for others, the shape of the UCC is only partially defined by the application code.  Take matrix multiply used in Section \ref{sec:casestudy}, where an input parameter determines the number of units created. Here, the UCC is different for each parameter value. An extreme example is an NP complete problem, with redividable units, for which the units are a function of both the  input data \emph{and} decisions made by the runtime.
1.161