### changeset 81:b5e314f7cdb9

perf tuning -- Fudged intro and related work, probably need cleanup
author Sean Halle Mon, 13 Aug 2012 02:13:02 -0700 2bf63d88116a 8508a9dd967a 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 1 files changed, 37 insertions(+), 16 deletions(-) [+]
line diff
     1.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Sat Aug 11 17:38:18 2012 -0700
1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Mon Aug 13 02:13:02 2012 -0700
1.3 @@ -88,12 +88,14 @@
1.4
1.5  Current tools may cover all the parts of the application code, but fail to adequately connect their observations to the runtime behavior, scheduling decisions, and consequent hardware behavior. The decision about which task or virtual processor is assigned to which core at what point in time is at the heart of parallel performance, so these choices need to be highlighted and connected to the application features influencing them.
1.6
1.7 -In this paper we describe a model and framework that provides a more complete picture. It captures the relevant application features (units of work and the constraints governing their execution), exposes the scheduling decisions taken in accordance with the application and other constraints, and provides a structure to which to attach the measurements captured during execution. The model relates the measurements to each other and to the application.
1.8 +In this paper we describe a model and framework that makes these connections, making it straightforward to identify less-than-optimal performance, easier to generate a hypothesis for the losses, and easier to link that hypothesis to specific points in the application code.
1.9 +
1.10 +Our approach provides a more complete picture by following a model of parallel computation that suggests what information to gather and present to the user. The model suggests that the relevant application features are units of work and the constraints governing their execution, and that it is crucial to record and present the decisions taken by the runtime when scheduling those units onto hardware. The model provides a structure to which to attach the measurements captured during execution, which simplifies hypothesis generation by relating the measurements to each other and to the application.
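As an illustrative sketch only, the three kinds of records the model calls for (units, constraints, scheduling decisions) might be linked like this; the type names, fields, and the `loop.c:42` code site are hypothetical, not the paper's actual instrumentation format.

```python
# Hypothetical record types for the model's trace data; linking by
# unit id lets a view walk from an idle core interval back to the
# unit that should have filled it and the code site that created it.
from dataclasses import dataclass

@dataclass
class Unit:
    uid: int            # a schedulable unit of work
    code_site: str      # spot in application code it came from

@dataclass
class Constraint:
    source: str         # "application", "runtime", or "hardware"
    blocked: int        # uid of the unit being held back
    waits_on: int       # uid of the unit it depends on

@dataclass
class Decision:
    uid: int            # which unit the runtime scheduled
    core: int           # where it was placed
    start_us: float     # when it began executing
    end_us: float       # when it finished

trace = {
    "units": [Unit(1, "loop.c:42"), Unit(2, "loop.c:42")],
    "constraints": [Constraint("application", 2, 1)],
    "decisions": [Decision(1, 0, 0.0, 5.0), Decision(2, 0, 5.0, 9.0)],
}
```

Here the dependency of unit 2 on unit 1 explains why core 0 could not run them concurrently, which is exactly the kind of causal link the prose describes.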
1.11
1.12
1.13 -Views generated by using the model identify the schedulable units in the application code, and indicate how causal activities related to them result in the observed performance.  The result is like having a map showing idle cores linked back to the spots in the application code responsible. The information visualized includes the units, the scheduling decisions made for each by the runtime, runtime overhead assigned to each, and consequent usage of the cores. Identifying idle cores is immediate, and the visual features  enable more quickly generating the correct hypotheses for the causes of the performance losses, and pinpointing where in the application code changes can be made to improve performance.
1.14 +Our instrumentation detects the schedulable units in the application code, and the visual presentation indicates how causal activities related to the units result in the observed performance. The result is like having a map showing idle cores linked back to the spots in the application code responsible. The visual arrangement makes identifying idle cores immediate, enables quickly generating the correct hypotheses for the causes of the performance losses, and pinpoints where in the application code changes can be made to improve performance.
1.15
1.16 -The model of computation and its usage is illustrated with a story line of performance tuning a standard parallel application on a large multi-core system.
1.17 +The model of computation and our visual approach to using it are illustrated with a story line of performance tuning a standard parallel application on a large multi-core system.
1.18
1.19  We start with an overview of previous approaches in section \ref{sec:related}. We show usage of our visualizations through a case study in section \ref{sec:casestudy}, and then expand on the model behind it in section \ref{sec:theory}. Section \ref{sec:Implementation} will tie the model to implementation details. Finally, we will conclude in section \ref{sec:conclusion}.
1.20
1.21 @@ -101,33 +103,52 @@
1.22  \section{Background and Related Work}
1.23  \label{sec:related}
1.24
1.25 - Performance tuning is an iterative process that involves a mental model. The programmer takes measurements during execution and compares them to the desired outcome. A mental model, constructed through experience and knowledge of the mechanics of execution, is used to generate a hypothesis explaining any discrepancies between the measurement and expectations. This hypothesis is then linked, again through a mental model, to things within the programmer's control, to suggest a change to make to the code. The modified code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central to performance tuning.
1.26 + We believe that the three primary factors to consider when choosing a performance tuning tool are:
1.27
1.28 -The pthreads abstraction is very close to the hardware. There are almost no applications whose structure maps gracefully onto pthreads, which means that the user has to simultaneously keep in mind and connect two very different mental models. This accounts for much of the difficulty of programming with threads, and remains a problem when analyzing performance. Yet as the earliest and most widespread parallel programming model, and one which serves as the basis on top of which many other models are implemented, it must be supported.
1.29 +\begin{itemize}
1.30 +\item the ease of recognizing that performance is less than optimal
1.31 +\item the ease of forming a hypothesis to explain the difference
1.32 +\item the ease of linking the hypothesis to changes to make in application code
1.33 +\end{itemize}
1.34 +
1.35
1.36 -Pthreads intentionally introduces randomness, and its synchronization constructs imply only indirect connections. For instance, very little conclusion can be drawn about the relationship between the computations in two separate threads from their accessing the same lock consecutively. Tools such as Paradyn \cite{PerfToolParadyn} or VTune \cite{PerfToolVTune} that rely on this model consequently have a hard time connecting measurements to the application. They collect a wealth of statistics, but the application is to the tool a foreign process, where things just happen'' for no reason. The anticipated problems are things like application bottleneck is synchronisation'', and the detailed problems too much time is spent in spinlocks''. If there is more than one use of spinlocks in the application, it is not even obvious where the problem actually is.
1.37
1.38 -One fix to these problems is to allow the users to introduce measuring points into their own code. While this allows for a great deal of flexibility, it requires a lot more effort. One major advantage of this approach is that instrumentation code is written in the source language, so it has access to application concepts. This advantage can be kept with automated instrumentation, by providing an instrumenting compiler, like the Tau \cite{PerfToolTau} project does.
1.39 -As long as the underlying parallel language is still pthreads, however, there is no meaningful structure common to all applications to which to attach in order to generate expressive measurement quantities. Usually, function boundaries and the call graph are used to contextualize measurements. The sequence and frequency of function calls is very useful in showing how sequential performance relates to application semantics, however, they tell little about parallel performance impacts because they have no bearing on synchronization events. Assuming the parallel programming model is implemented as an external library, only the specific subset of parallel library function calls is actually relevant to the parallel aspects of performance.
1.40 -Placing instrumentation code in the parallel library therefore allows capturing the important information for parallel performance. Unfortunately, pthreads does not capture even hints as to \emph{why} a given function call ends up blocking or not blocking, and what the effects on other threads are.
1.41 +This belief is based on the following description of performance tuning:
1.42 +it is an iterative process in which a mental model plays a key role. The programmer takes measurements during execution and compares them to the desired outcome. A hypothesis for the cause of any differences is generated using a mental model of the causal relationships within the execution. The mental model is then also used to link the hypothesis to code, suggesting a change to make to improve performance. After the modification, the code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central to performance tuning.
1.43
1.44 -When a less low-level parallel library is used, much of this problem disappears. For instance, in an application with MPI message passing \cite{MPI}, the information thread 2 spends little time waiting for messages from thread 0 but a lot of time waiting for messages from thread 1'' can be recorded, where in pthreads only thread 2 spends a lot of time waiting for a signal'' would be visible. It is much easier to reach the conclusion that the bottleneck is the slow rate at which thread 1 produces data from the first than from the second.
1.45 +Because of its central role in performance tuning, the mental model is a major factor in the usability of performance-tuning tools. The ease of forming a hypothesis for the cause of performance losses, and the ease of linking that hypothesis back to changes to make in the application code, are the main areas in which our approach differs, due to its mental model, which is based on a new model of parallel computation.
1.46 +
1.47 + Until recently, a dominant mental model for parallel computation has been the pthread abstraction. It is very close to the hardware, with little higher-level semantics, which makes it problematic when analyzing parallel performance. Yet as the earliest and most widespread parallel programming model, and one which serves as the basis on top of which many other models are implemented, it must be supported.
1.48 +
1.49 +Pthreads introduces randomness through OS time-slicing, and its synchronization constructs imply only indirect connections. For instance, very little can be concluded about the relationship between the computations in two separate threads from their accessing the same lock consecutively. Tools such as Paradyn \cite{PerfToolParadyn} or VTune \cite{PerfToolVTune} that rely on this model consequently have a hard time connecting measurements to the application. They collect a wealth of statistics, but to the tool the application is a foreign process, where ``things just happen'' without causal links between events and code details. The reported problems are things like ``the application bottleneck is synchronization'' and ``too much time is spent in spinlocks''. If there is more than one use of spinlocks in the application, it is not clear what point in the code is the problem.
1.50 +
1.51 +One fix to these issues is to allow users to introduce measuring points into their own code, at the cost of increased programmer effort. Because instrumentation code is written in the source language, it has access to application concepts. This advantage can be kept with automated instrumentation, by providing an instrumenting compiler, as the Tau \cite{PerfToolTau} project does.
1.52 +
1.53 +However, as long as the underlying execution model is still threads, there is no meaningful structure common to all applications from which to generate expressive measurement quantities. Function boundaries and the call graph are not sufficient for parallel execution: the sequence and frequency of function calls tell little about parallel performance impacts because they have no bearing on synchronization events. Unfortunately, pthreads does not capture even hints as to \emph{why} a given function call ends up blocking or not blocking, nor what the effects on other threads are.
1.54 +
1.55 +Higher-level parallel abstractions help alleviate that particular problem. For instance, in an application with MPI message passing \cite{MPI}, the information ``thread 2 spends little time waiting for messages from thread 0 but a lot of time waiting for messages from thread 1'' can be recorded, where in pthreads only ``thread 2 spends a lot of time waiting for a signal'' would be visible. It is much easier to reach the conclusion that the bottleneck is the slow rate at which thread 1 produces data.
1.56  Leveraging the MPI library is a well-trodden path, with many tools available \cite{PMPI} \cite{MPICL} (the above-cited Tau and Paradyn also added this feature). A frequent complaint of users of these tools is that in complex programs, the large number of messages sent makes it difficult to see problems. As there is little that distinguishes messages beside size, it is difficult to implement good filtering. Once again, we find that the limitations come from a parallel model that captures insufficient information.
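The per-sender wait attribution in the MPI example above can be sketched as follows; this is a hedged illustration, not a real MPI binding, and `recv_fn` merely stands in for a blocking receive such as MPI_Recv.

```python
# Attribute blocking-receive wait time to the sending rank, so the
# slowest producer can be identified after a run.
import time
from collections import defaultdict

wait_per_sender = defaultdict(float)

def timed_recv(recv_fn, sender):
    """Call the blocking receive and charge the wait to the sender."""
    t0 = time.perf_counter()
    msg = recv_fn(sender)
    wait_per_sender[sender] += time.perf_counter() - t0
    return msg

def bottleneck_sender():
    """The rank with the largest accumulated wait is the likely bottleneck."""
    return max(wait_per_sender, key=wait_per_sender.get)
```

This is precisely the record that pthreads-level instrumentation cannot produce, since a bare condition-variable wait carries no sender identity.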
1.57
1.58 -Fortunately, parallel languages are evolving in the same direction, and it is generally thought that semantically rich parallel constructs, when they match the structure of the application well, are key to improved productivity. With languages such as \cite{CnCInHotPar} and StarSs \cite{StarSs}, the information available to the language runtime offers a much clearer picture of what constraints are placed on the execution of tasks and where they come from. Consequently, the application structure can be well reflected in visualizations, as the Temanejo debugger  \cite{PerfToolStarSs} for StarSs demonstrates.
1.59 +Fortunately, parallel languages are evolving in the same direction, and it is widely believed that semantically rich parallel constructs, when they match the structure of the application well, are key to improved productivity. With languages such as CnC \cite{CnCInHotPar} and StarSs \cite{StarSs}, the information available to the language runtime offers a much clearer picture of the constraints placed on the execution of tasks and where they come from. Consequently, the application structure can be well reflected in visualizations, as the Temanejo debugger \cite{PerfToolStarSs} for StarSs demonstrates.
1.60
1.61 -We believe that the quantities best to measure when performance tuning are scheduling decisions and the consequent usage of communication hardware and cores. Hence, the mental model should have, as concepts,  units of scheduled work, and scheduling decisions made on them, then relate those to consequent hardware behavior. The model should also relate all of those to application code, such as code boundaries that identify units, and constraints (dependencies) the application places on scheduling decisions.
1.62
1.63 -In this paper, we present a model that allows visualization of the causes of parallel performance for any parallel programming library or language through a flexible system of constraints. Constraints can be defined during instrumentation of the parallel runtime to match as closely as possible all the information available. If the parallel programming model captures enough semantic information to distinguish between constraints of different sources in the application, and the runtime adds its internal constraints and those imposed by the hardware, all the factors influencing scheduling decisions should be covered. Combined with a filtering mechanism to show or hide certain groups of constraints as desired, these visualizations make it significantly easier to understand the performance of parallel applications based only on runtime instrumentation.
1.64 +Our approach differs in the following ways:
1.65
1.66 +\begin{itemize}
1.67 +\item all the factors influencing scheduling decisions are covered, given sufficient language and runtime prerequisites
1.68 +\item it works for any parallel programming library or language
1.69 +\item only the runtime is instrumented, once, for all applications
1.70 +\item it has the flexibility to capture and represent any type of constraint
1.71 +\item a filtering mechanism shows only the most relevant groups of constraints, as desired
1.72 +\end{itemize}
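The last two bullets can be sketched minimally as below; the source tags and dictionary shape are assumptions for illustration, not the tool's actual format.

```python
# Each constraint carries a tag naming where it came from, so whole
# groups can be shown or hidden at once while forming a hypothesis.
constraints = [
    {"source": "application", "blocked": 3, "waits_on": 1},
    {"source": "runtime",     "blocked": 4, "waits_on": 2},
    {"source": "hardware",    "blocked": 4, "waits_on": 3},
]

def visible(constraints, show):
    """Keep only constraints whose source group is enabled."""
    return [c for c in constraints if c["source"] in show]

# e.g. hide runtime-internal and hardware constraints while checking
# whether the application's own dependencies explain an idle core
app_only = visible(constraints, {"application"})
```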
1.73
1.74  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1.75  \section{Illustrative Story of Performance Tuning}
1.76  \label{sec:casestudy}
1.77
1.78 -In this section, we illustrate the benefits of connecting measurements to an effective computation model by walking through a typical performance tuning session. It shows the features of our approach in action, and indicates how having those features provide an advantage to the user over methods that lack them.
1.79 +In this section, we illustrate the benefits of connecting measurements to an effective computation model by walking through a typical performance tuning session using our approach. It shows the features in action, and indicates how having those features provides an advantage to the user over methods that lack them.
1.80
1.81 -We first describe the program and language used, and then the features of our visualization. After this preparation, we show a sequence of the visualizations. In each,  we point out how the  performance loss is identified, and which visual features suggest the hypothesis for the cause of the loss.  We show how this directs the user to the specific sections of code that need to be changed, and how the model helps suggest what changes to try.
1.82 +We first describe the program and language used, and then the features of our visualization. After this preparation, we show a sequence of the visualizations. In each, we point out how the performance loss is identified, and which visual features suggest the hypothesis of the cause of the loss. We show how this directs the user to the specific sections of code that need to be changed, and how the model helps suggest what changes to try.
1.83
1.84  \subsection{The Application, and Target Hardware}
1.85
1.86 @@ -340,7 +361,7 @@
1.87    \label{fig:UCC_Concreteness}
1.88  \end{figure}
1.89
1.90 -Figure \ref{fig:UCC_Concreteness} shows the two axes and the four sets of additional information required to determine the units and constraints in the final concrete UCC. The position an application-derived UCC lands on the grid indicates how far it is from being fully concrete.  The horizontal   indicates what inputs are still needed to determine the units, and vertical the constraints.  0 indicates that the units (constraints) are fully determined by the application code alone; 1 means parameter values also must be known; 2 means input data values also play a role, and 3 means the units (constraints) can only become known  after runtime scheduling decisions have been made.
1.91 +Figure \ref{fig:UCC_Concreteness} shows the two axes and the four sets of additional information that must be added to determine the units and constraints in the final, fully concrete UCC. The position at which an application-derived UCC, or a later UCC in the life-line, lands on the grid indicates how far it is from being fully concrete. The horizontal axis indicates what inputs are still needed to determine the units, and the vertical the constraints. 0 indicates that the units (constraints) are fully determined by the application code alone; 1 means parameter values must also be known; 2 means input data values also play a role; and 3 means the units (constraints) can only become known after runtime scheduling decisions have been made.
1.92
1.93  The closer the application-derived UCC is to the origin, the less additional information is needed to obtain a  concrete UCC descendant of it. The UCC labeled A in the figure is fully concrete just from the source code alone (representing for example, matrix multiply with fixed size matrices and fixed division). The UCC labeled B requires the input data and parameters to be specified before its units are concrete, but just parameters to make its constraints fully concrete (as per ray-tracing, with bounce depth specified as a parameter). The UCC labeled C only has variability in its constraints, which require input data (for example, H.264 motion vectors).
1.94  But even the least concrete UCC, out at the end of the diagonal (D in the figure), generates a concrete descendant during a run of the application.
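The grid coordinates described above can be encoded directly; the numeric pairs for points A through D follow the figure's description, while the level names and helper function are illustrative assumptions, not the paper's notation.

```python
# Each axis of the concreteness grid holds a level 0-3 naming the
# extra information still needed before the units (horizontal) or
# constraints (vertical) of a UCC become known.
LEVELS = {
    0: "application code alone",
    1: "parameter values also needed",
    2: "input data values also needed",
    3: "known only after runtime scheduling decisions",
}

def info_still_needed(units_level, constraints_level):
    """Describe what keeps a UCC from being fully concrete (the origin)."""
    return (LEVELS[units_level], LEVELS[constraints_level])

# Example points from the figure, as (units_level, constraints_level):
A = (0, 0)  # fixed-size matrix multiply: concrete from source alone
B = (2, 1)  # ray tracing: units need input data, constraints need parameters
C = (0, 2)  # H.264: only constraints vary, needing input data
D = (3, 3)  # least concrete: becomes concrete only during a run
```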