changeset 72:d79f1861f95e

perf tune: rewrote related work
author Nina Engelhardt Mon, 06 Aug 2012 19:04:19 +0200 8ad64c3d3da1 2c858fb55da0 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 2 files changed, 12 insertions(+), 41 deletions(-) [+]
line diff
     1.1 Binary file 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf has changed

     2.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Mon Aug 06 01:18:50 2012 -0700
2.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Mon Aug 06 19:04:19 2012 +0200
2.3 @@ -68,7 +68,7 @@
2.4
2.5
2.6  \begin{abstract}
2.7 -Performance tuning is an important aspect of parallel programming. After all, why go parallel, if not for higher performance? Yet when trying to track down the causes of performance loss, a great deal of knowledge about the internal structure of both the application and the runtime is often needed to understand how the observed patterns of performance have come to pass.
2.8 +Performance tuning is an important aspect of parallel programming. Yet when trying to track down the causes of performance loss, a great deal of knowledge about the internal structure of both the application and the runtime is often needed to understand how the observed patterns of performance have come to pass.
2.9
2.10  The trend in parallel programming languages has been towards models that capture more structural information about the application, in an effort to increase both performance and ease of programming. We believe that this information can be used to improve performance tuning tools by making the causes of performance loss more readily apparent.
2.11
2.12 @@ -106,52 +106,23 @@
2.13  \section{Background and Related Work}
2.14  \label{sec:related}
2.15
2.16 -%A quick review of the process of performance tuning will provide much needed context for the discussion of other tools.
2.17 +Performance tuning is an iterative process that involves a mental model. The programmer takes measurements during execution that are then compared to the desired outcome. A mental model, constructed through experience and knowledge of the mechanics of execution, is used to generate a hypothesis explaining any discrepancies between the measurement and expectations. This hypothesis is then linked, again through a mental model, to things within the programmer's control, to suggest a change to make to the code. The modified code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central to performance tuning.
2.18
2.19 - Performance tuning is an iterative process that involves a mental model. The programmer takes measurements during execution that are then compared to the desired outcome. A mental model, constructed through experience and knowledge of the mechanics of execution, is used to generate a hypothesis explaining any discrepancies between the measurement and expectations. This hypothesis is then linked, again through a mental model, to things within the programmer's control, to suggest a change to make to the code. The modified code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central to performance tuning.
2.20 +The pthreads abstraction is very close to the hardware. There are almost no applications whose structure maps gracefully onto pthreads, which accounts for much of the difficulty of programming with threads. Yet as the earliest and most widespread parallel programming model, and one which serves as the basis on top of which many other models are implemented, it must be supported.
2.21 +Pthreads intentionally introduces randomness, and its synchronization constructs imply only indirect connections. For instance, very little conclusion can be drawn about the relationship between the computations in two separate threads from their accessing the same lock consecutively. Tools such as Paradyn \cite{PerfToolParadyn} that rely on this model consequently have a hard time connecting measurements to the application. They collect a wealth of statistics, but the application is to the tool a foreign process, where things just happen'' for no reason. The anticipated problems are things like "application bottleneck is synchronisation", and the detailed problems "too much time is spent in spinlocks". If there is more than one use of spinlocks in the application, it is not even obvious where the problem actually is.
2.22
2.23 -The rest of the paper will support the position that the quantities best to measure when performance tuning are scheduling decisions and the consequent usage of communication hardware and cores. Hence, the mental model should have, as concepts,  units of scheduled work, and scheduling decisions made on them, then relate those to consequent hardware behavior.  The model should also relate all of those to application code, such as code boundaries that identify units, and constraints (dependencies) the application places on scheduling decisions.  This model is how the individual views become integrated into a view of the whole.
2.24 +One fix to these problems is to allow the users to introduce measuring points into their own code. While this allows for a great deal of flexibility, it requires a lot more effort. One major advantage of this approach is that instrumentation code is written in the source language, so it has access to application concepts. This advantage can be kept with automated instrumentation, by providing an instrumenting compiler, like the Tau \cite{PerfToolTau} project does.
2.25 +As long as the underlying parallel language is still pthreads, however, there is no meaningful common structure to which to attach in order to generate expressive measurement quantities. Usually, function boundaries and the call graph are used to attach measurements. The sequence and frequency of function calls is very useful in showing how sequential performance relates to application semantics, however, they tell little about parallel performance impacts because they have no bearing on synchronization events. Assuming the parallel programming model is implemented as an external library, only the specific subset of parallel library function calls is actually relevant to the parallel aspects of performance.
2.26 +Placing instrumentation code in the parallel library therefore allows capturing the important information for parallel performance. Unfortunately, pthreads does not capture even hints as to \emph{why} a given function call ends up blocking or not blocking, and what the effects on other threads are.
2.27
2.28 -With this in mind, we evaluate how well five categories of previous models used for performance tuning fit with our position, and the consequences of gaps in the fit. The first four approaches are found, sometimes in combinations, in most classic performance evaluation tools, while the fifth is starting to emerge, enabled by the growing adoption of task-based programming languages in recent years.
2.29 +When a less low-level parallel library is used, much of this problem disappears. For instance, in an application with MPI message passing \cite{MPI}, the information thread 2 spends little time waiting for messages from thread 0 but a lot of time waiting for messages from thread 1'' can be recorded, where in pthreads only thread 2 spends a lot of time waiting for a signal'' would be visible. It is much easier to reach the conclusion that the bottleneck is the slow rate at which thread 1 produces data from the first than from the second.
2.30 +Leveraging the MPI library is a well-trodden path, with many tools available \cite{PMPI} \cite{MPICL} (the above-cited Tau and Paradyn also added this feature). A frequent complaint of users of these tools is that in complex programs, the large number of messages sent makes it difficult to see problems. As there is little that distinguishes messages beside size, it is difficult to implement good filtering. Once again, we find that the limitations come from a parallel model that captures insufficient information.
2.31
2.32 -The commonality among the four classic approaches is models that are either hardware centric or closely related to the sequential execution model. As will be seen, the difficulty in using these approaches stems from their models, which are missing key concepts  of parallel execution.
2.33 +Fortunately, parallel languages are evolving in the same direction, and it is generally thought that semantically rich parallel constructs, when they match the structure of the application well, are key to improved productivity. With languages such as \cite{CnCInHotPar} and StarSs \cite{StarSs}, the information available to the language runtime offers a much clearer picture of what constraints are placed on the execution of tasks and where they come from. Consequently, the application structure can be well reflected in visualizations, as the Temanejo debugger for StarSs \cite{PerfToolStarSs} demonstrates.
2.34
2.35 -These early tools were created according to the same models that pervaded parallel application development at the time. Such models hampered the tools because they caused application code to, effectively, implement portions of the runtime as part of the application code, which hid critical information about units and constraints from the tool. Both MPI and threads cause the units of work to be \textit{implied} by the code,  which makes them difficult for tools to recognize. Likewise, constraints on scheduling are enforced by the code, but never stated in any explicit form, and so are unavailable to the tools to use to connect code features to runtime behaviors.
2.36 +We believe that the quantities best to measure when performance tuning are scheduling decisions and the consequent usage of communication hardware and cores. Hence, the mental model should have, as concepts,  units of scheduled work, and scheduling decisions made on them, then relate those to consequent hardware behavior. The model should also relate all of those to application code, such as code boundaries that identify units, and constraints (dependencies) the application places on scheduling decisions.
2.37
2.38 -
2.40 -Most of the early, more established, tools come from the threads world, and conceive of the application as a collection of virtual processors (threads) that perform actions, but don't include the concept of application-defined tasks nor constraints on them. This makes the tools unable to directly connect statistics they gather to scheduling choices and application features.  That lack of connection forces the user to guess at what aspect of the code is responsible for observed performance.
2.41 -
2.42 -Tau \cite{PerfToolTau}   represents the thread-centric approach well. It integrates many data sources, and has rich displays. However it models cores and memories, and thread contexts, with actions taken on or by each. This means that it is entirely up to the user to understand both application and runtime in enough detail to recognize patterns in the observed performance and generate hypotheses about which constructs are implicated.
2.43 -
2.44 -%Because it had no well defined concept of unit of work, constraints on them, and scheduling choices, it was unable to help the user form hypotheses for the parallelism-specific causes of poor performance.
2.45 -
2.46 -\subsection{Event-centric approach}
2.47 -
2.48 -A second classic approach  models parallel computation as a collection of events.  Paradyn \cite{PerfToolParadyn} represents this category well.  Its model of computation is based on both the timing of events and counts of events.
2.49 -
2.50 - It has a system for user-supplied instrumentation to collect event information and it has a hypothesis mechanism that protects the user from having to write custom code to test their hypotheses. However, the hypotheses are in terms of the timing and counts of events, not the parallel computation relevant information of units of scheduled work and the scheduling decisions made on those. As such, it helps to test hypothesis but not to generate them.
2.51 -
2.52 -\subsection{Message-centric approach}
2.53 -
2.54 -Paragraph \cite{PerfToolParagraph} also follows an event-based model, but represents the large collection of simpler tools that instrument the MPI or other message-passing library. It shows whether cores are busy, and indicates communication overhead and flight time, but lacks an underlying parallel computation model. Without such a model, it is difficult to  tie the communication pattern realized to application code features, which are what is under programmer control. Hence such message-based approaches also provide no help in forming hypotheses of which code features are responsible for observed performance.
2.55 -
2.56 -%Note that Paraver and Vampir are just painting tools that take event measurements and paint them on the screen.
2.57 -
2.58 -\subsection{Performance-counter approaches}
2.59 -Performance-counter approaches, such as VTune, POPI, and so on make up the fourth category. They concentrate on identifying hot-spots and potential false-sharing, using the hardware itself as the execution model. Thus, they have no concept of unit  nor scheduling event nor even runtime. As such, they do a good job of saying that something might be wrong, but don't help in pointing to what is causing the problem, and hence leave the user with no help determining what to change in their code to get better performance.
2.60 -
2.62 -
2.63 -
2.64 - Recent parallel languages have begun exposing units and constraints on them, for example  CnC\cite{CnCInHotPar}  and StarSs \cite{StarSs}.  This provides the tools with the chance to link measurements to a model of parallel computation, such as the one we propose. The user can then link such measurements to the units defined in their code, and form hypothesis about what to change to improve performance.
2.65 -
2.66 -As an example, the performance tuning tool \cite{PerfToolStarSs} that comes with StarSs provides a  view of the code-defined units to the user. For a specific machine, it tells the user whether it thinks the task size is too small, just right, or too big. Too small has too much runtime overhead, while too big has too few tasks to keep the cores busy.  This view simplifies things for the user by providing the hypothesis of what's causing poor performance and directly linking it  to code.
2.67 -
2.68 -This is a step in the right direction, but it gives the programmer only limited information. Subtle and complex interactions between code, runtime, and hardware are often to blame for performance loss, and the limited information supplied here doesn't help with such complexities. Moreover, the view is language specific, and so cannot apply to applications in other languages.
2.69 -
2.70 -
2.71 -
2.72 +In this paper, we present a model that allows visualization of the causes of parallel performance for any parallel programming library or language through a flexible system of constraints. Constraints can be defined during instrumentation of the parallel runtime to match as closely as possible all the information available. If the parallel programming model captures enough semantic information to distinguish between constraints of different sources in the application, and the runtime adds its internal constraints and those imposed by the hardware, all the factors influencing scheduling decisions should be covered. Combined with a filtering mechanism to show or hide certain groups of constraints as desired, these visualizations make it significantly easier to understand the performance of parallel applications based only on runtime instrumentation.
2.73
2.74
2.75  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%