# HG changeset patch
# User Nina Engelhardt
# Date 1344356189 -7200
# Node ID 2c858fb55da04f7c7f6acf9557204ad165bba5fd
# Parent d79f1861f95e53695340fd0ec5296323b78c720a
perf tune: minor fixes

diff -r d79f1861f95e -r 2c858fb55da0 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf
Binary file 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf has changed
diff -r d79f1861f95e -r 2c858fb55da0 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
--- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex Mon Aug 06 19:04:19 2012 +0200
+++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex Tue Aug 07 18:16:29 2012 +0200
@@ -69,11 +69,8 @@
 \begin{abstract}
 Performance tuning is an important aspect of parallel programming. Yet when trying to track down the causes of performance loss, a great deal of knowledge about the internal structure of both the application and the runtime is often needed to understand how the observed patterns of performance have come to pass.
-
-The trend in parallel programming languages has been towards models that capture more structural information about the application, in an effort to increase both performance and ease of programming. We believe that this information can be used to improve performance tuning tools by making the causes of performance loss more readily apparent.
-
+The trend in parallel programming languages has been towards models that capture more structural information about the application, in an effort to increase both performance and ease of programming. This structural information can be used to improve performance tuning tools by making the causes of performance loss more readily apparent.
 We propose a universal but adaptable way of integrating more application structure into performance visualizations, relying on a model of parallel computation. The visualizations produced clearly identify idle cores, and tie the idleness to causal interactions within the runtime and hardware, and from there to the parallelism constructs that constrained the runtime and hardware behavior, thereby eliminating guesswork.
-
 This is implemented for multi-core hardware, and we walk through a tuning session on a large multi-core machine to illustrate how performance loss is identified and how hypotheses for the cause are generated. We also give a concise description of the implementation and the computation model.
 \end{abstract}
@@ -98,9 +95,9 @@
 %% hard to understand to the constraint in the code that combined with the runtime to cause its placement in time and location. The pattern of placements, combined with contents of the code, leads to the hypothesis.
-In this paper, we describe our model of computation, and illustrate its usage with a story line of performance tuning a standard parallel application on a large multi-core system.
+The model of computation and its usage are illustrated with a story line of performance tuning a standard parallel application on a large multi-core system.
 
-We start with a refresher on performance tuning and an overview of previous approaches in section \ref{sec:related}. We show usage of our visualizations through a case study in section \ref{sec:casestudy}, and then expand on the model behind it in section \ref{sec:theory}. Section \ref{sec:Implementation} will tie the model to implementation details. Finally, we will conclude in section \ref{sec:conclusion}.
+We start with an overview of previous approaches in section \ref{sec:related}. We show usage of our visualizations through a case study in section \ref{sec:casestudy}, and then expand on the model behind it in section \ref{sec:theory}. Section \ref{sec:Implementation} ties the model to implementation details. Finally, we conclude in section \ref{sec:conclusion}.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{Background and Related Work}
@@ -108,11 +105,11 @@
 Performance tuning is an iterative process that involves a mental model. The programmer takes measurements during execution that are then compared to the desired outcome. A mental model, constructed through experience and knowledge of the mechanics of execution, is used to generate a hypothesis explaining any discrepancies between the measurement and expectations. This hypothesis is then linked, again through a mental model, to things within the programmer's control, to suggest a change to make to the code. The modified code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central to performance tuning.
-The pthreads abstraction is very close to the hardware. There are almost no applications whose structure maps gracefully onto pthreads, which accounts for much of the difficulty of programming with threads. Yet as the earliest and most widespread parallel programming model, and one which serves as the basis on top of which many other models are implemented, it must be supported.
-Pthreads intentionally introduces randomness, and its synchronization constructs imply only indirect connections. For instance, very little conclusion can be drawn about the relationship between the computations in two separate threads from their accessing the same lock consecutively. Tools such as Paradyn \cite{PerfToolParadyn} that rely on this model consequently have a hard time connecting measurements to the application. They collect a wealth of statistics, but the application is to the tool a foreign process, where things ``just happen'' for no reason. The anticipated problems are things like "application bottleneck is synchronisation", and the detailed problems "too much time is spent in spinlocks". If there is more than one use of spinlocks in the application, it is not even obvious where the problem actually is.
+The pthreads abstraction is very close to the hardware. There are almost no applications whose structure maps gracefully onto pthreads, which means that the user has to simultaneously keep in mind and connect two very different mental models. This accounts for much of the difficulty of programming with threads, and remains a problem when analyzing performance. Yet as the earliest and most widespread parallel programming model, and one which serves as the basis on top of which many other models are implemented, it must be supported.
+Pthreads intentionally introduces randomness, and its synchronization constructs imply only indirect connections. For instance, very little can be concluded about the relationship between the computations in two separate threads from their accessing the same lock consecutively. Tools such as Paradyn \cite{PerfToolParadyn} or VTune \cite{PerfToolVTune} that rely on this model consequently have a hard time connecting measurements to the application. They collect a wealth of statistics, but the application is to the tool a foreign process, where things ``just happen'' for no reason. The anticipated problems are things like ``application bottleneck is synchronization'', and the detailed problems ``too much time is spent in spinlocks''. If there is more than one use of spinlocks in the application, it is not even obvious where the problem actually is.
 One fix to these problems is to allow the users to introduce measuring points into their own code. While this allows for a great deal of flexibility, it requires a lot more effort. One major advantage of this approach is that instrumentation code is written in the source language, so it has access to application concepts. This advantage can be kept with automated instrumentation, by providing an instrumenting compiler, like the Tau \cite{PerfToolTau} project does.
-As long as the underlying parallel language is still pthreads, however, there is no meaningful common structure to which to attach in order to generate expressive measurement quantities. Usually, function boundaries and the call graph are used to attach measurements. The sequence and frequency of function calls is very useful in showing how sequential performance relates to application semantics, however, they tell little about parallel performance impacts because they have no bearing on synchronization events. Assuming the parallel programming model is implemented as an external library, only the specific subset of parallel library function calls is actually relevant to the parallel aspects of performance.
+As long as the underlying parallel language is still pthreads, however, there is no meaningful structure common to all applications to which measurements can be attached in order to generate expressive quantities. Usually, function boundaries and the call graph are used to contextualize measurements. The sequence and frequency of function calls are very useful in showing how sequential performance relates to application semantics; however, they tell little about parallel performance impacts because they have no bearing on synchronization events. Assuming the parallel programming model is implemented as an external library, only the specific subset of parallel library function calls is actually relevant to the parallel aspects of performance.
 Placing instrumentation code in the parallel library therefore allows capturing the important information for parallel performance. Unfortunately, pthreads does not capture even hints as to \emph{why} a given function call ends up blocking or not blocking, and what the effects on other threads are.
 When a less low-level parallel library is used, much of this problem disappears. For instance, in an application with MPI message passing \cite{MPI}, the information ``thread 2 spends little time waiting for messages from thread 0 but a lot of time waiting for messages from thread 1'' can be recorded, where in pthreads only ``thread 2 spends a lot of time waiting for a signal'' would be visible. It is much easier to reach the conclusion that the bottleneck is the slow rate at which thread 1 produces data from the first than from the second.
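
To make the contrast in that last paragraph concrete, here is a minimal C sketch; it is illustrative only, not code from the paper or the patched sources, and names such as consumer_pthreads, consumer_mpi, data_ready, and queue_empty are hypothetical. In the pthreads version the wait refers only to a condition variable, so an instrumenting tool can record no more than "this thread blocked"; in the MPI version each receive names its source rank, so the same blocked time can be attributed to a specific producer.

    #include <pthread.h>
    #include <mpi.h>

    /* pthreads consumer: the wait names only a lock and a condition variable,
       not whichever thread will eventually produce the data. */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  data_ready = PTHREAD_COND_INITIALIZER;
    static int queue_empty = 1;                  /* hypothetical shared state */

    void consumer_pthreads(void)
    {
        pthread_mutex_lock(&lock);
        while (queue_empty)
            pthread_cond_wait(&data_ready, &lock);   /* waiting, but on whom? */
        pthread_mutex_unlock(&lock);
    }

    /* MPI consumer: each receive names its source rank, so waiting time can be
       charged to rank 0 or rank 1 individually. */
    void consumer_mpi(double *buf, int count)
    {
        MPI_Recv(buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

A wrapper around MPI_Recv can therefore report waiting time per source rank (for example, "most of the wait is on rank 1"), whereas a wrapper around pthread_cond_wait can only report aggregate blocking time.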