VMS/0__Writings/kshalle

changeset 83:78a1ee9b06f1

perf tuning -- small tweaks of intro and related work
author Sean Halle <seanhalle@yahoo.com>
date Tue, 14 Aug 2012 03:28:16 -0700
parents 8508a9dd967a
children d5f65e311f64
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
diffstat 1 files changed, 22 insertions(+), 20 deletions(-) [+]
line diff
     1.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Mon Aug 13 18:48:19 2012 +0200
     1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Tue Aug 14 03:28:16 2012 -0700
     1.3 @@ -82,62 +82,63 @@
     1.4  \section{Introduction}
     1.5  \label{sec:intro}
     1.6  
     1.7 -Performance visualizations and tuning tools for parallel programs are critical to achieving good performance, and yet current solutions leave something to be desired. All too often, performance tuning consists of staring at abstract views of usage or statistics, trying to \emph{guess} the cause of performance loss. Despite many different views of performance measurements, such as frequency by line of code, message sends and receives, and core usage timelines, the user doesn't know why a function runs in a particular spot on a particular core, nor whether that is desired behavior or erroneous behavior.
     1.8 +Performance visualizations and tuning tools for parallel programs are critical to achieving good performance, and yet the experience of using current solutions leaves something to be desired. All too often, performance tuning consists of staring at abstract views of usage or statistics, trying to \emph{guess} the cause of performance loss. Despite many different views of performance measurements, such as frequency by line of code, message sends and receives, and core usage timelines, the user doesn't know why a function runs in a particular spot on a particular core, nor whether that is desired behavior or erroneous behavior.
     1.9  
    1.10  Examining these myriad views can feel like being advised by the fabled six blind men examining an elephant by touching the part nearest to each \cite{PerfToolPoem}: one, reaching the side, says it is like a wall; the second, feeling the tusk, compares it to a spear; the next, touching the trunk, is reminded of a snake, and so on. All of these views are indeed correct and supply important information, but they don't connect to each other to provide a view of how the parts interact to form a whole.
    1.11  
    1.12 -Current tools may cover all the parts of the application code, but fail to adequately connect their observations to the runtime behavior, scheduling decisions, and consequent hardware behavior. The decision about which task or virtual processor is assigned to which core at what point in time is at the heart of parallel performance, so these choices need to be highlighted and connected to the application features influencing them.
    1.13 +Likewise, current tools may cover all the parts of the application code, but fail to adequately connect their observations to the runtime behavior, scheduling decisions, and consequent hardware behavior. The decision about which task or virtual processor is assigned to which core at what point in time is at the heart of parallel performance, so these choices need to be highlighted and connected to the application features influencing them.
    1.14  
    1.15  In this paper we describe a model and framework that makes these connections, making it straightforward to identify less-than-optimal performance, easier to generate a hypothesis for the losses, and easier to link the hypothesis to specific points in application code.
    1.16  
    1.17  Our approach provides a more complete picture by following a model of parallel computation that suggests what information to gather and provide to the user. The model suggests that the relevant application features are units of work and the constraints governing their execution, and that it is crucial to record and present the decisions taken by the runtime when scheduling those units onto hardware. The model provides a structure to which to attach the measurements captured during execution, which simplifies hypothesis generation by relating the measurements to each other and to the application.
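To make this concrete, the following is a minimal sketch of the kind of record the model suggests attaching to each schedulable unit. The struct and field names here are hypothetical illustrations for exposition, not the actual interface of the implementation described in section \ref{sec:Implementation}:

\begin{verbatim}
/* Hypothetical record of one schedulable unit and the
 * scheduling decision taken for it; names are illustrative. */
typedef struct Constraint {
    int          kind;            /* e.g. lock, message, dependency  */
    const char  *code_location;   /* application source creating it  */
    struct Unit *satisfied_by;    /* unit whose progress released it */
} Constraint;

typedef struct Unit {
    int         unit_id;
    const char *created_at;       /* point in application code       */
    int         core_assigned;    /* runtime's scheduling decision   */
    long long   start_ns, end_ns; /* when the unit actually ran      */
    Constraint *blocked_on;       /* why it could not start earlier  */
} Unit;
\end{verbatim}

Records like these relate each measurement both to the application code that produced the unit and to the runtime decision that placed it, which is what enables the map-like linkage described below.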
    1.18   
    1.19  
    1.20 -Our instrumentation detects the schedulable units in the application code, and the visual presentation indicates how causal activities related to the units result in the observed performance.  The result is like having a map showing idle cores linked back to the spots in the application code responsible.  The visual arrangement makes identifying idle cores immediate, and  enables  quickly generating the correct hypotheses for the causes of the performance losses, and pinpointing where in the application code changes can be made to improve performance.
    1.21 +Our instrumentation detects the schedulable units in the application code, and presents them visually in a way that indicates how causal activities related to the units result in the observed performance. The effect is like having a map showing idle cores linked back to the spots in the application code responsible. The visual arrangement makes identifying idle cores immediate, enables quickly generating the correct hypotheses for the causes of the performance losses, and pinpoints where in the application code changes can be made to improve performance.
    1.22  
    1.23  The model of computation and our visual approach to using it are illustrated with a story line of performance tuning a standard parallel application on a large multi-core system.
    1.24  
    1.25 -We start with an overview of previous approaches in section \ref{sec:related}. We show usage of our visualizations through a case study in section \ref{sec:casestudy}, and then expand on the model behind it in section \ref{sec:theory}. Section \ref{sec:Implementation} will tie the model to implementation details. Finally, we will conclude in section \ref{sec:conclusion}.
    1.26 +We start with an overview of previous approaches in section \ref{sec:related}. We show usage of our visualizations through a case study in section \ref{sec:casestudy}, and then expand on the model behind it in section \ref{sec:theory}. Section \ref{sec:Implementation} ties the model to implementation details. Finally, we conclude in section \ref{sec:conclusion}.
    1.27  
    1.28  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    1.29  \section{Background and Related Work}
    1.30  \label{sec:related}
    1.31  
    1.32 - We believe that the three primary factors to consider when choosing a performance tuning tool are:
    1.33 + When choosing a performance tuning tool, three primary factors to consider are:
    1.34  
    1.35  \begin{itemize} 
    1.36 -\item the ease of recognizing that performance is less than optimal
    1.37 -\item the ease of forming a hypothesis to explain the difference
    1.38 +\item the ease of recognizing performance loss
    1.39 +\item the ease of forming a hypothesis to explain the loss
    1.40  \item the ease of linking the hypothesis to changes to make in application code
    1.41  \end{itemize}
    1.42    
    1.43  
    1.44  
    1.45 -The belief is based on the following description of the activity of performance tuning: 
    1.46 -the activity of performance tuning is an iterative process in which a mental model plays a key role. The programmer takes measurements during execution and compares them to the desired outcome. A hypothesis is generated for what caused any differences, by using a mental model, of causal relationships within  execution. The mental model is then also used to link the hypothesis to code, to suggest a change to make to improve performance. After modification, the code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central to performance tuning.
    1.47 +The mental model that underlies the tool plays a key role in all three of these usefulness factors. This can be seen by examining the activity of performance tuning, which is an iterative process. The programmer takes measurements during execution and compares them to the desired outcome. A hypothesis for the cause of any differences is generated by using a mental model of the causal relationships within the execution. The mental model is then also used to link the hypothesis to code, to suggest a change that should improve performance. After modification, the code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central to performance tuning.
    1.48  
    1.49 -Because of its central role in performance tuning, the mental model is  a major factor when considering usability of performance-tuning tools.  The  ease of forming a hypothesis of the cause of performance losses, and ease of linking that back to changes to make in application code are the main areas that our approach differs, due to its mental model, which is based on the new model of parallel computation.
    1.50 +Because of this central role, the mental model underlying a tool is a major focus when comparing tools. The main way our approach differs is its mental model, which is based on a new model of parallel computation. The new model makes it easier to form a hypothesis for the cause of performance losses, and easier to link that hypothesis back to changes to make in application code.
    1.51  
    1.52 - Until recently, a dominant mental model for parallel computation has been the pthread abstraction. It is very close to the hardware, without much higher-level semantics, which makes it  a problem when analyzing parallel performance. Yet as the earliest and most widespread parallel programming model, and one which serves as the basis on top of which many other models are implemented, it must be supported.
    1.53 + Historically, a dominant mental model for parallel computation has been the software-thread abstraction. It is very close to the hardware, without much higher-level semantics, which makes it problematic when analyzing parallel performance. Yet as the earliest and most widespread parallel programming model, and one that serves as the basis on top of which many other models are implemented, it must be supported.
    1.54  
    1.55 -Pthreads  introduces randomness through OS time-slicing, and its synchronization constructs imply only indirect connections. For instance, very little conclusion can be drawn about the relationship between the computations in two separate threads from their accessing the same lock consecutively. Tools such as Paradyn \cite{PerfToolParadyn} or VTune \cite{PerfToolVTune} that rely on this model consequently have a hard time connecting measurements to the application. They collect a wealth of statistics, but the application is to the tool a foreign process, where things ``just happen'' with causal links between events and code details. The reported problems are things like ``application bottleneck is synchronisation'', and  ``too much time is spent in spinlocks''. If there is more than one use of spinlocks in the application, it is not clear what point in the code is the problem.
    1.56 +A popular version is pthreads, which introduces randomness through OS time-slicing, and whose synchronization constructs imply only indirect connections. For instance, little can be concluded about the relationship between the computations in two separate threads from their accessing the same lock consecutively. Tools such as Paradyn \cite{PerfToolParadyn} or VTune \cite{PerfToolVTune} that rely on this model consequently have a hard time connecting measurements to the details of the application code. They collect a wealth of statistics, but to the tool the application is a foreign process where things ``just happen'', without causal links between events and code. Such tools report conclusions such as ``application bottleneck is synchronisation'' and ``too much time is spent in spinlocks'', without indicating what in the code causes this. If there is more than one use of spinlocks in the application, it is not clear which point in the code is the problem.
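A minimal pthreads sketch of our own makes the indirectness concrete. Both workers below repeatedly acquire the same mutex, so a trace shows their acquisitions interleaved, yet the interleaving is an accident of OS time-slicing and carries no application-level meaning a tool could report:

\begin{verbatim}
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

/* Two threads run this loop; consecutive acquisitions of the
 * same lock by different threads imply no semantic dependency. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);
        counter++;               /* the only shared state */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
\end{verbatim}

A statistics-oriented tool observing this program can report only that much of the time is spent in \texttt{pthread\_mutex\_lock}, without distinguishing which lock site, or which inter-thread dependency, is responsible.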
    1.57  
    1.58  One fix to these issues is to allow users to introduce measuring points into their own code, at the cost of increased programmer effort. Because instrumentation code is written in the source language, it has access to application concepts. This advantage can be kept with automated instrumentation, by providing an instrumenting compiler, as the Tau \cite{PerfToolTau} project does.
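As a sketch of such a hand-inserted measuring point (the macro names here are our own illustration, not Tau's API), a pair of macros can bracket an application-level phase, so the measurement is labeled with a source-level concept rather than an anonymous address:

\begin{verbatim}
#include <stdio.h>
#include <time.h>

/* Hypothetical hand-inserted measuring point: because it lives
 * in the source, it can name the application phase it brackets. */
#define MEASURE_BEGIN(phase)                                   \
    struct timespec t0_##phase;                                \
    clock_gettime(CLOCK_MONOTONIC, &t0_##phase)

#define MEASURE_END(phase)                                     \
    do {                                                       \
        struct timespec t1;                                    \
        clock_gettime(CLOCK_MONOTONIC, &t1);                   \
        fprintf(stderr, "%s: %.3f ms\n", #phase,               \
            (t1.tv_sec  - t0_##phase.tv_sec)  * 1e3 +          \
            (t1.tv_nsec - t0_##phase.tv_nsec) / 1e6);          \
    } while (0)

/* Usage:
 *   MEASURE_BEGIN(lu_factor);
 *   ... application phase to be timed ...
 *   MEASURE_END(lu_factor);
 */
\end{verbatim}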
    1.59  
    1.60  However, as long as the underlying execution model is still threads, there is no meaningful structure, common to all applications, that can be used to generate expressive measurement quantities. Function boundaries and the call graph are not sufficient when execution is parallel. The sequence and frequency of function calls tell little about parallel performance impacts because they have no bearing on synchronization events. Unfortunately, pthreads does not capture even hints as to \emph{why} a given function call ends up blocking or not blocking, nor what the effects on other threads are.
    1.61  
    1.62  Higher level parallel abstractions help alleviate that particular problem. For instance, in an application with MPI message passing \cite{MPI}, the information ``thread 2 spends little time waiting for messages from thread 0 but a lot of time waiting for messages from thread 1'' can be recorded, where in pthreads only ``thread 2 spends a lot of time waiting for a signal'' would be visible. It is much easier to reach the conclusion that the bottleneck is the slow rate at which thread 1 produces data.
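As an illustrative sketch (the ranks, buffer sizes, and tag are hypothetical), the receive phase below could run on rank 2 between \texttt{MPI\_Init} and \texttt{MPI\_Finalize}: each \texttt{MPI\_Recv} names its source, so wait time is attributable to a specific sender, information a bare pthreads condition wait cannot carry:

\begin{verbatim}
#include <mpi.h>
#include <stdio.h>

#define N   1024
#define TAG 0

/* Receive phase on rank 2: each MPI_Recv names its source,
 * so wait time is attributable to a specific sender. */
static void receive_phase(void)
{
    double buf0[N], buf1[N];

    double t0 = MPI_Wtime();
    MPI_Recv(buf0, N, MPI_DOUBLE, 0, TAG, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    double wait_on_rank0 = MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    MPI_Recv(buf1, N, MPI_DOUBLE, 1, TAG, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    double wait_on_rank1 = MPI_Wtime() - t0;  /* large when rank 1
                                                 produces slowly */

    printf("waited %.3f s on rank 0, %.3f s on rank 1\n",
           wait_on_rank0, wait_on_rank1);
}
\end{verbatim}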
    1.63 -Leveraging the MPI library is a well-trodden path, with many tools available \cite{PerfToolParaver} \cite{PerfToolVampir} \cite{PerfToolParadyn} \cite{PerfToolTau}. A frequent complaint of users of these tools is that in complex programs, the large number of messages sent makes it difficult to see problems. As there is little that distinguishes messages beside size, it is difficult to implement good filtering. Once again, we find that the limitations come from a parallel model that captures insufficient information.
    1.64 +
    1.65 +Leveraging the MPI library is a well-trodden path, with many tools available \cite{PerfToolParaver} \cite{PerfToolVampir} \cite{PerfToolParadyn} \cite{PerfToolTau}. A frequent complaint of users of these tools is that in complex programs, the large number of messages sent makes it difficult to see problems. As there is little that distinguishes messages besides size, it is difficult to implement good filtering. Once again, we find that the limitations come from a parallel model that captures insufficient information.
    1.66  
    1.67  Fortunately, parallel languages are evolving in the same direction, and it is widely believed that semantically rich parallel constructs, when they match the structure of the application well, are key to improved productivity. With languages such as CnC \cite{CnCInHotPar} and StarSs \cite{StarSs}, the information available to the language runtime offers a much clearer picture of the constraints  placed on the execution of tasks and where they come from. Consequently, the application structure can be well reflected in visualizations, as the Temanejo debugger  \cite{PerfToolStarSs} for StarSs demonstrates.
    1.68  
    1.69 -In the following section, we will demonstrate how our approach overcomes these challenges through a combination of the following features:
    1.70 +In the following section, we demonstrate how our approach overcomes the limitations of these prior approaches through a combination of the following features:
    1.71  \begin{itemize}
    1.72 -\item  all the factors influencing scheduling decisions will be covered, given sufficient language and runtime pre-requisites
    1.73 -\item works for any parallel programming library or language
    1.74 -\item instrument only the runtime, once,  for all applications
    1.75 -\item flexibility to capture and represent any type of constraints
    1.76 -\item a filtering mechanism shows only the most relevant groups of constraints, as desired
    1.77 +\item  it visualizes all the factors influencing scheduling decisions, given sufficient language and runtime prerequisites
    1.78 +\item it has a filtering mechanism that shows only the most relevant groups of constraints, as desired
    1.79 +\item it has the flexibility to capture and represent any type of constraint
    1.80 +\item it works for any parallel programming library or language
    1.81 +\item it instruments only the runtime, once,  for all applications
    1.82 +
    1.83  \end{itemize}
    1.84  
    1.85  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    1.86 @@ -146,7 +147,7 @@
    1.87  
    1.88  In this section, we illustrate the benefits of connecting measurements to an effective computation model by walking through a typical performance tuning session using our approach. It shows the features in action, and indicates how having those features provides an advantage to the user over methods that lack them.
    1.89  
    1.90 -We first describe the program and language used, and then the features of our visualization. After this preparation, we show a sequence of the visualizations. In each,  we point out how the  performance loss is identified, and which visual features suggest the hypothesis of the cause of the loss.  We show how this directs the user to the specific sections of code that need to be changed, and how the model helps suggest what changes to try. 
    1.91 +We first describe the program and language used, and then the features of our visualization. After this preparation, we show a sequence of the visualizations. In each, we point out how the performance loss is identified, and which visual features suggest the hypothesis for the cause of the loss. We show how the visual features direct the user to the specific sections of code that need to be changed, and how the model helps suggest what changes to try. 
    1.92  
    1.93  \subsection{The Application, and Target Hardware}
    1.94  
    1.95 @@ -1283,3 +1284,4 @@
    1.96  
    1.97  
    1.98   trying to pinpoint the causes of performance loss,  internal structure of 
    1.99 +