changeset 46:e3e35676068a

perf tune: Introduction and beginning of Related Work rewritten
author Nina Engelhardt <nengel@mailbox.tu-berlin.de>
date Wed, 23 May 2012 18:31:52 +0200
parents 04f4e1c4972a
children 364de5b006db
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
diffstat 2 files changed, 21 insertions(+), 16 deletions(-) [+]
line diff
     1.1 Binary file 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf has changed
     2.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Wed May 16 15:38:44 2012 +0200
     2.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Wed May 23 18:31:52 2012 +0200
     2.3 @@ -282,7 +282,7 @@
     2.4  %
     2.5  
     2.6  
     2.7 -
     2.8 +%MOIRAI: MOdel for Integrated Runtime Analysis through Instrumentation
     2.9  \title{Integrated Performance Tuning Using Semantic Information Collected by Instrumenting the Language Runtime}
    2.10  
    2.11  \author{
    2.12 @@ -310,33 +310,36 @@
    2.13  
    2.14  \section{Introduction and Motivation}
    2.15  
    2.16 -Performance visualizations and tuning tools for parallel programs leave something to be desired, and yet are critical to achieving good performance. They fail mainly because their displays of measurements  leave the user to \emph{guess} the cause of performance loss.  They    provide many different views of performance measurements,  like statistics by line of code, message sends and receives, and core usage timelines. But the user doesn't know why a function runs in a particular spot on a particular core, nor whether that  is desired behavior vs erroneous behavior.
    2.17 +Performance visualizations and tuning tools for parallel programs are critical to achieving good performance, and yet current solutions leave something to be desired. All too often, performance tuning consists of staring at columns of numbers, trying to \emph{guess} the cause of performance loss. Despite many different views of performance measurements, such as statistics by line of code, message sends and receives, and core usage timelines, the user doesn't know why a function runs in a particular spot on a particular core, nor whether that is desired behavior or erroneous behavior.
    2.18  
     2.19 +Examining these myriad views can feel like being advised by the fabled six blind men examining an elephant by touching the part nearest to each\cite{The poems of John Godfrey Saxe/The Blind Men and the Elephant}: one, reaching the side, proclaims it much like a wall; the second, feeling the tusk, compares it to a spear; the next, feeling the trunk, is reminded of a snake; the fourth, wrapping his arms around a leg, divines a tree; the fifth, chancing upon the ear, likens it to a fan; while the sixth, seizing the tail, insists that it is like a rope. All of these views are indeed correct and provide important information, but it is hard to reconstruct from them a view of the whole.
    2.20  
    2.21 -It's like the allegory of the five blind people and the elephant: one touches the ear and says it's round and flat, another touches the leg and says it's a tree, and so on. They each are correct, but the views don't connect to tell them the whole picture. 
    2.22 +Current tools, although they cover all the parts of the application code, fail to adequately connect their observations to the runtime behavior, scheduling decisions, and consequent hardware behavior. The decision about which task or virtual processor is assigned to which core at what point in time is at the heart of parallel performance, so these choices need to be highlighted and connected to the application features influencing them.
    2.23  
    2.24 -Current tools cover all the parts of the application code, but fail to adequately connect these to the runtime behavior, scheduling decisions, and consequent hardware behavior.  The decision about which task or virtual processor is assigned to which core at what point in time is the heart of parallel performance, so the tool needs to connect such choices and behaviors to application features. That is the information missing: the causes of the observed behavior, and how those causes tie to application features.
     2.25 +To provide a more complete picture, what is needed is a model or framework that can capture the relevant application features (units of work and the constraints governing their execution), expose the scheduling decisions taken within the parameters allowed by the application and other constraints, and provide a structure to which to attach the measurements captured during execution, so that the measurements can be related to each other and to the application.
    2.26  
     2.27 +Our contribution is such a model and framework. Using it, we generate views that indicate the schedulable units in the code and all the constraints on scheduling them, and connect these to the application code. The model includes constraints imposed by the application, the runtime implementation, and hardware details.
    2.28  
    2.29 -To fix this, a model is needed that provides the linkage to connect the measurements together.  It  should be in terms of scheduling decisions, having the units the decisions are made on, and the various sources of constraints on the decisions. 
    2.30 +The connections made within the views enable more accurate hypotheses for the cause of performance losses. The views connect units to the specific segments of code that compose the units, and connect each constraint on scheduling choices to the precise source of the constraint, within the code, hardware, or runtime. Resource usage is separated into categories: application work, non-overlapped communication (which results from scheduling decisions), and scheduling/runtime overhead. 
     2.31 +The effects of parameter choices that affect units and constraints are also shown.
    2.32 +In visual form, the integration and connections have the effect of making areas of performance loss evident, and the causes of the loss clear. 
    2.33  
    2.34 -Our contribution is such a model and framework. Using it, we generate views that indicate the schedulable units in the code, all the constraints on scheduling them, and connect them to application code. It includes constraints imposed by the application, the runtime implementation  and hardware details. 
    2.35 +In the following, we will describe our model of computation, which drives the instrumentation and visualization, and links the aspects of performance to the collected performance tuning information.
    2.36 +The model and visualization are illustrated with a story line, which shows how they are used to performance tune a standard parallel application on a large multi-core system.
    2.37  
    2.38 -The connections made within the views encourage hypotheses for the cause of performance losses.  The views connect units to the specific segments of code that compose the units, and connect each constraint on scheduling choice to the precise source of the constraint, within the code, hardware, or runtime. They separate resource usage into the categories: application work, non-overlapped communication (which results from scheduling decisions), and scheduling/runtime overhead. They also integrate parameter choices, within the code, that affect unit creation and constraints. In visual form, the integration and connections have the effect of making areas of performance loss evident, and the causes of the loss clear. 
    2.39 -
    2.40 - 
    2.41 -We describe our model of computation, which relates the aspects of performance, and the instrumentation and visualization that are guided by that model and collect performance tuning information and link it together. The model and visualization are illustrated with a story line, which shows how they are used to performance tune a standard parallel application on a challenging multi-core system. 
    2.42 -
    2.43 -We start with background on performance tuning and previous approaches to tools in Section X. Then...
     2.44 +We start with background on performance tuning and an overview of previous approaches in Section~\ref{related}. We introduce our model through a case study in Section~\ref{casestudy}, and then expand on the theory behind it in Section~\ref{theory}. Finally, we conclude in Section~\ref{conclusion}.
    2.45  
    2.46  
    2.47  \section{Background and Related Work}
    2.48 -Performance tuning follows the standard pattern of iterating. In each iteration, first measurements of performance are taken and compared to what was desired. Some mental model is  used to generate a hypothesis of why the measurement was less than desired. A mental model is then used to link the hypothesis to things within the programmer's control,  to suggest a change to make to the code. The modified code is then run again and the iteration-step repeats until the person doing the tuning is satisfied.
    2.49 +\label{related}
    2.50  
    2.51 +Performance tuning is an iterative process. The programmer usually has some expectation of the performance of her program, and takes measurements during execution that are then compared to the desired outcome. A mental model, constructed through experience and knowledge of the mechanics of execution, is then used to generate a hypothesis explaining any discrepancies between the measurement and expectations. This hypothesis is then linked, again through a mental model, to things within the programmer's control, to suggest a change to make to the code. The modified code is then run again, and these steps are repeated until the programmer is satisfied with the performance of the program.
    2.52  
    2.53 -For performance of parallel computation, more than just runtime measurements are important. Context for those measurements must also be gathered, which includes the characteristics of the units of work that are scheduled onto resources, as well as the  constraints on scheduling those. Then during the run, not only is hardware usage measured but also the scheduling decisions of why the hardware was used that way.
     2.54 +To understand the performance of parallel programs, more than just runtime measurements are important. Context for those measurements must also be gathered, including the characteristics of the units of work that are scheduled onto resources and the constraints on scheduling them. It is also crucial to record during the run not only measurements of when the hardware was busy doing what, but also the scheduling decisions that effected this hardware usage.
    2.55  
    2.56 -Based on this pattern of performance tuning, we divide other approaches to performance tuning into five categories and talk about them separately. First, the four classic approaches, then a newer breed of tool that is starting to emerge.
    2.57 +Based on this pattern of performance tuning, we evaluate the advantages and shortcomings of different tools' representations of an application and its performance. The first four approaches are found, sometimes in combinations, in most classic performance evaluation tools, while the fifth is starting to emerge, enabled by the growing adoption of task-based programming languages in recent years.
    2.58 +
    2.59 +%%
    2.60  
    2.61  \subsection{Thread-model based Approaches}
    2.62  Most of the older more established tools come from the threads world, and conceive of the application as a processor that performs actions, but don't include the concept of application-defined tasks nor constraints on them. This makes them unable to directly connect statistics they gather to application features.  The lack of connection forces the user to guess at what aspect of the code is responsible for observed performance.
    2.63 @@ -370,7 +373,7 @@
    2.64  
    2.65  
    2.66  \section{Illustrative Story of Performance Tuning}
    2.67 -
    2.68 +\label{casestudy}
    2.69  
    2.70  \begin{figure*}[t!]
    2.71    \begin{minipage}[b]{0.2\textwidth}
    2.72 @@ -527,6 +530,7 @@
    2.73  
    2.74  
    2.75  \section{The Model Behind the Visualization}
    2.76 +\label{theory}
    2.77  
    2.78  Now we expand on the model behind the visualizations, which is what ties the information together. It has two parts, a \emph{Unit \&\ Constraint Collection (UCC)}, and a \emph{Scheduling Consequence Graph} (just consequence graph or CG).  The UCC indicates the freedom of choice the application allows, encoding what the programmer has control over. The consequence graph says which of those were actually taken during the run and the consequences of that set of choices. We give a precise definition of UCC then consequence graph, in turn.
    2.79  
    2.80 @@ -623,6 +627,7 @@
    2.81  The last question is how to handle communication consequences. This is tricky because decisions in higher-level runtimes set the context for decisions in lower-level ones. This means a higher-level choice is linked to the consequences from lower-level choices. The value of a consequence graph is linking the size of boxes in it to the decisions made by the scheduler, as represented by the shape. It's not clear how to divide, among the levels, the time that cores spend waiting for non-overlapped communication. We have no good answer at the moment and leave it for future work.
    2.82  
    2.83  \section{Conclusion}
    2.84 +\label{conclusion}
    2.85  We have shown how to apply a computation model to instrument a language runtime for collecting measurements that connect: to each other, to application structure, to scheduling decisions, and to hardware. A simple visualization of the data has features that indicate lost performance, and features that visually link the lost performance to the cause, no matter if the cause is application structure, language runtime implementation, or hardware feature.  It is this linkage, due to the computation model, that sets this approach apart from others. 
    2.86   
    2.87