changeset 61:cce09bfd652f
perf tuning -- redid intro and some of sect 3
| | |
|---|---|
| author | Sean Halle <seanhalle@yahoo.com> |
| date | Wed, 27 Jun 2012 05:00:03 -0700 |
| parents | 74775fd41630 |
| children | 59a4161e7bf2 |
| files | 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex |
| diffstat | 1 files changed, 19 insertions(+), 19 deletions(-) |
line diff
--- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex Wed Jun 27 02:49:38 2012 -0700
+++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex Wed Jun 27 05:00:03 2012 -0700
@@ -84,23 +84,22 @@

 Performance visualizations and tuning tools for parallel programs are critical to achieving good performance, and yet current solutions leave something to be desired. All too often, performance tuning consists of staring at abstract views of usage or statistics, trying to \emph{guess} the cause of performance loss. Despite many different views of performance measurements, such as frequency by line of code, message sends and receives, and core usage timelines, the user doesn't know why a function runs in a particular spot on a particular core, nor whether that is desired behavior or erroneous behavior.

-Examining these myriad views can feel like being advised by the fabled six blind men examining an elephant by touching the part nearest to each\ref{The poems of John Godfrey Saxe/The Blind Men and the Elephant}: one, reaching the side, proclaims it much like a wall; the second, feeling the tusk, compares it to a spear; the next, feeling the trunk, is reminded of a snake, and so on. All of these views are indeed correct and provide important information, but they don't connect to provide a view of the whole, especially the interactions between parts.
+Examining these myriad views can feel like being advised by the fabled six blind men examining an elephant by touching the part nearest to each\ref{The poems of John Godfrey Saxe/The Blind Men and the Elephant}: one, reaching the side, proclaims it much like a wall; the second, feeling the tusk, compares it to a spear; the next, feeling the trunk, is reminded of a snake, and so on. All of these views are indeed correct and supply important information, but they don't connect to each other, to provide a view of how the parts interact to form a whole.

 Current tools may cover all the parts of the application code, but fail to adequately connect their observations to the runtime behavior, scheduling decisions, and consequent hardware behavior. The decision about which task or virtual processor is assigned to which core at what point in time is at the heart of parallel performance, so these choices need to be highlighted and connected to the application features influencing them.

-To provide a more complete picture, what is needed is a model or framework that can capture the relevant application features (units of work and the constraints governing their execution), expose the scheduling decisions taken in accordance with the application and other constraints, and provide a structure to which to attach the measurements captured during execution. This is needed to relate the measurements to each other and to the application.
+In this paper we describe a model and framework that provides a more complete picture. It captures the relevant application features (units of work and the constraints governing their execution), exposes the scheduling decisions taken in accordance with the application and other constraints, and provides a structure to which to attach the measurements captured during execution.
+The model relates the measurements to each other and to the application.
+

-Our contribution is such a model and framework. Using it, we generate views that indicate the schedulable units in the code, and all the constraints on scheduling them. It includes constraints imposed by the application, the runtime implementation and hardware details.
-
-The connections made within the views enable more quickly generating the correct hypotheses for the cause of performance losses. The views connect units to the specific segments of code that compose the units, and connect each constraint on scheduling choices to the precise source of the constraint, within the code, hardware, or runtime.
+Using the model, we generate views that show the schedulable units in the code, runtime overhead assigned to each, the scheduling decisions made for them by the runtime, and consequent usage of the cores. Identifying idle cores is immediate, and the connections indicated by visual features enable more quickly generating the correct hypotheses for the causes of the performance losses. The visual features connect each unit to the segment of code executed, and to the constraint in the code that combined with the runtime to cause its placement in time and location. The pattern of placements, combined with contents of the code, leads to the hypothesis.




-In the following, we will describe our model of computation, which drives the instrumentation and visualization, and links the aspects of performance to the collected information.
-The usage of the model and visualization is illustrated with a story line of performance tuning a standard parallel application on a large multi-core system.
+In this paper, we describe our model of computation, and illustrate its usage with a story line of performance tuning a standard parallel application on a large multi-core system.

-We start with refresher on performance tuning and an overview of previous approaches in section \ref{sec:related}. We shall usage of our visualizations through a case study in section \ref{sec:casestudy}, and then expand on the model behind it in section \ref{sec:theory}. Section \ref{sec:Implementation} will tie the model to implementation details. Finally, we will conclude in section \ref{sec:conclusion}.
+We start with a refresher on performance tuning and an overview of previous approaches in section \ref{sec:related}. We show usage of our visualizations through a case study in section \ref{sec:casestudy}, and then expand on the model behind it in section \ref{sec:theory}. Section \ref{sec:Implementation} will tie the model to implementation details. Finally, we will conclude in section \ref{sec:conclusion}.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{Background and Related Work}
@@ -144,7 +143,7 @@

 These shortcomings are rectified by recent parallel languages such as CnC[] and StarSs[], which cleanly identify tasks and the constraints on them. The people who develop such a language usually also develop tools to go with it.

-The StarSs tool[]\ tries to simplify the view for the user. It doesn't give performance information directly, but instead identifies tasks and tells the user whether it thinks the task size is too small, just right, or tool big. Too small has too much runtime overhead, while too big has too few tasks to keep the cores busy.
+The StarSs tool[]\ prioritizes simplifying the view for the user. It doesn't give performance information directly, but instead identifies tasks and tells the user whether it thinks the task size is too small, just right, or too big. Too small has too much runtime overhead, while too big has too few tasks to keep the cores busy.

 This is a step in the right direction, but it doesn't include effects of code choices on communication, and gives the programmer only limited information. Subtle and complex interactions between code, runtime, and hardware are often to blame for performance loss, and the limited information supplied here is no help with such complexities.

@@ -168,20 +167,20 @@

 The application code uses language features to create virtual processors (VP). The first VP created divides the work into a number of pieces and creates a new VP for each piece.

-How many pieces is set by a tuning parameter in the code, and the number of cores. It uses language features to distribute the VPs across the cores, initially in a round-robin fashion.
+How many pieces is set by a tuning parameter in the code, and the number of cores. The application uses language features to distribute the VPs across the cores, in a round-robin fashion (before tuning).

 It then creates a results VP that receives a partial-result from each piece and accumulates the results. The original divider VP then waits for the results VP to indicate completion, after which the language shuts down.

 \subsection{The language}
 The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors (VPs). It has commands for creating and destroying VPs, and three kinds of send-receive paired operations.

-The first, \emph{send\_from\_to} specifies a specific sender and specific receiver. It is used by the results VP to tell the divider VP that the work is complete. The second, \emph{send\_of\_type\_to}, specifies a specific receiver, but the sender is anonymous, which increases flexibility while maintaining some control over scope. This is used by the VPs doing the pieces to send their partial-result to the results processor. The third kind, \emph{send\_of\_type}, only specifies the type, and so acts as a global communication channel; this is not used in our application.
+The first, \emph{send\_from\_to} specifies both sender and receiver VPs. It is used by the results VP to tell the divider VP that the work is complete. The second, \emph{send\_of\_type\_to}, specifies only a specific receiver, leaving the sender anonymous, which increases flexibility while maintaining some control over scope. This is used by the worker VPs doing the pieces to send their partial-result to the results processor. The third kind, \emph{send\_of\_type}, only specifies the type, and so acts as a global communication channel; this is not used in our application.

 The language also includes a \emph{singleton} construct that designates a piece of code as to be executed only once, which we use to rearrange and copy data to get better cache behavior. A given copy is shared by several virtual processors on different cores, but the copy is only to be performed once.

 Miscellaneous performance constructs are also available, such as one to force which core a virtual processor is assigned to. We use this in our example program to control scheduling.

-A note on terminology: the word ``task'' has acquired multiple meanings in the literature, making it a vague term. We often use, instead, the term ``work-unit'', which is defined precisely as the trace-segment performed on a core, between two successive scheduling events, plus the set of datums consumed by that trace segment. The word task often maps well onto this precise definition, and we use both words, but mean the precise work-unit definition when we say task.
+A note on terminology: We often use the term ``work-unit'', which we define precisely, instead of ``task'', which has acquired multiple meanings in the literature. Work-unit is defined as the trace-segment performed on a core, between two successive scheduling events, plus the set of datums consumed by that trace segment. The word task often maps well onto this definition, and we use both words, but mean the precise work-unit definition when we say task.



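The hunk above describes the divider/worker/results structure and names SSR's rendez-vous operations (send_from_to, send_of_type_to), but the changeset contains no SSR source. As a rough illustration of the shape only, the sketch below uses plain C with pthreads and a hand-rolled mailbox in place of SSR; NUM_PIECES, Mailbox, worker_vp, and every other name here are invented for this sketch and are not SSR constructs. The mailbox only mimics the "anonymous sender, known receiver" flow of partial results into the results VP, and the final completion notice back to the divider.

```c
/* Sketch only: mimics the divider / worker / results pattern with pthreads.
 * A real SSR program would create VPs and use the rendez-vous sends named
 * in the hunk above; nothing here is SSR's actual API. Build with -pthread. */
#include <pthread.h>
#include <stdio.h>

#define NUM_PIECES 8            /* stands in for the tuning parameter: number of work VPs */

typedef struct {                /* stands in for the results VP's receive endpoint */
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    double          sum;        /* accumulated partial results */
    int             received;   /* how many partial results have arrived */
} Mailbox;

static Mailbox results_box = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0.0, 0 };

/* Worker "VP": computes a partial result for its piece and delivers it to the
 * results mailbox (playing the role of send_of_type_to: known receiver, anonymous sender). */
static void *worker_vp(void *arg) {
    long piece = (long)arg;
    double partial = (double)piece * piece;   /* placeholder for real work */

    pthread_mutex_lock(&results_box.lock);
    results_box.sum += partial;
    results_box.received++;
    pthread_cond_signal(&results_box.ready);
    pthread_mutex_unlock(&results_box.lock);
    return NULL;
}

int main(void) {
    pthread_t workers[NUM_PIECES];

    /* Divider "VP": cuts the work into pieces and creates one worker per piece. */
    for (long p = 0; p < NUM_PIECES; p++)
        pthread_create(&workers[p], NULL, worker_vp, (void *)p);

    /* Results "VP" role, inlined here: wait until every partial result has arrived;
     * reporting completion plays the role of send_from_to back to the divider. */
    pthread_mutex_lock(&results_box.lock);
    while (results_box.received < NUM_PIECES)
        pthread_cond_wait(&results_box.ready, &results_box.lock);
    printf("accumulated result: %f\n", results_box.sum);
    pthread_mutex_unlock(&results_box.lock);

    for (int p = 0; p < NUM_PIECES; p++)
        pthread_join(workers[p], NULL);
    return 0;
}
```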
@@ -189,7 +188,7 @@
 \label{subsec:visualization_def}
 The first visualization is what we refer to as a scheduling consequence graph (SCG), or just consequence graph (CG). It depicts the scheduling operations performed by the runtime, and the consequent usage of the cores.

-The second visualization depicts constraints on those scheduling decisions, which come from the application, such as dependencies in the code. These limit the choices the runtime is allowed to make. We call this the Unit \& Constraint Collection, or UCC.
+The second visualization we call the Unit \& Constraint Collection, or UCC. It depicts constraints on the scheduling decisions that come from the application. Constraints can be dependencies stated explicitly in the code, or can be implied by language constructs. These limit the choices the runtime is allowed to make.

 The UCC shows only application-derived information, as opposed to the consequence graph, which combines the \textit{use} of the UCC-depicted constraints with runtime-imposed dependencies and hardware-imposed constraints. Hence, the UCC states the degrees of freedom enabled by the application, while the consequence graph states how those were made use of, by a particular runtime on particular hardware.

@@ -215,7 +214,7 @@

 The other type of visual feature seen in Fig \ref{fig:SCG_expl} is lines. Each represents a construct that influenced scheduling, where the color indicates which construct. A line represents two things: a constraint, whose satisfaction made the lower unit ready, and a decision by the runtime to start the lower unit on that core.

-In general, lines may also be drawn, representing other kinds of interactions that affect core usage. For example, our runtime implementation only allows one core at a time to access shared scheduling state. Visualization of this can be turned on, as additional lines linking the gray runtime regions of blocks (visualization of such interactions is turned off in this paper).
+In general, lines may also be drawn that represent other kinds of interactions, which affect core usage. For example, our runtime implementation only allows one core at a time to access shared scheduling state. Visualization of this can be turned on, as additional lines linking the gray runtime regions of blocks (visualization of such interactions is turned off in this paper).

 Two work-unit blocks that appear in sequence and have no lines drawn to them often have a causal dependency between them, due to the semantics of the base language. Visualization of these dependencies is also turned off, but can be inferred via the link to the code.

@@ -224,7 +223,7 @@

 Note that many different orderings can be validly chosen. Which scheduler choices are valid is determined by three kinds of constraints: the application code constraints, hardware constraints, and runtime implementation imposed constraints.

-Also note that because the vertical axis represents the progress of the program execution, the user can see at a glance the total runtime (height), level of parallelism during the run (number of filled columns at a specific height), and processor utilization (blue-covered area vs. white). All figures are at the same scale, so they can be compared directly.
+The visual features allow the user to see at a glance the total execution time (height), idle cores during the run (empty columns), cache behavior (color of work regions), degree of overhead (size of gray regions), and which units constrained which other units (lines). All consequence graphs in this paper are at the same scale, so they can be compared directly.



@@ -289,7 +288,7 @@
 \subsubsection{Second Run}
 After fixing this, the next run (Fig \ref{story:b}) corresponds much more to the expected execution behaviour. However, there remains a noticeable section at the beginning where only 3 cores have work and the other 37 remain idle.

-Zooming in on those cores, we see that creation code starts running on core 0, within the creation VP, and then the next block on the core is work! Creation stops, starving the other cores. Looking at the creation code, we see that the creation VP assigns the first work VP to its own core, so that work is now waiting in the queue to execute there. When it creates the second work VP, that creation call switches core 0 to the runtime. When done with creation, it takes the next VP from the queue, witch is that waiting work VP. Hence core 0 does the work next instead of continuing with creation (the merits of work stealing or other scheduling strategies are independent from this illustration of how to use this approach to performance tuning).
+Zooming in on those cores, we see that creation code starts running on core 0, within the creation VP, and then the next block on the core is work! Creation stops, starving the other cores. Looking at the creation code, we see that the creation VP assigns the first work VP to its own core, so that work is now waiting in the queue to execute there. When it creates the second work VP, that creation call switches core 0 to the runtime. When done with creation, the runtime takes the next VP from the queue, which is that waiting work VP. Hence core 0 does the work next instead of continuing with creation (the merits of work stealing or other scheduling strategies are independent from this illustration of how to use this approach to performance tune).

 The hypothesis was generated by looking at the code linked to each block and noting the visual pattern that creation code stopped running on core 0. Work code started running instead, and only after it finished did creation code start again. Hence, visual cues led directly to the hypothesis.

@@ -297,12 +296,12 @@

 \subsubsection{Third run} Assigning work to the other cores first gives us Fig \ref{story:c}. The section that was at the top, with idle cores, has disappeared. A small idle period can still be observed between the first and the second set of work tasks, because the work tasks have roughly the same length and the work on core 0 starts last. It thus holds up creation, which re-starts after all the others have finished work (note that work on some cores takes slightly longer because that core performs the copy-transpose singleton, and also variations are caused by cache misses).

- It is also noticeable that in the second set of work units to be distributed, not enough work pieces remain to fill all cores. 16 out of 40 remain idle.
+ It is also noticeable that in the second set of work units to be distributed, not enough work pieces remain to fill all cores. 16 out of 40 remain idle at the bottom.

-\subsubsection{Fourth and fifth runs} To try to fill the empty columns at the end, we modified the size of the work units. However, as figures \ref{story:d} and \ref{story:e} show, this did not help, as the time spent creating the increased number of units becomes the bottleneck, and the time lost between sets grows larger than the time that previously was lost on the cores not receiving work.
+\subsubsection{Fourth and fifth runs} To try to fill the empty columns at the end, we modified the size of the work units. However, as figures \ref{story:d} and \ref{story:e} show, this did not help. The blank areas between ``bands'' of work can be seen by the red lines to be due to creation. The increased number of units causes creation to be the bottleneck again, and the time lost between sets grows larger than the time that previously was lost.

 \subsubsection{Sixth run} At this point we wanted to try the road not chosen, dedicating a core to the creation VP.
-Going back to version b of the code and implementing this solution instead leads to fig. \ref{story:f}. The delay between the two sets has disappeared, leading to a 4\% shorter execution time.
+Going back to version b of the code and implementing this solution, instead, leads to fig. \ref{story:f}. The blank area between the two sets has disappeared, showing a 4\% shorter execution time.

 \subsubsection{Seventh and eighth runs}As core 0 is now empty after the creation phase at the beginning, we also moved the receive VP there (fig. \ref{story:g}). This added only a minimal improvement at this size of work unit, but allows overlapping the result collection with other work, which is an advantage when cutting the work into more pieces, requiring longer collection (fig. \ref{story:h}).

@@ -973,3 +972,4 @@
 graph. It depicts all the scheduling operations performed by the runtime,


+
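The second through sixth runs in the diff above turn on where work VPs are placed relative to the creation VP. The paper's program uses SSR's own construct for forcing a VP onto a core, which is not shown in this changeset; as a stand-in, the sketch below shows the same placement idea with plain pthreads and Linux pthread_setaffinity_np, spreading work threads round-robin over cores 1..N-1 so that core 0 stays free for the thread doing creation. NUM_PIECES and the empty worker body are placeholders, not the application from the case study.

```c
/* Sketch only: illustrates the "keep core 0 for creation" placement idea from the
 * third and sixth runs, using Linux CPU affinity rather than SSR's core-assignment
 * construct. Build on Linux with -pthread. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_PIECES 16            /* placeholder for the number of work units */

static void *worker(void *arg) {
    (void)arg;                   /* placeholder for a real work unit */
    return NULL;
}

int main(void) {
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    long spread = ncores > 1 ? ncores - 1 : 1;   /* cores available for work: 1..ncores-1 */
    pthread_t tid[NUM_PIECES];

    for (long p = 0; p < NUM_PIECES; p++) {
        pthread_create(&tid[p], NULL, worker, NULL);

        /* Round-robin over cores 1..ncores-1, deliberately skipping core 0:
         * the creating thread keeps its core, so creation never stalls behind
         * a queued work unit the way the second run did. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(ncores > 1 ? 1 + (int)(p % spread) : 0, &set);
        pthread_setaffinity_np(tid[p], sizeof(set), &set);
    }

    for (int p = 0; p < NUM_PIECES; p++)
        pthread_join(tid[p], NULL);

    printf("placed %d work threads away from core 0\n", NUM_PIECES);
    return 0;
}
```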
