changeset 42:a4e0504b60f6

Perf Tuning -- cleaning up intro and related work -- rearrange related work
author Some Random Person <seanhalle@yahoo.com>
date Fri, 04 May 2012 08:21:46 -0700
parents ae51818f61c8
children 980d375417a3
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
diffstat 1 files changed, 80 insertions(+), 77 deletions(-) [+]
line diff
     1.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri May 04 06:40:58 2012 -0700
     1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri May 04 08:21:46 2012 -0700
     1.3 @@ -346,20 +346,26 @@
     1.4  
     1.5  For example, Tau is a highly cited older system for performance tuning parallel applications, and is representative of thread-centric approaches. It integrates many data sources and has rich displays. However, its model is one of cores, memories, and thread contexts, with actions taken on or by each. It has no well-defined concept of scheduling, of the units scheduled, or of constraints on scheduling those units. Hence, it has no view that integrates the parallelism-specific information at the heart of performance: tasks, constraints on them, and scheduling choices.
     1.6  
     1.7 -Another highly cited classic performance tuning system is Paradyn[], which represents . It is meant for applications that run for several days on multi-thousand node clusters. Its model of computation is based on events, both the timing of events and counts of events. It has a system for user-supplied instrumentation to collect event information and it has a hypotheses mechanism that protects the user from having to write custom code to test their hypotheses. However, the hypotheses are in terms of the timing and counts of events. not the parallel computation relevant information of units of scheduled work and the scheduling decisions made on those. 
     1.8 +Another highly cited classic performance tuning system is Paradyn[], which is representative of systems that model parallel computation as a collection of events. It is meant for applications that run for several days on multi-thousand-node clusters. Its model of computation is based on events, both their timing and their counts. It has a system for user-supplied instrumentation to collect event information, and it has a hypothesis mechanism that spares the user from having to write custom code to test their hypotheses. However, the hypotheses are stated in terms of the timing and counts of events, not the parallelism-relevant information of units of scheduled work and the scheduling decisions made on them. 
     1.9  
    1.10  
    1.11  
    1.12 -Paragraph instruments MPI library. So it's an event-based model -- for cores only tells if busy, communication specific overhead, or idle.
     1.13 +Paragraph also follows an event-based model, and represents the large collection of simpler tools that instrument the MPI library. It shows whether cores are busy and indicates communication overhead, but lacks any features that tie the communication pattern observed back to application-code features, which are what the programmer actually controls. It also fails to show runtime overhead, or which portions of idle time are caused by runtime-internal constraints.
    1.14  
    1.15 -So, Tau is thread-view, Paragraph is MPI view, Paradyn is event view.
     1.16 +The commonality among the classic approaches is the lack of a model of parallel computation expressed in terms of scheduled units of work, the constraints on them, and the scheduling decisions made on them. 
    1.17  
    1.18 -?
     1.19 +One difficulty faced by early tools is that parallel applications written with MPI or threads effectively end up implementing a runtime system inside the application code. In such a case, the units of work are only implicit in the code, and difficult for tools to recognize. Likewise, constraints on scheduling are enforced by the code, but never stated in any explicit form. 
    1.20  
    1.21 -?show measurements but no structure. Recently 
    1.22 + These shortcomings are rectified by recent parallel languages such as CnC and StarSs, which cleanly identify tasks and constraints on them. The people who develop such a language also develop language-specific tools to go with it. 
    1.23  
    1.24 - -- 
    1.25 -Seeing need task based languages now -- people who dev lang also dev tools to go with it. Direction is clearly going towards task-based, but not there yet
     1.26 +StarSs clearly centers on tasks, and even addresses scheduling to some degree. However, its handling of scheduling is limited: a task can merely be placed into a queue, so scheduling can be manipulated only coarsely, and the accompanying view does not expose all of the constraints -- it is missing the runtime overhead, and missing the idle time that results from non-overlapped communication. 
    1.27 +
     1.28 +The StarSs tool tries to simplify the view for the user. It doesn't give performance information directly; instead, it identifies tasks and indicates whether it considers each task's size too small, just right, or too large. The recommended task size lies between too small, which incurs too much overhead, and too large, which leaves too few tasks to load balance. The view colors a task one color if too short, another if just right, and a third if too long.
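+
+As a rough sketch of this tradeoff (our own simplification, not the StarSs tool's actual model): if total work $W$ is split into $N$ equal tasks with per-task runtime overhead $h$ and run on $P$ cores, the execution time is roughly
+\[
+T(N) \;\approx\; \left\lceil \frac{N}{P} \right\rceil \left( \frac{W}{N} + h \right),
+\]
+so too few tasks ($N < P$) leaves cores idle, while too many tasks makes the overhead term $N h / P$ dominate; the recommended size sits between these extremes.
+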
    1.29 +
    1.30 +
    1.31 +
    1.32 +
    1.33 +==========
    1.34  
     1.35  MPI is also a machine-based abstraction that gives communication information, but it doesn't have a concept of constraints. It's sort of in-between.
    1.36  
    1.37 @@ -367,9 +373,6 @@
    1.38  
    1.39  
    1.40  
    1.41 -StarSs is clearly thinking about tasks, and even some about scheduling. but limited on scheduling -- can just place task into a queue, so can sort of manipulate scheduling but view doesn't really give all the constraints -- missing the runtime overhead, and missing the idle-time consequent from non-overlapped comm. 
    1.42 -
    1.43 -The StarSs tool tries to simplify the view for the user. It doesn't give performance information directly, but instead identifies tasks and tells the user whether it thinks the task size is too small, just right, or   -- instead, it has a recommended task size, which is between too small, which has too much overhead and too big, which has too few tasks to load balance. It makes task a one color if too short, another if just right, and a third if too long.
    1.44  
    1.45  =======================?
    1.46  
    1.47 @@ -383,7 +386,6 @@
    1.48  ===== 
    1.49  
    1.50  
    1.51 -We've had basic perf tuning, then Tau, now 
    1.52  
    1.53  A survey of the most highly cited classic papers shows the commonality..
    1.54  
    1.55 @@ -432,7 +434,7 @@
    1.56  -- Shows what each core spends each bit of time on
    1.57  -- Visual representation directly indicates areas of performance loss
    1.58  
    1.59 -\begin{table*}[ht]
    1.60 +\begin{table*}[ht!]
    1.61  \caption
    1.62  {Comparison of Features and Benefits
    1.63  }
    1.64 @@ -461,71 +463,8 @@
    1.65  
    1.66  \section{Illustrative Story of Performance Tuning}
    1.67  
    1.68 -We start by showing the tool being used during a typical performance tuning session, to see how its features give benefit, and how competing tools' lack of those features makes the work more difficult.
    1.69  
    1.70 -We wish to use a standard program that the reader has likely already experienced attempting to performance tune, or else wondered why its performance wasn't better. The best example is likely matrix multiply, with which the reader should be familiar, allowing concentration on the tool without distraction about the application. We run it on a machine with 4 sockets by 10 cores each, for a total of 40 physical cores.
    1.71 -
    1.72 -The application code includes a function that automatically divides the work into a number of tasks, based on the number of cores and a tuning parameter. It distributes the tasks across the cores in a round-robin fashion, and then waits for completion of the calculation before initiating shutdown. The answers produced by the tasks are collected by a ``receiver" function and accumulated into the result matrix, whereupon it notifies the setup function of completion.
    1.73 -
    1.74 -The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors (VPs). It has comands for creating and destroying virtutal processors, and three kinds of send-receive pairs. The first, \emph{send\_from\_to} specifies a specific sender and specific receiver. We use it to notify the VP dedicated to dividing the work (``creator VP'') that the VP collecting results (``results VP'') has received all sub-results and output the final result. The second, \emph{send\_of\_type\_to}, specifies a specific receiver, but the sender is anonymous, which increases flexibility while maintaining some control over scope. This construct is used by the processors multiplying sub-matrices to send their sub-result to the results processor. The third kind, \emph{send\_of\_type}, only specifies the type, and so acts as a global communication channel; this is not used our application. The language also includes a \emph{singleton} construct that designates a piece of code as to be executed only once, which we use to encapsulate a transposition of the second operand sub-matrix to column-major representation in order to fit access order to the cache lines. Each sub-matrix is shared between several sub-processors, but the transposition only needs to be performed once. Also available are some miscellaneous performance constructs such as the ability to force which core a virtual processor is assigned to.
    1.75 -
    1.76 -A note on terminology: the word ``task''  has acquired multiple  meanings in the literature, making it a vague term. We often use, instead, the term  ``work-unit'', which is defined precisely as the trace-segment performed on a core, between two successive scheduling events, plus the set of datums consumed by that trace segment. The word task often maps well onto this precise definition, and we use both words, but mean work-unit when we say task.
    1.77 -
    1.78 -
    1.79 -After functional debugging, the first run produces the visualization seen in Fig X. This is what we refer to as a scheduling consequence graph. It depicts all the scheduling operations performed by the runtime, along with the consequent usage of the cores. 
    1.80 -
    1.81 -Each column is associated with one core. 
    1.82 -A blue vertical block represents the time the core spends doing the actual work of one work-unit, the height being proportional to the number of cycles taken for execution. Just above each, in light blue, is the runtime overhead spent on that work-unit, which can be broken into pieces representing acquisition of the lock on the shared semantic state, time spent performing the semantics of the parallelism construct, time spent deciding which ready task to execute next, and time spent switching from virtual processor, to the runtime, and back. 
    1.83 -
    1.84 -
    1.85 -
    1.86 -A second visualization, seen in Figure X, depicts the constraints on the scheduling decisions the runtime is allowed to make. 
    1.87 -
    1.88 -The blue blocks are arranged according to the choices the scheduler in the runtime made.  Because they determine the succession of activities on a given core, these choices show causal dependencies between work-units.  
    1.89 -
    1.90 -Many different orderings could also have been validly chosen. Which scheduler choices are valid is determined by three kinds of constraints: the application code constraints, hardware constraints, and runtime implementation imposed constraints. 
    1.91 -
    1.92 -Returning to Fig \ref{story}, the lines in red, orange, and green represent application-code constraints that each tie two work-units together. The color represents the  kind of constraint imposed by one kind of  parallelism construct. Red is creation of a virtual processor, green is many-to-one send\_of\_type\_to, and orange is the singleton construct. For better visibility, only constraints that cross cores are shown.
    1.93 -
    1.94 -The columns represent each core, and the vertical axis represents the progress of the program execution, starting from the top. This gives at a glance the total runtime (height), level of parallelism during the run (number of filled columns at a specific height), and processor utilization (blue-covered area vs. white). All figures are at the same scale, so they can be compared directly.
    1.95 -
    1.96 -The first thing to notice, then, is that the first picture is slimmer than expected: of the 40 available cores, only 13 were being used. As the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection revealed that a closing bracket in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed despite using this application as test program for development of the language runtime, including analyzing performance, for several months.
    1.97 -
    1.98 -The second run (Fig \ref{story:b}) already corresponds much more to the expected execution behaviour. However, there remains a noticeable section at the beginning where only 3 cores have work and the other 37 remain idle. Focusing on core 0, we can see that the task creation VP (short tasks with red edges outgoing) creates work in order of cores, starting with core 0. The core's scheduler operates a simple round-robin between VPs assigned to its core, so the creator VP gets switched out for the newly created work unit quickly. The work tasks take a large amount of time to complete, during which task creation is suspended.
    1.99 -
   1.100 -Two solutions came to mind: distribute work to all other cores first so that they would be busy when the creator VP gets interrupted, or dedicate a core to the creator VP. The first solution has the advantage of preserving performance of the application even when run on a machine with a single-digit number of cores, so we tried it first. This gave us Fig \ref{story:c}. 
   1.101 -
   1.102 -The section with many cores idling at the beginning has disappeared. A small idle period can still be observed between the first and the second set of work tasks, because the work tasks have roughly the same length (some of them are slightly longer because they perform a copy-transpose singleton, and small variations can be caused by cache misses etc.), so the work on core 0 holding up the creator VP, being last to be distributed, is also last to finish. It is also noticeable that in the second set of work units to be distributed, not enough work pieces remain to fill all cores, so that 16 out of 40 remain idle.
   1.103 -
   1.104 -To try to fill that space, we tried to modify the size of the work units. However, as figures \ref{story:d} and \ref{story:e} show, this did not help, as almost immediately the time spent creating the increased number of units becomes the bottleneck, and the time lost between sets grows larger than the time lost on the cores not receiving any work.
   1.105 -
   1.106 -At this point we wanted to try out if taking the road not chosen, dedicating a core to the create VP, would improve performance more.
   1.107 -Going back to version b of the code and implementing this solution instead lead to fig. \ref{story:f}. The delay between the two sets has disappeared, leading to a 4\% shorter execution time.
   1.108 -
   1.109 -As core 0 is now empty after the creation phase at the beginning, we also moved the receive VP there (fig. \ref{story:g}). This added only a minimal improvement at this size of work unit, but allows overlapping the result collection with other work, which is an advantage when cutting up the work into more pieces, requiring longer collection (fig. \ref{story:h}).
   1.110 -
   1.111 -Overall it is also noticeable that as work units become smaller, execution becomes more irregular. Variability in task length is likely due to cache misses or page faults, but for verification of this hypothesis more data would need to be collected (for instance, the "cycles stalled due to cache misses" counter available on most modern Intel chips could be tracked and displayed for each unit). 
   1.112 -
   1.113 -In some places ``holes'' are noticeable. Inspecting these holes closer, we can see that the stalling tasks are waiting upon the completion of a singleton. However, the operations enclosed in the singleton take only a short time, and start much later than the idle periods. Once again, the simple round-robin scheduler is at fault: When the first VP reaches the singleton code portion, it sends a request to acquire the singleton. This request succeeds, but as sending a request results in suspension of the requesting VP, a new task is scheduled. If this happens to be a long task, the singleton is suspended for a long time, but all other VPs with this singleton have to wait, because it has already been reserved. Because several VPs assigned to the same core share the same matrix pieces so as to increase cache locality, this can result in all VPs on a core being stalled, leading to the observed idle times. This is a property of the language runtime, so the application programmer cannot change this, but making the work units smaller helps minimize these effects. 
   1.114 -
   1.115 -%\begin{figure}[ht]
   1.116 -% \begin{minipage}[b]{0.5\linewidth}
   1.117 -%  \centering
   1.118 -%  \includegraphics[width=0.27in, height=6in]{../figures/184.pdf}
   1.119 -%  \caption{default}
   1.120 -%  \label{fig:figure1}
   1.121 -% \end{minipage}
   1.122 -%\hspace{0.5cm}
   1.123 -% \begin{minipage}[b]{0.5\linewidth}
   1.124 -%  \centering
   1.125 -%  \includegraphics[width=1in]{../figures/.pdf}
   1.126 -%  \caption{default}
   1.127 -%  \label{fig:figure2}
   1.128 -% \end{minipage}
   1.129 -%\end{figure}
   1.130 -
   1.131 -
   1.132 -\begin{figure}[b]
   1.133 +\begin{figure*}[t!]
   1.134    \begin{minipage}[b]{0.2\textwidth}
   1.135          \subfloat[Original]
   1.136      {\quad\quad \includegraphics[scale=0.015]{../figures/192.pdf} \quad
   1.137 @@ -571,7 +510,71 @@
   1.138    \end{minipage}
   1.139  \caption{Performance tuning with Scheduling Consequence Graphs}
   1.140  \label{story}
   1.141 -\end{figure}
   1.142 +\end{figure*}
   1.143 +
    1.144 +In this section, we show our approach being used during a typical performance tuning session, to illustrate how its features provide benefit, and how competing tools' lack of those features makes the work more difficult.
   1.145 +
   1.146 +\subsection{Setup}
   1.147 +
    1.148 +In our session, we wish to tune a standard program that the reader has likely already attempted to performance tune, or at least knows well. The best example is likely matrix multiply, with which the reader should be familiar, allowing concentration on the tool without distraction from the application. We run it on a machine with 4 sockets of 10 cores each, for a total of 40 physical cores.
   1.149 +
    1.150 +The application code includes a function that automatically divides the work into a number of tasks, based on the number of cores and a tuning parameter. It distributes the tasks across the cores in a round-robin fashion, and then waits for completion of the calculation before initiating shutdown. The answers produced by the tasks are collected by a ``receiver'' function and accumulated into the result matrix, whereupon the receiver notifies the setup function of completion.
   1.151 +
    1.152 +The language used is SSR, which is based on rendezvous-style send and receive operations made between virtual processors (VPs). It has commands for creating and destroying virtual processors, and three kinds of paired send-receive operations. The first, \emph{send\_from\_to}, specifies a specific sender and a specific receiver. We use it to notify the VP dedicated to dividing the work (the ``creator VP'') that the VP collecting results (the ``results VP'') has received all sub-results, so it can output the final result. The second, \emph{send\_of\_type\_to}, specifies a specific receiver, but the sender is anonymous, which increases flexibility while maintaining some control over scope. This construct is used by the processors multiplying sub-matrices to send their sub-result to the results processor. The third kind, \emph{send\_of\_type}, only specifies the type, and so acts as a global communication channel; it is not used in our application. The language also includes a \emph{singleton} construct that designates a piece of code to be executed only once, which we use to rearrange and copy data to get better cache behavior. A given copy is shared by several virtual processors, but the copy only needs to be performed once. Also available are some miscellaneous performance constructs, such as the ability to force which core a virtual processor is assigned to, which is used in our example program.
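+
+To make the constructs concrete, the sketch below shows, in C-like pseudocode, how the creator and worker VPs might use them. SSR's actual API is not reproduced in this paper, so every identifier here (\texttt{ssr\_create\_vp}, \texttt{ssr\_pin\_vp\_to\_core}, \texttt{ssr\_singleton}, \texttt{ssr\_send\_of\_type\_to}, and the helper functions) is hypothetical and only mirrors the constructs described above.
+
+\begin{verbatim}
+/* Hypothetical sketch -- names do not come from the real SSR API. */
+
+/* Creator VP: divide the multiply into numTasks pieces and pin one
+ * worker VP per piece to a core, round-robin.                      */
+void divide_and_distribute(Matrix *A, Matrix *B, Matrix *C,
+                           int numCores, int piecesPerCore) {
+    int numTasks = numCores * piecesPerCore;
+    for (int t = 0; t < numTasks; t++) {
+        WorkDesc *w = make_work_desc(A, B, C, t, numTasks);
+        VP *worker  = ssr_create_vp(&do_sub_multiply, w);
+        ssr_pin_vp_to_core(worker, t % numCores);   /* explicit placement */
+    }
+    /* block until the results VP says all sub-results have arrived */
+    ssr_receive_from_to(resultsVP, thisVP);
+}
+
+/* Worker VP: transpose-copy the shared sub-matrix once, multiply,
+ * then send the sub-result to the results VP.                      */
+void do_sub_multiply(WorkDesc *w) {
+    ssr_singleton(w->transposeID, &transpose_copy, w); /* runs only once */
+    SubResult *r = multiply_piece(w);
+    ssr_send_of_type_to(RESULT_TYPE, r, resultsVP);    /* many-to-one send */
+}
+\end{verbatim}
+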
   1.153 +
   1.154 +A note on terminology: the word ``task''  has acquired multiple  meanings in the literature, making it a vague term. We often use, instead, the term  ``work-unit'', which is defined precisely as the trace-segment performed on a core, between two successive scheduling events, plus the set of datums consumed by that trace segment. The word task often maps well onto this precise definition, and we use both words, but mean the precise work-unit definition when we say task.
   1.155 +
    1.156 +A second visualization, seen in Figure X, depicts the constraints on the scheduling decisions, which limit the choices the runtime is allowed to make. We call this the Unit \& Constraint Collection, or UCC. It shows only application-derived information. 
   1.157 +
   1.158 +\subsection{Walk through}
   1.159 +After functional debugging, the first run produces the visualization seen in Fig X. This is what we refer to as a scheduling consequence graph. It depicts all the scheduling operations performed by the runtime, along with the consequent usage of the cores. 
   1.160 +
   1.161 +Each column is associated with one core. 
    1.162 +A blue vertical block represents the time the core spends doing the actual work of one work-unit, the height being proportional to the number of cycles taken for execution. Just above each, in light blue, is the runtime overhead spent on that work-unit, which can be broken into pieces representing acquisition of the lock on the shared semantic state, time spent performing the semantics of the parallelism construct, time spent deciding which ready task to execute next, and time spent switching from the virtual processor to the runtime and back. 
   1.163 + 
   1.164 +
   1.165 +The blue blocks are arranged according to the choices the scheduler in the runtime made.  Because they determine the succession of activities on a given core, these choices show causal dependencies between work-units.  
   1.166 +
    1.167 +Many different orderings could also have been validly chosen. Which scheduler choices are valid is determined by three kinds of constraints: application-code constraints, hardware constraints, and constraints imposed by the runtime implementation. 
   1.168 +
   1.169 +Returning to Fig \ref{story}, the lines in red, orange, and green represent application-code constraints that each tie two work-units together. The color represents the  kind of constraint imposed by one kind of  parallelism construct. Red is creation of a virtual processor, green is many-to-one send\_of\_type\_to, and orange is the singleton construct. For better visibility, only constraints that cross cores are shown.
   1.170 +
   1.171 +The columns represent each core, and the vertical axis represents the progress of the program execution, starting from the top. This gives at a glance the total runtime (height), level of parallelism during the run (number of filled columns at a specific height), and processor utilization (blue-covered area vs. white). All figures are at the same scale, so they can be compared directly.
   1.172 +
    1.173 +The first thing to notice, then, is that the first picture is slimmer than expected: of the 40 available cores, only 13 were being used. As the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection revealed that a closing bracket in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed despite this application being used as a test program for development of the language runtime, including analyzing its performance, for several months.
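+
+The actual buggy source is not reproduced here, so the following is only a hypothetical illustration of the failure class: a closing brace placed outside the loop body instead of inside it, so that the statement advancing the target core runs only once and the work silently lands on far fewer cores than intended.
+
+\begin{verbatim}
+#include <stdio.h>
+
+#define NUM_CORES 40
+#define NUM_TASKS 80
+
+int main(void) {
+    int tasks_on_core[NUM_CORES] = {0};
+    int core = 0;
+
+    /* Buggy round-robin: the brace closes the loop before the core
+     * index is advanced, so the advance runs once, after the loop,
+     * and every task piles onto the same core.                     */
+    for (int t = 0; t < NUM_TASKS; t++) {
+        tasks_on_core[core] += 1;
+    }
+    core = (core + 1) % NUM_CORES;  /* was meant to sit inside the loop */
+
+    int used = 0;
+    for (int c = 0; c < NUM_CORES; c++)
+        if (tasks_on_core[c] > 0) used++;
+    printf("cores receiving work: %d of %d\n", used, NUM_CORES);
+    return 0;
+}
+\end{verbatim}
+
+Moving the brace so that the core-index update sits inside the loop restores the intended round-robin distribution; the consequence graph makes the narrow footprint of such a bug visible at a glance.
+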
   1.174 +
    1.175 +The second run (Fig \ref{story:b}) already corresponds much more closely to the expected execution behaviour. However, there remains a noticeable section at the beginning where only 3 cores have work and the other 37 remain idle. Focusing on core 0, we can see that the task-creation VP (short tasks with red outgoing edges) creates work in order of cores, starting with core 0. The core's scheduler operates a simple round-robin between the VPs assigned to its core, so the creator VP quickly gets switched out in favor of the newly created work unit. The work tasks take a large amount of time to complete, during which task creation is suspended.
   1.176 +
   1.177 +Two solutions came to mind: distribute work to all other cores first so that they would be busy when the creator VP gets interrupted, or dedicate a core to the creator VP. The first solution has the advantage of preserving performance of the application even when run on a machine with a single-digit number of cores, so we tried it first. This gave us Fig \ref{story:c}. 
   1.178 +
    1.179 +The section with many cores idling at the beginning has disappeared. A small idle period can still be observed between the first and the second set of work tasks: the work tasks have roughly the same length (some are slightly longer because they perform a copy-transpose singleton, and small variations can be caused by cache misses, etc.), so the work unit on core 0 that holds up the creator VP, being the last to be distributed, is also the last to finish. It is also noticeable that in the second set of work units to be distributed, not enough work pieces remain to fill all cores, so 16 out of 40 remain idle.
   1.180 +
    1.181 +To try to fill that space, we tried modifying the size of the work units. However, as Figures \ref{story:d} and \ref{story:e} show, this did not help: almost immediately, the time spent creating the increased number of units became the bottleneck, and the time lost between sets grew larger than the time lost on the cores not receiving any work.
   1.182 +
    1.183 +At this point we wanted to find out whether taking the road not chosen, dedicating a core to the creator VP, would improve performance more.
    1.184 +Going back to version b of the code and implementing this solution instead led to Fig. \ref{story:f}. The delay between the two sets has disappeared, leading to a 4\% shorter execution time.
   1.185 +
   1.186 +As core 0 is now empty after the creation phase at the beginning, we also moved the receive VP there (fig. \ref{story:g}). This added only a minimal improvement at this size of work unit, but allows overlapping the result collection with other work, which is an advantage when cutting up the work into more pieces, requiring longer collection (fig. \ref{story:h}).
   1.187 +
    1.188 +Overall it is also noticeable that as work units become smaller, execution becomes more irregular. Variability in task length is likely due to cache misses or page faults, but verifying this hypothesis would require collecting more data (for instance, the ``cycles stalled due to cache misses'' counter available on most modern Intel chips could be tracked and displayed for each unit). 
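+
+As a sketch of what such per-unit collection could look like on Linux (one possible mechanism, not the instrumentation our tool actually uses), the generic back-end stall-cycle counter, which includes stalls due to cache misses, can be read around a work-unit via \texttt{perf\_event\_open}:
+
+\begin{verbatim}
+#include <linux/perf_event.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <unistd.h>
+#include <string.h>
+#include <stdint.h>
+#include <stdio.h>
+
+int main(void) {
+    struct perf_event_attr attr;
+    memset(&attr, 0, sizeof(attr));
+    attr.type           = PERF_TYPE_HARDWARE;
+    attr.size           = sizeof(attr);
+    attr.config         = PERF_COUNT_HW_STALLED_CYCLES_BACKEND;
+    attr.disabled       = 1;
+    attr.exclude_kernel = 1;
+
+    /* pid = 0, cpu = -1: count for the calling thread on any core */
+    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
+    if (fd < 0) { perror("perf_event_open"); return 1; }
+
+    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
+    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
+
+    /* ... execute one work-unit here ... */
+
+    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
+    uint64_t stalled = 0;
+    if (read(fd, &stalled, sizeof(stalled)) != (ssize_t)sizeof(stalled))
+        perror("read");
+    printf("back-end stall cycles: %llu\n", (unsigned long long)stalled);
+    close(fd);
+    return 0;
+}
+\end{verbatim}
+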
   1.189 +
    1.190 +In some places, ``holes'' are noticeable. Inspecting these holes more closely, we can see that the stalling tasks are waiting on the completion of a singleton. However, the operations enclosed in the singleton take only a short time, and start much later than the idle periods begin. Once again, the simple round-robin scheduler is at fault: when the first VP reaches the singleton code portion, it sends a request to acquire the singleton. This request succeeds, but because sending a request suspends the requesting VP, a new task is scheduled. If this happens to be a long task, execution of the singleton is delayed for a long time, yet all other VPs needing this singleton have to wait, because it has already been reserved. Because several VPs assigned to the same core share the same matrix pieces so as to increase cache locality, this can result in all VPs on a core being stalled, leading to the observed idle times. This is a property of the language runtime, so the application programmer cannot change it, but making the work units smaller helps minimize these effects. 
   1.191 +
   1.192 +%\begin{figure}[ht]
   1.193 +% \begin{minipage}[b]{0.5\linewidth}
   1.194 +%  \centering
   1.195 +%  \includegraphics[width=0.27in, height=6in]{../figures/184.pdf}
   1.196 +%  \caption{default}
   1.197 +%  \label{fig:figure1}
   1.198 +% \end{minipage}
   1.199 +%\hspace{0.5cm}
   1.200 +% \begin{minipage}[b]{0.5\linewidth}
   1.201 +%  \centering
   1.202 +%  \includegraphics[width=1in]{../figures/.pdf}
   1.203 +%  \caption{default}
   1.204 +%  \label{fig:figure2}
   1.205 +% \end{minipage}
   1.206 +%\end{figure}
   1.207  
   1.208  
   1.209  %\begin{figure*}