### changeset 87:bc83d94128d0

perf tune: tweaks
author Nina Engelhardt Wed, 15 Aug 2012 16:58:30 +0200 6ddc44f95303 bb5df2b662dd 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 2 files changed, 9 insertions(+), 38 deletions(-) [+]
line diff
     1.1 Binary file 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf has changed

     2.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Wed Aug 15 00:29:29 2012 -0700
2.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Wed Aug 15 16:58:30 2012 +0200
2.3 @@ -82,7 +82,7 @@
2.4  \section{Introduction}
2.5  \label{sec:intro}
2.6
2.7 -Performance visualization and tuning tools for parallel programs are critical to achieving good performance, and yet the experience of using current solutions leaves something to be desired. Often, performance tuning requires interpreting abstract views of usage or statistics, trying to \emph{guess} the cause of performance loss. Despite many different views of performance measurements, such as frequency by line of code, message sends and receives, and core usage timelines, the user doesn't know why a function runs in a particular spot on a particular core, nor whether that is desired behavior or erroneous behavior.
2.8 +Performance visualization and tuning tools for parallel programs are critical to achieving good performance, and yet the experience of using current solutions leaves something to be desired. Often, performance tuning requires interpreting abstract views of usage or statistics, trying to \emph{guess} the cause of performance loss. Despite many different views of performance measurements, such as frequency by line of code, message sends and receives, and core usage timelines, the user does not know why a function runs in a particular spot on a particular core, nor whether that is desired behavior or erroneous behavior.
2.9
2.10  Examining these myriad views can feel like being advised by the fabled six blind men examining an elephant by touching the part nearest to each\cite{PerfToolPoem}: one, reaching the side, compares it to a wall; the second, feeling the tusk, likens it to a spear; the next, touching the trunk, is reminded of a snake, and so on. All of these views are indeed correct and supply important information, but they don't connect to each other, to provide a view of how the parts interact to form a whole.
2.11
2.12 @@ -90,7 +90,7 @@
2.13
2.14  Our approach provides a more complete picture by following a new model of parallel computation. The model suggests that the relevant application features are the units of work performed and the  constraints on scheduling them, and that it is crucial to record and display the scheduling decisions taken by the runtime. The model simplifies hypothesis generation by connecting the measurements to each other and to the application, analogous to providing a view of the whole elephant.
2.15
2.16 -Our instrumentation, based on the model, detects the  units within the application code when they are scheduled during a run, and visually links each to the units upon which its execution is conditional. The visual arrangement makes identifying idle cores immediate, enables quickly generating the correct hypotheses for the causes of the performance losses, and quickly pinpointing where in the application code changes can be made to improve performance. The effect is like having a map showing idle cores linked back to the spots in the application code responsible.
2.17 +Our instrumentation, based on the model, detects the  units within the application code when they are scheduled during a run, and visually links each to the units upon which its execution is conditional. The visual arrangement makes identifying idle cores immediate, enables quickly generating the correct hypotheses for the causes of the performance losses, and quickly pinpointing where in the application code changes can be made to improve performance. In effect, the visualization serves as a map showing idle cores linked back to the section of application code responsible.
2.18
2.19  The model of computation and our visual approach to using it is illustrated with a case study of performance tuning a standard parallel application on a large multi-core system.
2.20
2.21 @@ -173,12 +173,9 @@
2.22
2.23  The second visualization we call  the scheduling consequence graph (SCG), or just consequence graph (CG). It depicts the scheduling operations performed by the runtime, and the consequent usage of the cores.
2.24
2.25 -
2.26 -
2.27  The UCC shows only application-derived information, as opposed to the consequence graph, which combines the \textit{use} of the UCC-depicted constraints along with runtime-imposed dependencies and hardware-imposed constraints.  Hence, the UCC states the degrees of freedom enabled by the application, while the consequence graph states how those were made use of, by a particular runtime on particular hardware.
2.28
2.29  \subsubsection{UCC visualization}
2.30 -
2.31  \begin{figure}[ht]
2.32    \centering
2.33    \includegraphics[width = 3.2in, height = 1.8in]{../figures/UCC_stylized_for_expl.pdf}
2.34 @@ -186,15 +183,12 @@
2.35    \label{fig:UCC_expl}
2.36  \end{figure}
2.37
2.38 -
2.39  Fig. \ref{fig:UCC_expl} shows two versions of a UCC, both stylized for purposes of explanation.  These are used less often than the SCG,  mainly as a helper when brainstorming ways to get more usable parallelism out of an application.
2.40
2.41  Both versions of the UCC depict constraints on units. Starting  at the top, they  progress downwards,  placing units that constrain  above those constrained by them, and connecting them by a line representing the constraint. A unit is placed into the row just below the last of  the units constraining it, in the first available column. However, the rows and columns are just visual conveniences; the only information in the UCC is the lines between units, which represent constraints among the units. In practice, each unit is also labeled with a unique unit ID, from which the code of the unit and the parameters passed to it can be looked up.
2.42
2.43  What distinguishes the two types of UCC is the size of the units. The left one shows only the logical graph-like structure of the application, while the right weights the height of a unit by the number of instructions executed in it. When information about unit size is available, it can visually indicate the \textit{useable} parallelism available.
2.44
2.45 -
2.46 -
2.47  \subsubsection{SCG visualization}
2.48  \begin{figure}[ht]
2.49    \centering
2.50 @@ -203,8 +197,6 @@
2.51    \label{fig:SCG_expl}
2.52  \end{figure}
2.53
2.54 -
2.55 -
2.56  Fig. \ref{fig:SCG_expl} shows a consequence graph, stylized for purposes of explanation. This is the main visualization used during performance tuning, because it shows the consequent uses of the cores in the machine, and links those to the constraints and code in the application.
2.57
2.58  It is composed of a number of columns, one for each core. A column represents time on the core,  increasing as one goes down, measured in clock cycles. It is broken into blocks, each representing the time accounted to one work-unit. Each block is further divided into regions, each a different color, which indicates the kind of activity the core was engaged in during that  region's time-span.
2.59 @@ -214,16 +206,12 @@
2.60  The kinds of activities within a block are defined by the computation model that underlies the visualization. The first kind of activity is the actual work, plus  waiting for  cache misses. It is represented by a blue-to-red region where the color indicates  intensity of cache misses, with pure red representing at or above the maximum misses per instruction, and pure blue the minimum (the max and min are set in the tool that generates the visualization).
2.61
2.62    The second kind of activity is runtime overhead, represented by a gray region. This is the overhead spent on that particular work-unit. When desired by the user, it is further broken into pieces representing activities inside the runtime. The options include time spent on: constraints, when determining readiness of the work-unit; deciding which ready unit to assign to which hardware; and time spent switching from virtual processor, to the runtime, and back. In this paper, we show all runtime overhead lumped together, however in other circumstances a breakdown can be key to understanding interaction between runtime and application.
2.63 -
2.64
2.65  The other type of visual feature seen in Fig. \ref{fig:SCG_expl} is lines. Each represents a construct that influenced scheduling, where the color indicates what kind of construct.   A line represents two things: a constraint, whose satisfaction made the lower unit ready, and a decision by the runtime to start the lower unit on that core.
2.66
2.67  In general, lines may also be drawn that represent other kinds of interactions, which affect core usage. For example,  our runtime implementation only allows one core at a time to access shared  scheduling state. Visualization of this can be turned on, as additional lines linking the gray runtime regions of blocks (visualization of such interactions is turned off in this paper).
2.68
2.69  Two work-unit blocks   that appear in sequence and have no lines drawn to them often have a causal dependency between them, due to the semantics of the base language. Visualization of these dependencies is also turned off, but can be inferred via the link to the code.
2.70 -
2.71 -
2.72 -
2.73
2.74  Note that  many different orderings can be validly chosen. Which scheduler choices are valid is determined by three kinds of constraints: the application code constraints, hardware constraints, and runtime implementation imposed constraints.
2.75
2.76 @@ -281,7 +269,6 @@
2.77
2.78  The scheduling consequence graphs that display the measurements collected on various runs during tuning are displayed in Fig. \ref{story}. All use the same scale, for direct comparison. They have 40 columns, one for each core, and relative height indicates relative execution time. The lines in red, orange, and green represent application-code constructs. Red is creation of a virtual processor, green is the many-to-one \texttt{send\_of\_type\_to}, and orange is the singleton construct. For better visibility, only constraints that cross cores are shown.
2.79
2.80 -
2.81  After functional debugging, the first tuning run produces the consequence graph seen in Fig. \ref{fig:story:a}. The first thing to notice is that it is slimmer than expected: of the 40 available cores, only 13 are being used. Because the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection reveals that a closing curly brace in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed despite using this application as a test program for development of the language runtime (while using traditional thread-based performance tools).
2.82
2.83  \subsubsection{Second Run}
2.84 @@ -302,7 +289,7 @@
2.85  \subsubsection{Sixth run} At this point we wanted to try the road not chosen, dedicating a core to the creation VP.
2.86  Going back to version b of the code and implementing this solution, instead, leads to Fig. \ref{story:f}. The blank area between the two sets has disappeared, showing a 4\% shorter execution time.
2.87
2.88 -\subsubsection{Seventh and eighth runs}Core 0 is now empty after the creation phase at the beginning, so we  moved the receive VP there (Fig. \ref{story:g}). This  only minimally improved the time, at this size of work unit, but allows overlapping the result collection with other work, which is an advantage when cutting  the work into more pieces, requiring longer collection as in Fig. \ref{story:h}.
2.89 +\subsubsection{Seventh and eighth runs}Core 0 is now empty after the creation phase at the beginning, so we  moved the receive VP there (Fig. \ref{story:g}). This only minimally improved the time, at this size of work unit, but allows overlapping the result collection with other work, which is an advantage when cutting  the work into more pieces, requiring longer collection as in Fig. \ref{story:h}.
2.90
2.91  Overall it is also noticeable that as work units become smaller, execution time becomes more irregular. Variability in work length correlates with the color, indicating cache behavior has worsened with smaller work size.
2.92
2.93 @@ -312,14 +299,9 @@
2.94
2.95  In Fig \ref{story:d}, holes'' are noticeable. Inspecting these holes closer, we can see that the stalled blocks are at the end of orange lines. This indicates they are waiting upon the completion of a singleton. The pattern of blocks shows that usually  the singleton unit runs before the work unit, but in these cases the singleton code was delayed until after the work on that core.  This is a runtime implementation fluke. The only thing an application programmer can do is change the work size to minimize the impact. (For those curious, the first VP to reach the singleton is granted control, but a ready work VP lands in the queue during the granting activity, so when the runtime finishes granting, the work VP is next, and the VP that now owns the singleton sits and waits for the work to end. All work VPs on other cores that pass through the same singleton  also wait.)
2.96
2.97 -
2.98 -
2.99 -
2.100 -
2.101  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2.102  \section{The Model Behind the Visualization}
2.103  \label{sec:theory}
2.104 -
2.105  Now that the usage has been seen, we expand on the model behind the visualizations. The model ties the information together, and understanding it helps in generating  hypotheses from the visualization features.
2.106
2.107  As seen, the model has two parts, a \emph{Unit \&\ Constraint Collection (UCC)}, and a \emph{Scheduling Consequence Graph} (SCG or just consequence graph,  CG).  The UCC indicates the scheduling  choices  the application allows, and so shows what the programmer has control over. The consequence graph says which of those were actually taken during the run and the consequences of that set of choices.
2.108 @@ -340,7 +322,6 @@
2.109
2.110      We call a fully specified UCC a \emph{concrete} UCC.  Every run of an application eventually winds up defining a concrete UCC, such as produced for the performance tuning, as seen  in Fig. \ref{fig:UCC_example}. In it, every unit that was scheduled in the SCG appears,  along with  the application-defined constraints on scheduling them.  For this application, parameters determined the work  division, and so determined the units. Hence, the application alone does not specify the concrete UCC, because the units remain unknown until parameter values are given. In general, the amount of UCC made concrete by the application alone falls into a two-dimensional grid. One dimension covers the units, the other the constraints.
2.111
2.112 -
2.113  \begin{figure}[ht]
2.114    \centering
2.115    \includegraphics[width = 2in, height = 1.8in]{../figures/UCC_concreteness_grid.pdf}
2.116 @@ -362,7 +343,6 @@
2.117  The progression of UCCs generated during the life-line indicate what entity has the control over which units and constraints appear in the final concrete UCC. Viewing the progression gives insight into what is inside the application programmer's control vs under control of each tool in the toolchain or  the runtime. For example, the original application-derived UCC  indicates what can be done statically: the further out on the diagonal that UCC is, the less scheduling can be done statically in the toolchain.
2.118
2.119  In this paper, we do not suggest how to represent UCCs far out on the diagonal. One of those actually indicates a multi-verse of concrete-UCCs. Which of them materializes  depends on the data and what the scheduler decides. We only represent the concrete UCC that materializes during a run and leave the question of representing less concrete ones to future work.
2.120 -
2.121
2.122  \subsection{Scheduling Consequence Graph}
2.123  Whereas the UCC concentrates on application-derived information, the second part of the model adds-in effects of runtime  details and hardware. We called it the Scheduling Consequence Graph because it links scheduling decisions to their performance consequences. But it also indicates the role, in the decision, of application, runtime and hardware details.
2.124 @@ -394,12 +374,10 @@
2.125
2.126  Consequence graph features each tie back to features in the UCC and thence to specific segments of code or constructs.
2.127
2.128 -
2.129  \subsection{Levels of UCC and Consequence Graph}
2.130  There is one last twist to the story of UCCs and consequence graphs, which is that there are levels of them that correspond to the levels of scheduler in a hierarchical machine. We use an example involving a server machine with a hierarchy of runtimes to illustrate both, concentrating first on just the UCCs, then adding the consequence graph.
2.131
2.132  \subsubsection{Levels of UCC}
2.133 -
2.134  For the example, consider  a server with one rack,  having a back-plane that boards plug into. A board has its own memory with four sockets, each having a chip with four cores. So there is a back-plane network connecting the boards,  a bus on each board that connects the sockets to the DRAM, and inside the chip in each socket is a cache hierarchy that connects the cores.
2.135
2.136  The hardware is given a set of runtimes to match the hierarchy. Each network or bus has a runtime that schedules work onto the things connected below it. So the top runtime divides work among the boards, while a board's  runtime  divides work among the sockets, and a socket's  runtime  divides work among the cores.
2.137 @@ -408,9 +386,7 @@
2.138
2.139  The application in this example has been written in a language that allows work to be divided. The toolchain inserted a manipulator that allows each runtime to divide up the work it is given into smaller work-units, such as via the DKU pattern \cite{DKUTechRep_09}. This pushed the UCC of the application all the way to the right on the unit axis.
2.140
2.141 -So what does the concrete UCC produced during a run look like? Well, a unit is defined as the work resulting from one scheduling decision. Each runtime has its own scheduler, which means units are defined for each runtime.  That in turn means that each runtime has its own concrete UCC.
2.142 -
2.143 -
2.144 +So what does the concrete UCC produced during a run look like? Well, a unit is defined as the work resulting from one scheduling decision. Each runtime has its own scheduler, which means units are defined for each runtime.  That in turn means that each runtime has its own concrete UCC.
2.145  \begin{figure}[ht]
2.146    \centering
2.147    \includegraphics[width = 3.2in, height = 1.8in]{../figures/UCC_Levels_2.pdf}
2.148 @@ -423,14 +399,12 @@
2.149   %Great, that makes sense, now what about the consequence graphs?
2.150
2.151  \subsubsection{Levels of Consequence Graph}
2.152 -
2.153  With multiple levels of scheduling activities, a scheduling consequence graph can be constructed for each level.
2.154
2.155  For a given level, a lower-level runtime is seen as a single processor entity. So, what is the consequence of deciding to schedule a given work unit onto one of those  processor entities? When the goal is  improving performance, the consequence of interest is the time taken until completion of the work unit.
2.156
2.157  Hence, the question becomes,  when can the upper-level work unit be considered completed? The answer is, when all the lower-level units of which it is comprised are completed, plus any additional work spent dividing up the work and combining results. Once the lower-level consequence graph is established, this  time can be easily determined: it is the critical path across the lower-level consequence graph plus the additional, non-overlapped, time spent creating the lower-level units and combining results into the format of the higher-level unit. The divide and recombine are the creation of    the lower level, and so not part of it.
2.158
2.159 -
2.160  Consider the concrete example of the SSR matrix multiply application from Section \ref{sec:casestudy}. To start we go up a level, above the process created to execute the matrix multiply. Here, a unit is the execution of an entire application, invoked via OS commands. The  work-time of that unit is the execution time of the application, from the first instruction of the application code until the last. This time equals  the critical path through the matrix multiply work, plus creation of the various VPs, and collection of the results by the results VP.
2.161
2.162  If the case study had been performed in hierarchical fashion, then the SCG seen there would be divided between levels, with creation and results accumulation  moved to the higher application-level SCG. The lower-level would only contain the work of the multiply.
2.163 @@ -438,10 +412,6 @@
2.164  One twist is that the divide and recombine are themselves composed of  smaller work-units.  However, they are not visible to units in the work portion, although there are causalities between them. The divide units end with creation of work-level units, while the end of work-level units causes creation of recombine units.
2.165  Hence, divide and recombine are  inside of  the higher-level unit while not part of the lower-level SCG. They can be conveniently displayed in a single visualization because of the clean connection between levels provided by the create and end causalities.
2.166
2.167 -
2.168 -
2.169 -
2.170 -
2.171  To complete the discussion, we consider going down another level, into the physical processor.  In modern super-scalar multi-cores, each core is a parallel processor composed of functional blocks such as arithmetic or floating-point blocks, along with logic that issues instructions to those blocks. We first break down a single  SSR work unit of the kind seen in Section \ref{sec:casestudy}, into smaller work units, namely individual instructions. Then we break instructions down into pipeline stages.
2.172
2.173  The shape of the UCC for the instructions  inside a single SSR level work-unit is a straight line of sequential dependencies. However, the pipeline stages inside a core modify the UCC. They analyse the register use dependencies between instructions and generate a new UCC consisting of  functions constrained by the data dependencies. The dependencies are enforced by the issue logic, which  goes on to assign ready functions to the functional blocks.  The choices made by the issue logic  determine the shape of the SCG for this level, and the critical path of that SCG is the work time of the higher, SSR level, work-unit.
2.174 @@ -592,12 +562,13 @@
2.175  \section{Conclusion}
2.176  \label{sec:conclusion}
2.177
2.178 -We have shown how to apply a new model of parallel computation, to  instrument any language's runtime to collect measurements, and  display them effectively. The result is like having a map showing idle cores linked back to the spots in  application code responsible.
2.179 +We have shown how to apply a generalized model of parallel computation to build adaptable performance visualizations, relying only on information collected through instrumenting the language runtime, with no modification to the application.
2.180 +The approach is demonstrated through the case study of instrumenting the SSR message-passing language runtime and using it to tune a simple parallel matrix multiply.
2.181
2.182 -A case study related the usability improvements in ease of identifying instances of performance loss (via the Scheduling Consequence Graph, SCG), verifying that better performance is possible (via parallelism available in the Unit and Constraint Collection, UCC), connecting  loss to the causes (via SCG) no matter if cause is application structure,  runtime implementation, or hardware feature, and  suggesting changes to the application to improve  performance.
2.183 +The resulting visualizations show that the focus on the parallelism-relevant concepts of work units and constraints on their execution allows a clearer view of parallelism-specific issues.
2.184 +By integrating visual display of constraints stemming from application structure, language runtime implementation, or hardware features, the various possible causes for performance loss are covered. A flexible filtering system for different types of constraints avoids overcharging the display.
2.185
2.186 -  The benefits seen include low effort because the instrumentation is in the runtime, not the application, having all the  possible causes for performance loss covered, having a flexible filtering system for different types of constraints to avoid overcharging the display, and
2.187 - effectiveness for high-level'' parallel languages that closely match application concepts instead of hardware concepts.
2.188 +As the approach relies on information available to the runtime, we expect that even better results will be observed for high-level'' parallel languages that more closely match application concepts instead of hardware concepts.
2.189
2.190  \bibliography{bib_for_papers_12_Jy_15}
2.191