### changeset 86:6ddc44f95303

perf tuning -- new conclusion and little changes to intro
author Sean Halle Wed, 15 Aug 2012 00:29:29 -0700 972bdaea83b7 bc83d94128d0 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 1 files changed, 17 insertions(+), 17 deletions(-) [+]
line diff
     1.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Tue Aug 14 19:53:43 2012 +0200
1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Wed Aug 15 00:29:29 2012 -0700
1.3 @@ -86,11 +86,11 @@
1.4
1.5  Examining these myriad views can feel like being advised by the fabled six blind men examining an elephant by touching the part nearest to each\cite{PerfToolPoem}: one, reaching the side, compares it to a wall; the second, feeling the tusk, likens it to a spear; the next, touching the trunk, is reminded of a snake, and so on. All of these views are indeed correct and supply important information, but they don't connect to each other, to provide a view of how the parts interact to form a whole.
1.6
1.7 -Likewise, current tools may cover all the parts of the application code, but they do not always adequately connect their observations to the runtime behavior, scheduling decisions, and consequent hardware behavior. The decision about which task or thread is assigned to which core at what point in time is at the heart of parallel performance, so these choices need to be highlighted and connected to the application features influencing them.
1.8 +Likewise, current tools may cover all the parts of the application code, but they often do not adequately connect their observations to the runtime behavior, scheduling decisions, and consequent hardware behavior. The decision about which task or thread is assigned to which core at what point in time is at the heart of parallel performance, so these choices need to be displayed and connected to the application features influencing them.
1.9
1.10 -Our approach provides a more complete picture by following a model of parallel computation that suggests what information to gather and provide to the user. According to the model, the relevant application features are work units and the execution constraints, and it is crucial to record and present the scheduling decisions taken by the runtime.
1.11 +Our approach provides a more complete picture by following a new model of parallel computation. The model suggests that the relevant application features are the units of work performed and the constraints on scheduling them, and that it is crucial to record and display the scheduling decisions taken by the runtime. The model simplifies hypothesis generation by connecting the measurements to each other and to the application, analogous to providing a view of the whole elephant.
1.12
1.13 -Our instrumentation detects the schedulable units in the application code, and visually links them to the units upon which its execution is conditional. The visual arrangement makes identifying idle cores immediate, enables quickly generating the correct hypotheses for the causes of the performance losses, and pinpointing where in the application code changes can be made to improve performance.
1.14 +Our instrumentation, based on the model, detects the units within the application code as they are scheduled during a run, and visually links each to the units upon which its execution is conditional. The visual arrangement makes identifying idle cores immediate, enables quickly generating the correct hypotheses for the causes of performance loss, and pinpoints where in the application code changes can be made to improve performance. The effect is like having a map showing idle cores linked back to the spots in the application code responsible.
1.15
1.16  The model of computation and our visual approach to using it is illustrated with a case study of performance tuning a standard parallel application on a large multi-core system.
1.17
1.18 @@ -282,7 +282,7 @@
1.19  The scheduling consequence graphs displaying the measurements collected on various runs during tuning are shown in Fig. \ref{story}. All use the same scale, for direct comparison. They have 40 columns, one for each core, and relative height indicates relative execution time. The lines in red, orange, and green represent application-code constructs. Red is creation of a virtual processor, green is the many-to-one \texttt{send\_of\_type\_to}, and orange is the singleton construct. For better visibility, only constraints that cross cores are shown.
1.20
1.21
1.22 -After functional debugging, the first tuning run produces the consequence graph seen in Fig. \ref{fig:story:a}. The first thing to notice is that it is slimmer than expected: of the 40 available cores, only 13 are being used. Because the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection reveals that a closing curly brace in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed despite using this application as a test program for development of the language runtime, including performance, for several months.
1.23 +After functional debugging, the first tuning run produces the consequence graph seen in Fig. \ref{fig:story:a}. The first thing to notice is that it is slimmer than expected: of the 40 available cores, only 13 are being used. Because the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection reveals that a closing curly brace in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed despite using this application as a test program for development of the language runtime (while using traditional thread-based performance tools).
1.24
1.25  \subsubsection{Second Run}
1.26   After fixing this, the next run (Fig. \ref{story:b}) corresponds much more closely to the expected execution behaviour. However, there remains a noticeable section at the beginning where only 3 cores have work and the other 37 remain idle.
1.27 @@ -338,19 +338,19 @@
1.28    \label{fig:UCC_example}
1.29  \end{figure}
1.30
1.31 -    We call a fully specified UCC a \emph{concrete} UCC.  Every run of an application eventually winds up defining a concrete UCC, such as produced for the performance tuning, seen  in Fig. \ref{fig:UCC_example}. In it, every unit scheduled in the SCG appears,  along with  the application-defined constraints on scheduling them.  For this application, parameters determine how the work is divided, and so determine the units. Hence, the application alone does not specify the concrete UCC, because the parameter values must be known before the units can be determined. In general, the amount of UCC made concrete by the application alone falls into a two-dimensional grid. One dimension covers the units, the other the constraints.
1.32 +    We call a fully specified UCC a \emph{concrete} UCC. Every run of an application eventually winds up defining a concrete UCC, such as the one produced during the performance tuning, seen in Fig. \ref{fig:UCC_example}. In it, every unit that was scheduled in the SCG appears, along with the application-defined constraints on scheduling them. For this application, parameters determined the work division, and so determined the units. Hence, the application alone does not specify the concrete UCC, because the units remain unknown until parameter values are given. In general, the degree to which a UCC is made concrete by the application alone falls onto a two-dimensional grid. One dimension covers the units, the other the constraints.
1.33
1.34
1.35  \begin{figure}[ht]
1.36    \centering
1.37    \includegraphics[width = 2in, height = 1.8in]{../figures/UCC_concreteness_grid.pdf}
1.38 -  \caption{Abstract representation of the kinds of UCC possible. The letters A, B, C, D stand for UCCs described in the text.}
1.39 +  \caption{Abstract representation of the degree of concreteness of a UCC. The letters A, B, C, D stand for UCCs described in the text.}
1.40    \label{fig:UCC_Concreteness}
1.41  \end{figure}
1.42
1.43 -Fig. \ref{fig:UCC_Concreteness} shows the two axes and the four sets of additional information that have to be added to the UCC in order to determine the units and constraints in the final fully concrete UCC. The position an application-derived UCC, or later in the life-line UCC, lands on the grid indicates how far it is from being fully concrete.  The horizontal   indicates what inputs are still needed to determine the units, and vertical the constraints.  0 indicates that the units (constraints) are fully determined by the application code alone; 1 means parameter values also must be known; 2 means input data values also play a role, and 3 means the units (constraints) can only become known  after runtime scheduling decisions have been made.
1.44 +Fig. \ref{fig:UCC_Concreteness} shows the two axes and the four sets of additional information that have to be added to the UCC in order to determine the units and constraints in the final fully concrete UCC. The position where an application-derived UCC, or later in the life-line a UCC, lands on the grid indicates how far it is from being fully concrete. The horizontal position indicates what inputs are still needed to determine the units, the vertical position the constraints. 0 indicates that the units (constraints) are fully determined by the application code alone; 1 means parameter values also must be known; 2 means input data values also play a role; and 3 means the units (constraints) can only become known after runtime scheduling decisions have been made.
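The grid just described can be sketched as a small data structure. This is purely illustrative; the paper defines no such class, so the names and the level encoding here are our assumptions, chosen to match the 0–3 axis values in the text.

```python
# Hypothetical sketch of the concreteness grid; level meanings follow the text.
LEVELS = {0: "application code alone",
          1: "parameter values also needed",
          2: "input data values also needed",
          3: "runtime scheduling decisions needed"}

class UCCPosition:
    """A UCC's position on the two-dimensional concreteness grid."""
    def __init__(self, units_level, constraints_level):
        assert units_level in LEVELS and constraints_level in LEVELS
        self.units_level = units_level              # horizontal axis
        self.constraints_level = constraints_level  # vertical axis

    def is_concrete(self):
        # A fully concrete UCC sits at the origin: no extra inputs needed.
        return (self.units_level, self.constraints_level) == (0, 0)

# The examples from the text:
A = UCCPosition(0, 0)  # fixed-size matrix multiply: concrete from source alone
B = UCCPosition(2, 1)  # ray tracing: units need data + params, constraints need params
C = UCCPosition(0, 2)  # H.264: only constraints vary, needing input data
D = UCCPosition(3, 3)  # least concrete: end of the diagonal
```

Under this encoding, only A reports itself as fully concrete; the others need the additional information named by their axis levels.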
1.45
1.46 -The closer the application-derived UCC is to the origin, the less additional information is needed to obtain a  concrete UCC descendant of it. The UCC labeled A in the figure is fully concrete just from the source code alone (representing for example, matrix multiply with fixed size matrices and fixed division). The UCC labeled B requires the input data and parameters to be specified before its units are concrete, but just parameters to make its constraints fully concrete (as per ray-tracing, with bounce depth specified as a parameter). The UCC labeled C only has variability in its constraints, which require input data (for example, H.264 motion vectors).
1.47 +The closer an application-derived UCC is to the origin, the less additional information is needed to obtain a concrete UCC descendant of it. The UCC labeled A in the figure is fully concrete from the source code alone (representing, for example, matrix multiply with fixed-size matrices and fixed division). The UCC labeled B requires the input data and parameters to be specified before its units are concrete, but just parameters to make its constraints fully concrete (as in ray tracing, with bounce depth specified as a parameter). The UCC labeled C only has variability in its constraints, which require input data (for example, H.264 motion vectors).
1.48  But even the least concrete UCC, out at the end of the diagonal (D in the figure), generates a concrete descendant during a run of the application.
1.49
1.50  Notice, though, that even a fully concrete UCC still has degrees of freedom, in deciding which units to run on which hardware and in what order of execution. These decisions fix interactions within the hardware, to yield the communication patterns and consequent performance seen during the run.
1.51 @@ -359,7 +359,7 @@
1.52
1.53  For example, specialization may perform a static scheduling, which fixes the units, moving the UCC towards the origin. Alternatively, the toolchain may inject manipulator code for the runtime to use, which lets it divide units during the run when it needs more. The injection of manipulator code makes the UCC less concrete, moving it further from the origin.
1.54
1.55 -The progression of UCCs generated during the life-line indicate what entity has control over the units and constraints that appear in the final concrete UCC. Viewing the progression gives insight into what is inside the application programmer's control vs under control of each tool in the toolchain or  the runtime. For example, the original application-derived UCC  indicates what can be done statically: the further out on the diagonal that UCC is, the less scheduling can be done statically in the toolchain.
1.56 +The progression of UCCs generated during the life-line indicates which entity has control over the units and constraints that appear in the final concrete UCC. Viewing the progression gives insight into what is within the application programmer's control versus under the control of each tool in the toolchain or the runtime. For example, the original application-derived UCC indicates what can be done statically: the further out on the diagonal that UCC is, the less scheduling can be done statically in the toolchain.
1.57
1.58  In this paper, we do not suggest how to represent UCCs far out on the diagonal. One of those actually indicates a multi-verse of concrete UCCs. Which of them materializes depends on the data and what the scheduler decides. We only represent the concrete UCC that materializes during a run and leave the question of representing less concrete ones to future work.
1.59
1.60 @@ -418,7 +418,7 @@
1.61    \label{fig:UCC_Levels}
1.62  \end{figure}
1.63
1.64 -Fig. \ref{fig:UCC_Levels} shows that these UCCs are related to each other in the same hierarchy as the runtimes. A unit scheduled in one runtime is broken into smaller units in the one below it. Each of those units is  then separately scheduled, making a separate UCC just for them. So, as the figure shows, a unit contains another entire UCC.
1.65 +Fig. \ref{fig:UCC_Levels} shows that these UCCs are related to each other in the same hierarchy as the runtimes. A unit scheduled by one runtime is in turn broken into sub-units that exist at a lower level. Each of those units is then separately scheduled by the lower-level runtime, according to a UCC just for them. So, as the figure shows, a unit in one UCC has an entire lower-level UCC inside it.
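The nesting just described can be made concrete with a minimal sketch. This is our assumption of a data structure, not the paper's implementation: a unit optionally carries an entire lower-level UCC inside it, mirroring the runtime hierarchy.

```python
# Hypothetical sketch: each unit may contain an entire lower-level UCC.
class Unit:
    def __init__(self, name, inner_ucc=None):
        self.name = name
        self.inner_ucc = inner_ucc  # the UCC nested inside this unit, if any

class UCC:
    def __init__(self, units, constraints=()):
        self.units = units              # list of Unit
        self.constraints = constraints  # e.g. (pred, succ) dependency pairs

def count_leaf_units(ucc):
    """Count the bottom-level units reachable through the hierarchy."""
    total = 0
    for u in ucc.units:
        total += count_leaf_units(u.inner_ucc) if u.inner_ucc else 1
    return total

# An application-level unit whose inner UCC holds four lower-level work units:
inner = UCC([Unit("w%d" % i) for i in range(4)])
app = UCC([Unit("matrix_multiply", inner_ucc=inner)])
```

Walking the structure recursively, as `count_leaf_units` does, visits the separate UCC that exists for each level of the runtime hierarchy.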
1.66
1.67   %Great, that makes sense, now what about the consequence graphs?
1.68
1.69 @@ -433,7 +433,7 @@
1.70
1.71  Consider the concrete example of the SSR matrix multiply application from Section \ref{sec:casestudy}. To start we go up a level, above the process created to execute the matrix multiply. Here, a unit is the execution of an entire application, invoked via OS commands. The  work-time of that unit is the execution time of the application, from the first instruction of the application code until the last. This time equals  the critical path through the matrix multiply work, plus creation of the various VPs, and collection of the results by the results VP.
1.72
1.73 -If the example was analyzed in hierarchical fashion, the SCG seen there would be divided between levels, with creation and results accumulation  moved to the higher application-level SCG. The lower-level would only contain the work of the multiply.
1.74 +If the case study had been performed in hierarchical fashion, then the SCG seen there would be divided between levels, with creation and results accumulation moved to the higher application-level SCG. The lower level would contain only the work of the multiply.
1.75
1.76  One twist is that the divide and recombine are themselves composed of  smaller work-units.  However, they are not visible to units in the work portion, although there are causalities between them. The divide units end with creation of work-level units, while the end of work-level units causes creation of recombine units.
1.77  Hence, divide and recombine are  inside of  the higher-level unit while not part of the lower-level SCG. They can be conveniently displayed in a single visualization because of the clean connection between levels provided by the create and end causalities.
1.78 @@ -577,7 +577,7 @@
1.79  For the UCC, units can be either unweighted or weighted. Weighted units appear as rectangles with height proportional to the weight, unweighted units appear as circles. We weight the units with the number of instructions in the work. This is not entirely execution-independent, but removes some of the influence of scheduling and data, such as cache misses.
1.80
1.81  A critical path algorithm calculates the vertical position of a unit as its position within the critical path. The nodes are spread horizontally such that none overlap.
1.82 -Simple constraints (dependencies) are painted as arcs. Complicated constraints are for now displayed as an additional node bearing information on the constraint, with incoming arcs from all units whose execution status affects the constraint and outgoing arcs to the constrained units.
1.83 +Simple constraints (dependencies) are painted as arcs. Complicated constraints are for now displayed as an additional node bearing information on the constraint, with incoming arcs from all units whose execution status affects the constraint, and outgoing arcs to the constrained units.
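One way to realize the critical-path vertical placement described above is longest-path layering over the dependency arcs. This is an illustrative sketch under that assumption, not the paper's code; the function name and input shape are ours.

```python
# Illustrative sketch: vertical position = depth along the critical path,
# computed as the length of the longest dependency chain ending at each unit.
from collections import defaultdict

def layer_units(units, arcs):
    """units: iterable of unit ids; arcs: (pred, succ) dependency pairs.
    Returns {unit: layer}; units with no predecessors get layer 0."""
    preds = defaultdict(list)
    for p, s in arcs:
        preds[s].append(p)
    layer = {}
    def depth(u):
        if u not in layer:
            # One more than the deepest predecessor; -1 default gives layer 0.
            layer[u] = 1 + max((depth(p) for p in preds[u]), default=-1)
        return layer[u]
    for u in units:
        depth(u)
    return layer

# Example DAG: a -> b -> d and a -> c -> d
print(layer_units("abcd", [("a", "b"), ("b", "d"), ("a", "c"), ("c", "d")]))
# {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```

The layering assumes the constraints form a DAG, which holds for dependencies recorded from a completed run; horizontal spreading to avoid overlap would be a separate pass.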
1.84
1.85
1.86  \subsubsection{SCG}
1.87 @@ -591,13 +591,13 @@
1.88  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1.89  \section{Conclusion}
1.90  \label{sec:conclusion}
1.91 -We have examined the application of a generalized model of parallel computation to build adaptable performance visualizations, relying only on information collected through instrumenting the language runtime, with no modification to the application.
1.92 -The approach is demonstrated through the case study of instrumenting the SSR message-passing language runtime and using it to tune a parallel application.
1.93
1.94 -The resulting visualizations show that the focus on the parallelism-relevant concepts of work units and constraints on their execution allows a clearer view of parallelism-specific issues.
1.95 -By integrating visual display of constraints stemming from application structure, language runtime implementation, or hardware features, the various possible causes for performance loss are covered. A flexible filtering system for different types of constraints avoids overcharging the display.
1.96 +We have shown how to apply a new model of parallel computation to instrument any language's runtime, collect measurements, and display them effectively. The result is like having a map showing idle cores linked back to the spots in the application code responsible.
1.97
1.98 -Any approach relying only on information available to the runtime is intrinsically limited by the expressiveness of the parallel programming model. We expect that even better results will be observed for high-level'' parallel languages that more closely match application concepts instead of hardware concepts.
1.99 +A case study related the usability improvements: ease of identifying instances of performance loss (via the Scheduling Consequence Graph, SCG); verifying that better performance is possible (via the parallelism available in the Unit and Constraint Collection, UCC); connecting each loss to its cause (via the SCG), whether that cause is application structure, runtime implementation, or a hardware feature; and suggesting changes to the application to improve performance.
1.100 +
1.101 +  The benefits seen include low effort, because the instrumentation is in the runtime rather than the application; coverage of all the possible causes of performance loss; a flexible filtering system for different types of constraints that avoids overloading the display; and
1.102 + effectiveness for ``high-level'' parallel languages that closely match application concepts instead of hardware concepts.
1.103
1.104  \bibliography{bib_for_papers_12_Jy_15}
1.105