changeset 34:dfd7c969dbae
Perf Tuning text update -- adding def of UCC and consequence graph
| author | Some Random Person <seanhalle@yahoo.com> |
|---|---|
| date | Mon, 23 Apr 2012 10:49:03 -0700 |
| parents | 2ff01e61af81 |
| children | 663b333e36c0 |
| files | 0__Papers/Holistic_Model/Perf_Tune/figures/184.pdf 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex |
| diffstat | 2 files changed, 87 insertions(+), 13 deletions(-) |
line diff
Binary file 0__Papers/Holistic_Model/Perf_Tune/figures/184.pdf has changed
--- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Thu Apr 19 13:09:22 2012 +0200
+++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Mon Apr 23 10:49:03 2012 -0700
@@ -310,27 +310,28 @@
 
 \section{Introduction and Motivation}
 
-Visualizations have been around, and performance tuning tools have been around, but they leave something to be desired. They are a bit fragmented, focusing on one specific view of the application, like statistics of line of code or function call or message sends and receives. They have core time-lines, showing which function runs by time, but the user doesn't know why a function runs in that particular spot on that particular core, nor whether that is desired behavior vs erroneous behavior.
+Performance visualizations and tuning tools for parallel programs leave something to be desired, and yet are critical to achieving good performance. They fail mainly because their displays of measurements leave the user to \emph{guess} the \emph{cause} of performance loss. They provide many different views of performance measurements, like statistics by line of code, message sends and receives, and core usage timelines. But the user doesn't know why a function runs in a particular spot on a particular core, nor whether that is desired behavior vs erroneous behavior.
 
 
-There are a variety of tools, but they have in common that they are composed of several disjointed parts that fail to tell the user what they want to know. This leaves the user to guess. It's like the allegory of the five blind people and the elephant: one touches the ear and says it's round and flat, another touches the leg and says it's a tree, and so on. They each are correct, but the views don't connect to tell them the whole picture.
+It's like the allegory of the five blind people and the elephant: one touches the ear and says it's round and flat, another touches the leg and says it's a tree, and so on. Each is correct, but the views don't connect to tell them the whole picture.
 
 It is the parallel aspects of code and runtime decisions that the tools fall short on.
-The user usually knows what the application is doing semantically, but parallel performance is all about scheduling and other runtime behavior. The choices about which task or virtual processor is assigned to which core at what point in time is the heart of parallel performance. The causes of that behavior is a big part of the missing information the user wants to know.
+The user usually knows what the application is doing semantically, but parallel performance is determined by scheduling choices and other runtime behavior. The decision about which task or virtual processor is assigned to which core at what point in time is the heart of parallel performance. That is the missing information the user wants: the causes of that behavior.
 
 
-To fix this, a mental framework is needed that the views all fit into, to connect them together. The framework should be in terms of scheduling decisions, including the units decisions are made on, and the various sources of constraints on those decisions.
+To fix this, a framework is needed to connect the measurements together. It should be in terms of scheduling decisions, with the units the decisions are made on, and the various sources of constraints on those decisions.
 
-The views should indicate the units, and visually indicate the constraints, showing which are imposed by the application, which by the hardware, and which by the runtime implementation's details.
+Our contribution is such a framework. Using it, we generate views that indicate the schedulable units in the code, and all the constraints on scheduling them. These include constraints imposed by the application, by the scheduler implementation, and by hardware details.
 
-The views should also connect the units to specific segments of code that compose the units, and connect each constraint on scheduling choice to the precise source of the constraint, within the code, hardware, or runtime. They should also separate resource usage into categories of: application work, non-overlapped communication (which results from scheduling decisions), and scheduling/runtime overhead. They should also integrate parameter choices within the code that affect unit creation and constraints.
+The views connect the units to specific segments of code that compose the units, and connect each constraint on scheduling choice to the precise source of the constraint, within the code, hardware, or runtime. They also separate resource usage into categories: application work, non-overlapped communication (which results from scheduling decisions), and scheduling/runtime overhead. And they integrate parameter choices within the code that affect unit creation and constraints.
 
+We introduce in this paper a model of computation that ties all aspects of performance together, along with instrumentation and visualization that is guided by the model and that links all relevant performance tuning information together. The model and visualization tools are illustrated with a story line, which shows how they are used to performance tune a standard parallel application on two multi-core systems.
 
 
 \section{Background and Related Work}
 Performance tuning has steps that are iterated until the person tuning is satisfied. First, take measurements and display them, in order to discover discrepancies from desired behavior. Next, connect the details of the discrepancies with structure information to form a hypothesis for the cause of the discrepancy. The cause should suggest a plan to fix the problem. Then implement the plan, re-execute, gather new measurements, and repeat until satisfied.
 
-As discussed, for parallel performance, the information presented must include task/work-unit, constraints on scheduling those, and scheduling decisions actually made.
+For parallel performance, the information presented must include the task/work-units, the constraints on scheduling them, and the scheduling decisions actually made.
 
 Most of the older, more established tools come from the threads world, and conceive of the application as a processor that performs actions, but don't include the concept of application-defined tasks or constraints on them. For example,
 
@@ -397,26 +398,64 @@
 Hence, each step of performance debugging involves several aspects, including mental model of computation, application code, runtime implementation, scheduling choices, and hardware. In order to be effective, a tool used during performance tuning must be part of a complete model of computation that ties all aspects of the debugging/tuning steps together. Current tools fall short, both because they lack an encompassing model of computation, and because the tools are isolated from each other. Without integration, the user gets an incomplete picture of the computation and must resort to guessing either where the problem lies or what to do to fix it.
 
 
-We introduce in this paper a model of computation that ties all aspects of performance together, along with instrumentation and visualization that is guided by the model and that links all relevant performance
-tuning information together. The model and visualization tools are illustrated with a story line, which shows how they are used to performance tune the standard matrix-multiply application on two multi-core systems.
+We introduce in this paper a model of computation that ties all aspects of performance together, along with instrumentation and visualization that is guided by the model and that links all relevant performance tuning information together. The model and visualization tools are illustrated with a story line, which shows how they are used to performance tune the standard matrix-multiply application on two multi-core systems.
 
 Although we use standard visualization techniques [cite], our approach differs from previous work in both theoretical and practical aspects. The theory we use is the Holistic Model of Parallel Computation, which ties together parallelism-construct semantics, the scheduling choices made during a run, and specific measurements made on the cores. When put into practice, new kinds of measurements are taken, which complete the picture presented to the user, and each measurement is tied to a specific segment of code. The resulting combination not only identifies each source of performance loss, but ties it back to specific causes and suggests precise fixes, all of which is illustrated in our story line.
 
 [maybe some stuff about features and benefits of our approach: no app instrumentation, it's all inside language runtime, very low overhead, integrated with VMS-based functional debugging, and so on]
 
 
+Table of features/benefits comparing us to others -- in related work section.
+-- language invariant.. concepts are lang independent, visualization is language independent
+-- instr lang once, then works across apps
+-- Integrate measurements with runtime implementation details
+-- Integrate measurements of one kind with those of a different kind
+-- Integrate measurements of core activity with code snippets
+-- Underlying theory ties measurements to each other, to runtime, and to hardware
+-- Has concept of unit of work / task
+-- Includes communication between cores
+-- Shows what each core spends each bit of time on
+-- Visual representation directly indicates areas of performance loss
+
+\begin{table*}[ht]
+\caption{Comparison of Features and Benefits}
+\label{tabPersonDaysLang}
+\begin{center}
+\begin{tabular}{llll}
+ & Tau & Paradyn & ASH \\
+ Has Underlying Parallel Computation Model & & & Y\\
+ Has concept of schedulable unit of work & & & Y\\
+ Visualization is Invariant Across Languages & & & Y\\
+ Instrument Language only (not app code) & & & Y\\
+ Integrates Measurements into whole picture & & & Y\\
+ Visualizes all core activities & & & Y\\
+ Core's activities are integrated with scheduling decisions & & & Y\\
+ Visualizes runtime overhead & & & Y\\
+ Visualizes consequences of communications & & & Y\\
+ Visually suggests \emph{hypothesis} for performance loss & & & Y\\
+ & & & Y\\
+ & & & Y\\
+ & & & Y\\
+ L.O.C. & 470 & 290 & 310
+ \end{tabular}
+\end{center}
+\end{table*}
+
 \section{Illustrative Story of Performance Tuning}
 
 We start by showing the tool being used during a typical performance tuning session, to see how its features give benefit, and how competing tools' lack of those features makes the work more difficult.
 
-The program we debug is standard matrix multiply, with which the reader should be familiar. We run it on a 40 core machine.
+We wish to use a standard program that the reader has likely already tried to performance tune, or else wondered why its performance wasn't better. The best example is likely matrix multiply, with which the reader should be familiar, allowing concentration on the tool rather than on the application. We run it on a machine with 4 sockets of 10 cores each, for a total of 40 physical cores.
 
 
-The code includes a function that automatically divides the work into a number of tasks, based on the number of cores and a tuning parameter. It distributes the tasks across the cores in a round-robin fashion. The answers produced by the tasks are collected by a ``receiver" task and accumulated into the result matrix.
+The application code includes a function that automatically divides the work into a number of tasks, based on the number of cores and a tuning parameter. It distributes the tasks across the cores in a round-robin fashion. The answers produced by the tasks are collected by a ``receiver'' task and accumulated into a result matrix.
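
For concreteness, the division function just described might look like the following C sketch. This is a minimal illustration only, not the paper's actual code: the names (Task, divideWork, tasksPerCore) are hypothetical, and the real implementation would hand each task to an SSR virtual processor rather than fill a plain array.

    /* Minimal sketch: split numRows of the result matrix into
     * numCores * tasksPerCore tasks and assign them round-robin.
     * All names here are hypothetical. */
    typedef struct { int startRow, numRows, assignedCore; } Task;

    int divideWork(int numRows, int numCores, int tasksPerCore, Task tasks[]) {
        int numTasks    = numCores * tasksPerCore;  /* tuning parameter scales task count  */
        int rowsPerTask = numRows / numTasks;
        int leftover    = numRows % numTasks;       /* spread remainder over leading tasks */
        int row         = 0;
        for (int t = 0; t < numTasks; t++) {
            tasks[t].startRow     = row;
            tasks[t].numRows      = rowsPerTask + (t < leftover ? 1 : 0);
            tasks[t].assignedCore = t % numCores;   /* round-robin distribution */
            row += tasks[t].numRows;
        }
        return numTasks;
    }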
 
 
-The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors. It has comands for creating and destroying virtutal processors, and three kinds of send-receive pairs. The first, send\_from\_to specifies a specific sender and specific receiver. The second, send\_of\_type\_to, specifies a specific receiver, but the sender is anonymous, which increases flexibility while maintaining some control over scope. The third kind, send\_of\_type, only specifies the type, and so acts as a global communication channel. The language also includes performance constructs such as the ability to force which core a virtual processor is assigned to.
+The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors. It has commands for creating and destroying virtual processors, and three kinds of send-receive pairs. The first, send\_from\_to, specifies a specific sender and a specific receiver. The second, send\_of\_type\_to, specifies a specific receiver, but the sender is anonymous, which increases flexibility while maintaining some control over scope. The third kind, send\_of\_type, only specifies the type, and so acts as a global communication channel. The language also includes performance constructs such as the ability to force which core a virtual processor is assigned to.
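
To make the three send-receive forms easier to compare, here is a hedged C-style sketch of what their signatures could look like. The construct names come from the text above; the exact types and spellings are assumptions, not SSR's documented API.

    /* Hypothetical C signatures for the SSR constructs named above. */
    typedef struct VirtProcr VirtProcr;   /* a virtual processor */
    typedef struct Msg       Msg;         /* a message payload   */

    VirtProcr *SSR__create_procr(void (*fn)(void *), void *params);
    void       SSR__dissipate_procr(VirtProcr *vp);

    /* 1) Both endpoints named: rendez-vous between a specific pair of VPs. */
    void SSR__send_from_to(Msg *m, VirtProcr *sender, VirtProcr *receiver);
    Msg *SSR__receive_from_to(VirtProcr *sender, VirtProcr *receiver);

    /* 2) Named receiver, anonymous sender, matched on a message type. */
    void SSR__send_of_type_to(Msg *m, int type, VirtProcr *receiver);
    Msg *SSR__receive_of_type_to(int type, VirtProcr *receiver);

    /* 3) Type only: acts as a global communication channel. */
    void SSR__send_of_type(Msg *m, int type);
    Msg *SSR__receive_of_type(int type);

In the story that follows, the receiver task would block in the type-matched receive form while the anonymous worker virtual processors send in their result blocks.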
 
-A note on terminology: the word ``task'' has acquired multiple meanings in the literature, making it a vague term. We often use instead the term ``work-unit'', which is defined precisely as the trace-segment performed on a core, between two successive scheduling events, plus the set of datums consumed by that trace segment. The word task often maps well onto this precise definition, and we use both words.
+A note on terminology: the word ``task'' has acquired multiple meanings in the literature, making it a vague term. We often use, instead, the term ``work-unit'', which is defined precisely as the trace-segment performed on a core between two successive scheduling events, plus the set of datums consumed by that trace segment. The word task often maps well onto this precise definition, and we use both words, but mean work-unit when we say task.
 
 
 After functional debugging, the first run produces the visualization seen in Fig X. This is what we refer to as a consequence graph. It depicts all the scheduling operations performed by the runtime, along with the consequent usage of the cores.
@@ -537,6 +576,41 @@
 \subsection{Performance Tuning on 4 socket by 10 core by 2 context Machine}
 
 Same as for 4 core machine.. this time, point out what choices are different between 40 core and 4 core.
+
+\section{The Model Behind the Visualization}
+
+Now we expand on the model behind the visualizations, which is what ties the information together. It has two parts, a \emph{Unit \& Constraint Collection (UCC)} and a \emph{Scheduling Consequence Graph} (just consequence graph, or CG). The UCC indicates the freedom of choice the application allows, encoding what the programmer has control over. The consequence graph records which of those choices were actually made during the run, and the consequences of that set of choices. We give a precise definition of the UCC and then of the consequence graph, in turn.
+
+\subsection{Unit \& Constraint Collection}
+The UCC contains all units of work that get scheduled and all application-related constraints on scheduling them. That is a nice solid definition, but things aren't quite that simple. The complication is that different classes of application exist, with two degrees of freedom that determine how much of the UCC is actually defined in the application vs the input data, or even the scheduler.
+
+Some applications have everything determined in the code, with all units fixed and all constraints fixed. An example is matrix multiply with fixed-size matrices. But for others, the shape of the UCC is only partially defined by the application code. An extreme example is an NP-complete problem, for which the units are a function of both the particular input data and decisions made by the scheduler!
+
+We call a fully specified UCC a \emph{concrete} UCC. Every run of an application eventually winds up defining a concrete UCC. But the amount of the UCC made concrete by the application alone falls onto a two-dimensional grid. One dimension covers the units, the other the constraints.
+
+Figure X shows the two axes and the four factors on each that determine how much of the UCC the application code fixes. The horizontal axis is for units, the vertical for constraints. The values on an axis start at 0, where units/constraints are fully fixed by the application code alone; 1 means they are fixed by parameters plus code; 2 means the values of input data also play a role; and 3 means they are determined by scheduling decisions as well as (optionally) any of the rest.
+
+The closer the UCC is to the origin, the more concrete it is. Even the least concrete UCC, out at the end of the diagonal, becomes concrete during a run of the application. As the computation unfolds, data values interact with scheduling decisions to fix the units and the constraints among them, until all are concrete at the end of the run.
+
+An added twist is that an application has a life-line, spanning from code through run, and its representation may change at the different stages of life. It starts as pristine source, then moves into specialization, where code is translated into representations different from the original, and finally the specialized code is run. The UCC often changes between points in the life-line.
+
+For example, specialization may perform a transform that fixes some units, concretizing the UCC. Alternatively, the toolchain may inject manipulator code that allows the runtime scheduler to divide units during the run: the scheduler stops units and divides them to make more when it runs low on work. The injection of manipulator code moves the UCC further out.
+
+Applications that are far out on the diagonal only specify a portion of the structure of the UCC, but the UCC still tells what is inside the application's control vs under the scheduler's control. It thus indicates what can be done statically. The further out on the diagonal a UCC is, the less can be done statically.
+
+In this paper, we do not suggest how to represent UCCs far out on the diagonal. Those actually indicate a multi-verse of UCCs; which one materializes depends on the data that shows up and on what the scheduler does. We leave the question of representation to future work.
+
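To make the notion of a concrete UCC tangible, the following C sketch shows one possible in-memory encoding: a set of units plus the application-level constraints among them. The layout and the two example constraint kinds are illustrative assumptions; the paper defines the concepts, not this representation.

    /* Illustrative-only encoding of a concrete UCC. */
    typedef struct {
        int         id;
        const char *codeSegment;     /* the code segment(s) composing the unit */
    } Unit;

    typedef enum {
        SEQUENCED_BEFORE,            /* fromUnit must complete before toUnit */
        MUTUALLY_EXCLUSIVE           /* the two units may not run concurrently */
    } ConstraintKind;

    typedef struct {
        ConstraintKind kind;
        int            fromUnit, toUnit;   /* ids of the units the constraint relates */
    } Constraint;

    typedef struct {
        Unit       *units;        int numUnits;
        Constraint *constraints;  int numConstraints;
    } UCC;

In these terms, fixed-size matrix multiply could fill in both arrays from the code alone (the origin of the grid), while the NP-complete example cannot complete them until input data and scheduling decisions arrive during the run.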
+
+\subsection{Scheduling Consequence Graph}
+The second part of the model adds in the constraints caused by runtime implementation details and by hardware. It then integrates all three aspects of parallel performance: application, runtime, and hardware. The resulting graph identifies each source of lost performance and links it to the cause of the loss, as seen during the tuning story in Section X.
+
+
+\subsection{Levels of UCC and Consequence Graph}
+There is one last twist to the story of UCCs and consequence graphs: there are levels of them, which correspond to the levels of scheduler in a hierarchical machine. Work-unit (or just unit) was defined earlier in terms of instructions. An alternative definition is the work consequent from
+
+
+
 \end{document}
