# HG changeset patch # User Some Random Person # Date 1337175122 25200 # Node ID e014ef8c9e0418952f1a9015f7dec4916ec59a60 # Parent 980d375417a31656baa4ed31126060e516a2ba4b Perf tuning -- first clean version diff -r 980d375417a3 -r e014ef8c9e04 .hgignore --- a/.hgignore Fri May 04 09:44:14 2012 -0700 +++ b/.hgignore Wed May 16 06:32:02 2012 -0700 @@ -1,8 +1,25 @@ +nbproject +Makefile +build +dist +src/Default +src/.settings +src/.cproject +src/.project +.dep.inc -glob: *.o -glob: *.aux -glob: *.blg -glob: *.Backup -glob: *.bbl -glob: *.ddf -glob: *.Backup-* \ No newline at end of file +.cproject +.project +Debug + +# mercurial syntax as per: http://mercurial.selenic.com/wiki/.hgignore +.*\.Backup$ +.*\.bb$ +.*\.aux$ +.*\.blg$ +.*\.bbl$ +.*\.ddf$ + +# shell-style syntax, as per: http://www.selenic.com/mercurial/hgignore.5.html +syntax: glob +*.Backup* diff -r 980d375417a3 -r e014ef8c9e04 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf Binary file 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf has changed diff -r 980d375417a3 -r e014ef8c9e04 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex Fri May 04 09:44:14 2012 -0700 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex Wed May 16 06:32:02 2012 -0700 @@ -283,9 +283,7 @@ -\title{Performance Tuning Requires Integrating Aspects of Application, Runtime, Scheduling, and Hardware..\\ -OR Performance Tuning Requires Connecting Application to Runtime, Scheduling, and Hardware\\ - OR Integrated Performance Tuning Using Semantic Information Collected by Instrumenting the Language Runtime} +\title{Integrated Performance Tuning Using Semantic Information Collected by Instrumenting the Language Runtime} \author{ Nina Engelhardt\\ @@ -322,149 +320,55 @@ To fix this, a model is needed that provides the linkage to connect the measurements together. 
It should be in terms of scheduling decisions, having the units the decisions are made on, and the various sources of constraints on the decisions. -Our contribution is such a model and framework. Using it, we generate views that indicate the schedulable units in the code, and all the constraints on scheduling them. These include constraints imposed by the application, the runtime implementation and hardware details. +Our contribution is such a model and framework. Using it, we generate views that indicate the schedulable units in the code and all the constraints on scheduling them, and that connect the units to application code. These include constraints imposed by the application, the runtime implementation, and hardware details. -The views connect the units to specific segments of code that compose the units, and connect each constraint on scheduling choice to the precise source of the constraint, within the code, hardware, or runtime. They separate resource usage into the categories: application work, non-overlapped communication (which results from scheduling decisions), and scheduling/runtime overhead. They also integrate parameter choices, within the code, that affect unit creation and constraints. +The connections made within the views encourage hypotheses about the causes of performance losses. The views connect units to the specific segments of code that compose the units, and connect each constraint on scheduling choice to the precise source of the constraint, within the code, hardware, or runtime. They separate resource usage into the categories: application work, non-overlapped communication (which results from scheduling decisions), and scheduling/runtime overhead. They also integrate parameter choices, within the code, that affect unit creation and constraints. In visual form, the integration and connections make areas of performance loss evident, and the causes of the loss clear. 
- - We describe our model of computation, which relates the aspects of performance, and the instrumentation and visualization that are guided by that model and collect performance tuning information and link it together. The model and visualization are illustrated with a story line, which shows how they are used to performance tune a standard parallel application on a challenging multi-core system. We start with background on performance tuning and previous approaches to tools in Section X. Then... \section{Background and Related Work} -Performance tuning follows the standard iterative approach. In each step of iteration, first measurements of performance are taken and compared to what was desired. Some mental model is used to generate a hypothesis of why the measurement was less than desired. A mental model is then used to link the hypothesis to things within the programmer's control, to suggest a change to make to the code. The modified code is then run again and the iteration-step repeats until the person doing the tuning is satisfied. +Performance tuning follows a standard iterative pattern. In each iteration, first measurements of performance are taken and compared to what was desired. Some mental model is used to generate a hypothesis of why the measurement was less than desired. A mental model is then used to link the hypothesis to things within the programmer's control, to suggest a change to make to the code. The modified code is then run again and the iteration-step repeats until the person doing the tuning is satisfied. - -For performance of parallel computation, more than just runtime measurements are important. Context for those measurements must also be gathered, which includes the characteristics of the units of work that are scheduled onto resources, as well as the constraints on scheduling those. Then during the run, not only is hardware usage measured but also the scheduling decisions actually made and carried out. 
+For performance of parallel computation, more than just runtime measurements are important. Context for those measurements must also be gathered, which includes the characteristics of the units of work that are scheduled onto resources, as well as the constraints on scheduling those. Then during the run, not only is hardware usage measured but also the scheduling decisions that explain why the hardware was used that way. +Based on this pattern, we divide other approaches to performance tuning into five categories and talk about them separately. First, the four classic approaches, then a newer breed of tool that is starting to emerge. + +\subsection{Thread-model based Approaches} Most of the older more established tools come from the threads world, and conceive of the application as a processor that performs actions, but don't include the concept of application-defined tasks nor constraints on them. This makes them unable to directly connect statistics they gather to application features. The lack of connection forces the user to guess at what aspect of the code is responsible for observed performance. + + For example, Tau is a highly cited older system for performance tuning parallel applications, which is representative of thread-centric approaches. It integrates many data sources, and has rich displays. However, its model is one of cores, memories, and thread contexts, with actions taken on or by each. It has no well-defined concept of scheduling, of the units scheduled, nor of the constraints on scheduling those units. Hence, it has no view that integrates the parallelism-specific information at the heart of performance: tasks, constraints on them, and scheduling choices. 
Its model of computation is based on events, both the timing of events and counts of events. It has a system for user-supplied instrumentation to collect event information and it has a hypotheses mechanism that protects the user from having to write custom code to test their hypotheses. However, the hypotheses are in terms of the timing and counts of events. not the parallel computation relevant information of units of scheduled work and the scheduling decisions made on those. +\subsection{Application-instrumentation Approach} +Another highly cited classic performance tuning system is Paradyn[], which is representative of systems that model parallel computation as a collection of events. It is meant for applications that run for several days on multi-thousand node clusters. Its model of computation is based on events, both the timing of events and counts of events. It has a system for user-supplied instrumentation to collect event information, and it has a hypothesis mechanism that protects the user from having to write custom code to test their hypotheses. However, the hypotheses are in terms of the timing and counts of events, not the information relevant to parallel computation: the units of scheduled work and the scheduling decisions made on those. +\subsection{Event-driven Approach} Paragraph also follows an event-based model, and represents the large collection of simpler tools that instrument the MPI library. It shows whether cores are busy, and indicates communication overhead, but lacks any features that tie the communication pattern realized to application code features, which are what is under programmer control. It also fails to show runtime overhead, and which portions of idle time are caused by runtime internal constraints. Paraver and Vampir are just painting tools that take event measurements and paint them on the screen. -Other approaches concentrate on performance counter data to identify hot-spots and potential false-sharing.. 
these suffer from the same lack of encompasing computation model, leaving the user to guess at what might be the cause of measured numbers. The do a good job of saying that something might be wrong, bor a poor job of pointing to what is causing the problem, and hence leave the user baffled as to what to change in their code to get better performance. +\subsection{Performance-counter Approaches} +Other approaches, such as VTune and POPI, concentrate on performance counter data to identify hot-spots and potential false-sharing. These suffer from the same lack of an encompassing computation model, leaving the user to guess at what might be the cause of measured numbers. They do a good job of saying that something might be wrong, but a poor job of pointing to what is causing the problem, and hence leave the user baffled as to what to change in their code to get better performance. +\subsection{Newer Approaches} The commonality among the classic approaches is the lack of a model of parallel computation. One difficulty faced by early tools is that parallel applications written in MPI or threads effectively end up implementing a runtime system in the application code. In such a case, the units of work are implied in the code, and difficult for tools to recognize. Likewise, constraints on scheduling are enforced by the code, but never stated in any explicit form. - These shortcomings are rectified by recent parallel languages such as CnC and StarSs, which cleanly identify tasks and constraints on them. The people who develop such a language also develop language-specific tools to go with it. - -StarSs is clearly thinking about tasks, and even some about scheduling. but limited on scheduling -- can just place task into a queue, so can sort of manipulate scheduling but view doesn't really give all the constraints -- missing the runtime overhead, and missing the idle-time consequent from non-overlapped comm. 
+ These shortcomings are rectified by recent parallel languages such as CnC and StarSs, which cleanly identify tasks and the constraints on them. The people who develop such a language also develop language-specific tools to go with it. +The StarSs tool tries to simplify the view for the user. It doesn't give performance information directly; instead, it identifies tasks and judges each task's size against a recommended range: below that range, runtime overhead dominates the task's work, and above it, there are too few tasks to load balance. It colors a task one color if too short, another if just right, and a third if too long. - - -========== - -MPI is also machine-based abstraction, that gives communication information, but doesn't have concept of constraints . Its sort of in-between.. - -For communication, want two things: 1) idle-time on cores that is consequence from particular communication pattern, and in some cases 2) the energy due to the volume of communication. Both are consequence of the scheduling choices made. - - - - -=======================? - -Their structure information is either call graph or line of code. This is too limited for parallel performance. Need tasks and scheduling. More recent languages move towards task-based. This makes more of the needed structure information available. But we've not seen a complete or coherent presentation. - -===== - - The value of a consequence graph is linking the size of boxes in it to the decisions made by the scheduler, which are represented by the shape. This lets a person visually link scheduling decisions to consequences. The mind can quickly see empty areas, representing lost performance, and connect those to boxes and lines that cause the empty areas. They then look up the code of each box and construct of each line, which gives them the code structures that caused the loss of performance. 
This \emph{visual} linkage of undesirable effect back to \emph{cause in the code} is what sets our approach apart. -The value of the UCC is visually linking cause of performance loss to options available for fixing it. - -===== - - - -A survey of the most highly cited classic papers shows the commonality.. - -=========== - - -Use case: takes longer to create when more units -- that's because it mallocs all the matrix pieces, before it pairs any and starts the pair as work -- can see this from the graphs by linking the units to code they execute -- not the functions, specifically, but to the scheduled code-snippet. -- suggests changing the way divider works. - -Graph tells us the division code is bad -- works better when have 39 cores instead of 40 -- makes more pieces for 39 than for 40, and the end up fitting better onto the cores.. - -Systematic way of adding auto-tuning to language -- using TSC for time measurements, and counters for cache miss. Provides a framework that links the pieces together, and the UCC gives critical information useful in tuning. - -======================= - -Advantage: unlike statistical, it catches all "one off" and momentary serializations, which cause massive idling. - -Advantage: it is quick and easy to instrument a language that uses the same runtime abstraction (VMS) -- VMS was made to make language runtimes quick and easy, with low overhead (cite). VMS has the base instrumentation in it, language adds the types of constraints and the collection points. Each request to the runtime is in a separate (inlined) function, which makes adding the instrumentation simple. - -Advantage: instrument the language, NOT the application -- and yet it gives all the application-specific information from a run. -- others: Tau forces extensive instrumentation of application.. MPI based ones don't give the constraint info, nor what occupies time of core info.. 
- -======================= - -Expl of what is meant by "structure" info -- example where meas of runtime system showed that overhead of task creation took longer than task execution. Hypothesis was trivial: cause of lost performance is runtime overhead of creation is larger than work in a scheduled unit. The plan to fix is to change the number of work-units created, by changing the parameter in the divider code. Implementing this and re-executing showed that this source of performance loss was fixed by the change. - -The example shows that theory is part of hypothesis generation, because it required knowledge of the runtime and understanding that creation of a task is work performed as overhead inside the runtime. The example also shows that generating the plan to fix required understanding the segment of code that divided work into tasks, and the relationship between parameters to that code and execution time of the resulting tasks. - -Hence, each step of performance debugging involves several aspects, including mental model of computation, application code, runtime implementation, scheduling choices, and hardware. In order to be effective, a tool used during performance tuning must be part of a complete model of computation that ties all aspects of the debugging/tuning steps together. Current tools fall short, both because they lack an encompassing model of computation, and because the tools are isolated from each other. Without integration, the user gets an incomplete picture of the computation and must resort to guesses either of where the problem lies or of what to do to fix it. - - -We introduce in this paper a model of computation that ties all aspects of performance together, along with instrumentation and visualization that is guided by the model and that links all relevant performance tuning information together. 
The model and visualization tools are illustrated with a story line, which shows how they are used to performance tune the standard matrix-multiply application on two multi-core systems. - -Although we use standard visualization techniques [cite], our approach differs from previous work in both theoretical and practical aspects. The theory we use is The Holistic Model of Parallel Computation, which ties together parallelism construct semantics with scheduling choices made during a run, and specific measurements made on the cores. When put into practice, new kinds of measurements are taken, which complete the picture presented to the user, and each measurement is tied to a specific segment of code. The resulting combination not only identifies each source of performance loss, but ties it back to specific causes and suggests precise fixes, all of which is illustrated in our story line. - -[maybe some stuff about features and benefits of our approach: no app instrumentation, it's all inside language runtime, very low overhead, integrated with VMS-based functional debugging, and so on] - - -Table of features/benefits comparing us to others -- in related work section. --- language invariant.. concepts are lang independent, visualization is language independent --- instr lang once, then works across apps --- Integrate measurements with runtime implementation details --- Integrate measurements of one kind with those of a different kind --- Integrate measurements of core activity with code snippets --- Underlying theory ties measurements to each other, to runtime, and to hardware --- Has concept of unit of work / task --- Includes communication between cores --- Shows what each core spends each bit of time on --- Visual representation directly indicates areas of performance loss - -\begin{table*}[ht!] 
-\caption -{Comparison of Features and Benefits -} -\label{tabPersonDaysLang} -\begin{center} -\begin{tabular}{llll} - & Tau & Paradyn & ASH \\ - Has Underlying Parallel Computation Model & & & Y\\ - Has concept of schedulable unit of work & & & Y\\ - Visualization is Invariant Across Languages & & & Y\\ - Instrument Language only (not app code) & & & Y\\ - Integrates Measurements into whole picture & & & Y\\ - Visualizes all core activities & & & Y\\ - Core's activities are integrated with scheduling decisions & & & Y\\ - Visualizes runtime overhead & & & Y\\ - Visualizes consequences of communications & & & Y\\ - Visually suggests \emph{hypothesis} for performance loss & & & Y\\ - & & & Y\\ - & & & Y\\ - & & & Y\\ - L.O.C. & 470 & 290 & 310 - \end{tabular} -\end{center} -\end{table*} - - \section{Illustrative Story of Performance Tuning} @@ -524,14 +428,14 @@ The application code includes a function that automatically divides the work into a number of tasks, based on the number of cores and a tuning parameter. It distributes the tasks across the cores in a round-robin fashion, and then waits for completion of the calculation before initiating shutdown. The answers produced by the tasks are collected by a ``receiver" and accumulated into the result matrix, whereupon it notifies the setup function of completion. -The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors (VPs). It has commands for creating and destroying virtutal processors, and three kinds of send-receive paired operations. The first, \emph{send\_from\_to} specifies a specific sender and specific receiver. We use it to notify the VP dedicated to dividing the work (``creator VP'') that the VP collecting results (``results VP'') has received all sub-results, so it can output the final result. 
The second, \emph{send\_of\_type\_to}, specifies a specific receiver, but the sender is anonymous, which increases flexibility while maintaining some control over scope. This construct is used by the processors multiplying sub-matrices to send their sub-result to the results processor. The third kind, \emph{send\_of\_type}, only specifies the type, and so acts as a global communication channel; this is not used in our application. The language also includes a \emph{singleton} construct that designates a piece of code as to be executed only once, which we use to rearrange and copy data to get better cache behavior. A given copy is shared by several virtual processors, but the copy only needs to be performed once. Also available are some miscellaneous performance constructs, such as the ability to force which core a virtual processor is assigned to, which is used in our example program. +The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors (VPs). It has commands for creating and destroying virtual processors, and three kinds of send-receive paired operations. The first, \emph{send\_from\_to}, specifies a specific sender and specific receiver. We use it to notify the VP dedicated to dividing the work (``creator VP'') that the VP collecting results (``results VP'') has received all sub-results, so it can output the final result. The second, \emph{send\_of\_type\_to}, specifies a specific receiver, but the sender is anonymous, which increases flexibility while maintaining some control over scope. This construct is used by the processors multiplying sub-matrices to send their sub-result to the results processor. The third kind, \emph{send\_of\_type}, only specifies the type, and so acts as a global communication channel; this is not used in our application. 
The language also includes a \emph{singleton} construct that designates a piece of code to be executed only once, which we use to rearrange and copy data to get better cache behavior. A given copy is shared by several virtual processors on different cores, but the copy only needs to be performed once. Miscellaneous performance constructs are also available, such as one to force which core a virtual processor is assigned to, which is used in our example program. A note on terminology: the word ``task'' has acquired multiple meanings in the literature, making it a vague term. We often use, instead, the term ``work-unit'', which is defined precisely as the trace-segment performed on a core, between two successive scheduling events, plus the set of datums consumed by that trace segment. The word task often maps well onto this precise definition, and we use both words, but mean the precise work-unit definition when we say task. -A second visualization, seen in Figure X, depicts the constraints on the scheduling decisions, which limit the choices the runtime is allowed to make. We call this the Unit & Constraint Collection, or UCC. This shows only application-derived information. +A second visualization, seen in Figure X, depicts the constraints on the scheduling decisions, which limit the choices the runtime is allowed to make. We call this the Unit \& Constraint Collection, or UCC. This shows only application-derived information. \subsection{Walk through} -After functional debugging, the first run produces the visualization seen in Fig X. This is what we refer to as a scheduling consequence graph. +After functional debugging, the first run produces the visualization seen in Fig \ref{fig:figure1}. This is what we refer to as a scheduling consequence graph. 
It depicts all the scheduling operations performed by the runtime, along with the consequent usage of the cores. Each column is associated with one core. A blue vertical block represents the time the core spends doing the actual work of one work-unit, the height being proportional to the number of cycles taken for execution. Just above each, in light blue, is the runtime overhead spent on that work-unit, which can be broken into pieces representing acquisition of the lock on the shared semantic state, time spent performing the semantics of the parallelism construct, time spent deciding which ready task to execute next, and time spent switching from virtual processor, to the runtime, and back. @@ -541,7 +445,7 @@ Many different orderings could also have been validly chosen. Which scheduler choices are valid is determined by three kinds of constraints: the application code constraints, hardware constraints, and runtime implementation imposed constraints. -Returning to Fig \ref{story}, the lines in red, orange, and green represent application-code constraints that each tie two work-units together. The color represents the kind of constraint imposed by one kind of parallelism construct. Red is creation of a virtual processor, green is many-to-one send\_of\_type\_to, and orange is the singleton construct. For better visibility, only constraints that cross cores are shown. +Returning to Fig \ref{story}, the lines in red, orange, and green represent application-code constraints that each tie two work-units together. Each color corresponds to the kind of parallelism construct that imposed the constraint. Red is creation of a virtual processor, green is the many-to-one \emph{send\_of\_type\_to}, and orange is the singleton construct. For better visibility, only constraints that cross cores are shown. The columns represent each core, and the vertical axis represents the progress of the program execution, starting from the top. 
This gives at a glance the total runtime (height), level of parallelism during the run (number of filled columns at a specific height), and processor utilization (blue-covered area vs. white). All figures are at the same scale, so they can be compared directly. @@ -809,6 +713,143 @@ a lot of the time it's lists of measurements, or bar graphs, things like that -- over the whole application or by function -- forcing guessing of how it connects to -- if it tells you that this line creates a lot of level 2 cache misses, that doesn't tell you what the application is doing to cause this.. but when have whole UCC along with it, have context for the measurements -- puts the line of code into a framework -- it's necessary but not useful by itself -- it needs to be connected -- the unit information is more interesting than the line of code information -- line of code has only sequential meaning, missing scheduling connection -- need the scheduling behavior added -- need to know the unit of work that's causing problem, not the line of code -- unit provides a parallelism context, line of code does not.. unit provides an execution order and execution location, with implied communication -- line of code does not. + + + +====== Related work ======== + +StarSs is clearly thinking about tasks, and even some about scheduling, but limited on scheduling -- can just place task into a queue, so can sort of manipulate scheduling but view doesn't really give all the constraints -- missing the runtime overhead, and missing the idle-time consequent from non-overlapped comm. + +The PPoPP 12 paper and its ilk fall into the same category -- measurements without a model to connect measurement to cause. + +-] MPI event based (Paragraph) + +-] Invasive Application Instrumentation approach (Paradyn) + +-] Thread-model based (Tau) + +-] profiling based (VTune, PPoPP 12 paper..) + +-] Task-based +(StarSs) + +? 
+ +-] Instrument language runtime -- all applications inherit free + +-] Computation model backs it, connects effect back to cause + +-] Visual features indicate which hardware is idle + +-] Visual features indicate cause of idleness + +-] Low overhead instrumentation + +-] Task carries extra info that organizes -- connects disparate measurements -- helps connect effect back to cause + +-] Multiple languages easily reuse instrumentation + +-] Visual features indicate language features from app-code + +? + + +========== + +MPI is also a machine-based abstraction that gives communication information, but doesn't have a concept of constraints. It's sort of in-between.. + +For communication, want two things: 1) idle-time on cores that is consequence from particular communication pattern, and in some cases 2) the energy due to the volume of communication. Both are consequence of the scheduling choices made. + + + + +=======================? + +Their structure information is either call graph or line of code. This is too limited for parallel performance. Need tasks and scheduling. More recent languages move towards task-based. This makes more of the needed structure information available. But we've not seen a complete or coherent presentation. + +===== + + The value of a consequence graph is linking the size of boxes in it to the decisions made by the scheduler, which are represented by the shape. This lets a person visually link scheduling decisions to consequences. The mind can quickly see empty areas, representing lost performance, and connect those to boxes and lines that cause the empty areas. They then look up the code of each box and construct of each line, which gives them the code structures that caused the loss of performance. This \emph{visual} linkage of undesirable effect back to \emph{cause in the code} is what sets our approach apart. +The value of the UCC is visually linking cause of performance loss to options available for fixing it. 
+ +===== + + + +A survey of the most highly cited classic papers shows the commonality.. + +=========== + + +Use case: takes longer to create when more units -- that's because it mallocs all the matrix pieces, before it pairs any and starts the pair as work -- can see this from the graphs by linking the units to code they execute -- not the functions, specifically, but to the scheduled code-snippet. -- suggests changing the way divider works. + +Graph tells us the division code is bad -- works better when have 39 cores instead of 40 -- makes more pieces for 39 than for 40, and they end up fitting better onto the cores.. + +Systematic way of adding auto-tuning to language -- using TSC for time measurements, and counters for cache miss. Provides a framework that links the pieces together, and the UCC gives critical information useful in tuning. + +======================= + +Advantage: unlike statistical, it catches all "one off" and momentary serializations, which cause massive idling. + +Advantage: it is quick and easy to instrument a language that uses the same runtime abstraction (VMS) -- VMS was made to make language runtimes quick and easy, with low overhead (cite). VMS has the base instrumentation in it, language adds the types of constraints and the collection points. Each request to the runtime is in a separate (inlined) function, which makes adding the instrumentation simple. + +Advantage: instrument the language, NOT the application -- and yet it gives all the application-specific information from a run. -- others: Tau forces extensive instrumentation of application.. MPI based ones don't give the constraint info, nor what occupies time of core info.. + +======================= + +Expl of what is meant by "structure" info -- example where meas of runtime system showed that overhead of task creation took longer than task execution. Hypothesis was trivial: cause of lost performance is runtime overhead of creation is larger than work in a scheduled unit. 
The plan to fix it is to change the number of work-units created, by changing the parameter in the divider code. Implementing this and re-executing showed that this source of performance loss was fixed by the change.
+
+The example shows that theory is part of hypothesis generation, because it required knowledge of the runtime and the understanding that creation of a task is work performed as overhead inside the runtime. The example also shows that generating the plan to fix it required understanding the segment of code that divided work into tasks, and the relationship between parameters to that code and the execution time of the resulting tasks.
+
+Hence, each step of performance debugging involves several aspects, including a mental model of computation, application code, runtime implementation, scheduling choices, and hardware. In order to be effective, a tool used during performance tuning must be part of a complete model of computation that ties all aspects of the debugging/tuning steps together. Current tools fall short, both because they lack an encompassing model of computation, and because the tools are isolated from each other. Without integration, the user gets an incomplete picture of the computation and must resort to guesses, either of where the problem lies or of what to do to fix it.
+
+
+We introduce in this paper a model of computation that ties all aspects of performance together, along with instrumentation and visualization that is guided by the model and that links all relevant performance-tuning information together. The model and visualization tools are illustrated with a story line, which shows how they are used to performance-tune the standard matrix-multiply application on two multi-core systems.
+
+Although we use standard visualization techniques [cite], our approach differs from previous work in both theoretical and practical aspects.
The theory we use is The Holistic Model of Parallel Computation, which ties together parallelism construct semantics with scheduling choices made during a run, and specific measurements made on the cores. When put into practice, new kinds of measurements are taken, which complete the picture presented to the user, and each measurement is tied to a specific segment of code. The resulting combination not only identifies each source of performance loss, but ties it back to specific causes and suggests precise fixes, all of which is illustrated in our story line. + +[maybe some stuff about features and benefits of our approach: no app instrumentation, it's all inside language runtime, very low overhead, integrated with VMS-based functional debugging, and so on] + + +Table of features/benefits comparing us to others -- in related work section. +-- language invariant.. concepts are lang independent, visualization is language independent +-- instr lang once, then works across apps +-- Integrate measurements with runtime implementation details +-- Integrate measurements of one kind with those of a different kind +-- Integrate measurements of core activity with code snippets +-- Underlying theory ties measurements to each other, to runtime, and to hardware +-- Has concept of unit of work / task +-- Includes communication between cores +-- Shows what each core spends each bit of time on +-- Visual representation directly indicates areas of performance loss + +\begin{table*}[ht!] 
+\caption +{Comparison of Features and Benefits +} +\label{tabPersonDaysLang} +\begin{center} +\begin{tabular}{llll} + & Tau & Paradyn & ASH \\ + Has Underlying Parallel Computation Model & & & Y\\ + Has concept of schedulable unit of work & & & Y\\ + Visualization is Invariant Across Languages & & & Y\\ + Instrument Language only (not app code) & & & Y\\ + Integrates Measurements into whole picture & & & Y\\ + Visualizes all core activities & & & Y\\ + Core's activities are integrated with scheduling decisions & & & Y\\ + Visualizes runtime overhead & & & Y\\ + Visualizes consequences of communications & & & Y\\ + Visually suggests \emph{hypothesis} for performance loss & & & Y\\ + & & & Y\\ + & & & Y\\ + & & & Y\\ + L.O.C. & 470 & 290 & 310 + \end{tabular} +\end{center} +\end{table*} + + =================== Perf bounded by: the application-imposed constraints on scheduling units, the hardware-imposed constraints on scheduling them, and the runtime-implementation imposed constraints on scheduling them. 
diff -r 980d375417a3 -r e014ef8c9e04 0__Papers/VMS/Universal/figures/control_flow.svg --- a/0__Papers/VMS/Universal/figures/control_flow.svg Fri May 04 09:44:14 2012 -0700 +++ b/0__Papers/VMS/Universal/figures/control_flow.svg Wed May 16 06:32:02 2012 -0700 @@ -9,8 +9,8 @@ xmlns="http://www.w3.org/2000/svg" xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" - width="990" - height="765" + width="1488.189" + height="1052.3622" id="svg2" sodipodi:version="0.32" inkscape:version="0.48.1 " @@ -520,9 +520,9 @@ objecttolerance="10" inkscape:pageopacity="0.0" inkscape:pageshadow="2" - inkscape:zoom="1.05" + inkscape:zoom="0.97968464" inkscape:cx="457.74508" - inkscape:cy="272.59162" + inkscape:cy="264.16662" inkscape:document-units="px" inkscape:current-layer="layer1" showgrid="false" @@ -539,7 +539,7 @@ image/svg+xml - + @@ -547,7 +547,7 @@ inkscape:label="Layer 1" inkscape:groupmode="layer" id="layer1" - transform="translate(-3.5307846,441.70658)"> + transform="translate(-3.5307846,729.06884)"> + transform="translate(160,-40)"> + transform="translate(177.05331,-604)"> Non-suspend end of work-unit Suspend at end of work-unit @@ -757,17 +757,17 @@ sodipodi:nodetypes="cc" inkscape:connector-curvature="0" id="path11472" - d="m 399.56974,-144.00562 0,34.68367" + d="m 399.56974,-364.00562 0,34.68367" style="fill:none;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-start:none;marker-end:url(#Arrow2Mend-8-7-0)" /> @@ -787,58 +787,58 @@ xml:space="preserve" style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS" x="729.43005" - y="-121.95343" + y="-341.95343" id="text11616" sodipodi:linespacing="100%"> new work-unit is attached to a 
VP's context new work-unit hasown localcontext which has attached context Non-suspend end of work-unit @@ -909,17 +909,17 @@ xml:space="preserve" style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet MS;-inkscape-font-specification:Trebuchet MS" x="95.276283" - y="-67.95343" + y="-287.95343" id="text11750" sodipodi:linespacing="100%">purely local (no ctxt attached) + transform="translate(-22.94669,-664)"> CILK is example of this case,when leaf child finishes Dataflow is example of this case pthread is example of this case,as is Cilk when suspends on sync @@ -1001,61 +1001,61 @@ sodipodi:nodetypes="cc" inkscape:connector-curvature="0" id="path12020" - d="m 228.09216,-156.20795 41.06088,0" + d="m 228.09216,-376.20795 41.06088,0" style="fill:none;stroke:#000000;stroke-width:0.99999994px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-start:none;marker-end:url(#Arrow2Mend-8-7-0)" /> no workin Q (in CILK, counts child ends, and(for CILK, Q filled by async spawns) + transform="translate(-280,-40)"> + transform="translate(77.05331,-503.31973)"> give them togive to hdlrs.constraintcheck timers,updater andfire hdlr forAssigner + id="tspan3523">expired ones + transform="translate(617.05331,-603.31973)"> + transform="translate(17.05331,-404)"> + transform="translate(-62.94669,-344)"> (in dataflow & CILK,(in dataflow & CILK, msgs from other cores go to Assigner to ask that ask for or push for work and to push) + x="273.27625" + y="33.726822" + id="tspan3527"> work, go to assigner) (in dataflow & CILK, push work to other cores via msgs -- remember constraints that cross cores) @@ -1373,11 +1373,11 @@ sodipodi:nodetypes="cc" inkscape:connector-curvature="0" id="path12490" - d="M 328.33352,16.826612 369.78149,-154.44636" + d="m 328.33352,-203.17339 41.44797,-171.27297" 
style="fill:none;stroke:#000000;stroke-width:1.00000012px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-end:url(#Arrow2Mend-8-7-0)" /> + transform="translate(97.05331,-404)"> - - send work-push msg - - @@ -1447,10 +1419,10 @@ sodipodi:nodetypes="cc" inkscape:connector-curvature="0" id="path12550" - d="m 239.95132,116.23968 -77.23827,34.97636" + d="m 239.95132,-103.76032 -77.23827,34.97636" style="fill:none;stroke:#000000;stroke-width:1.00000012px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-end:url(#Arrow2Mend-8-7-0)" /> done withno yes, constraintmsgs Assignermsgs (in CILK, completion of child on remote core notifies parent's core. In dataflow, remote pro- pendent sends data to dependents' cores) + transform="translate(437.05331,-484)"> receivework-push msg receive need-work msg & have workNo work + transform="translate(717.05331,-664)"> @@ -1745,11 +1717,11 @@ sodipodi:nodetypes="cc" inkscape:connector-curvature="0" id="path3769" - d="m 428.86861,-156.20795 200.12395,0" + d="m 428.86861,-376.20795 200.12395,0" style="fill:none;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-start:none;marker-end:url(#Arrow2Mend-8-7-0)" /> + transform="translate(297.05331,-784)"> + transform="translate(177.05331,-784)"> @@ -1839,40 +1811,40 @@ sodipodi:nodetypes="cc" inkscape:connector-curvature="0" id="path3813" - d="m 307.68401,-276.20795 61.46905,0" + d="m 307.68401,-496.20795 61.46905,0" style="fill:none;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-start:none;marker-end:url(#Arrow2Mend-8-7-0)" /> yes,chk + y="-454.95343">chk no @@ -1880,26 +1852,26 @@ xml:space="preserve" style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet 
MS;-inkscape-font-specification:Trebuchet MS" x="353.43005" - y="-253.95343" + y="-473.95343" id="text3829" sodipodi:linespacing="100%"> done with withmsgs + y="-455.95343">msgs + transform="translate(177.05331,-744)"> + transform="translate(617.05331,-763.31973)"> new work-unit has ownlocal context, butprev VP suspended (in pthreads, checks mutex structures, mutex structures, cond var structs, etc) + y="-516.5929"> cond var structs, etc) work-unitstate chgs Suspend at end of work-unit pthread is example of this case,as is Cilk when suspends on sync + y="-584.95343">as is Cilk when suspends on sync + transform="translate(-300,-280)"> yes,no done withmsgs + transform="translate(297.05331,-864)"> (in pthreads, checks mutex structures,work-unitstate chgs @@ -2473,10 +2445,10 @@ sodipodi:nodetypes="cc" inkscape:connector-curvature="0" id="path3500" - d="m 548.67846,-395.848 261.45961,-0.0912" + d="m 548.67846,-615.848 261.45961,-0.0912" style="fill:none;stroke:#000000;stroke-width:1.00000036px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-end:url(#Arrow2Mend-8-7-0)" /> + transform="translate(617.05331,-904)"> @@ -2553,11 +2525,11 @@ sodipodi:nodetypes="cc" inkscape:connector-curvature="0" id="path3590" - d="M 328.33352,22.451597 369.78149,-391.37328" + d="m 328.33352,-197.5484 41.44797,-413.82488" style="fill:none;stroke:#000000;stroke-width:1.00000012px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-end:url(#Arrow2Mend-8-7-0)" /> + transform="translate(-22.94669,-904)"> No VPsNo VPsready increase backoff eachrepetition & updatebackoff stateTry to make core enterpower-down idle statewhile waiting yes, send + transform="translate(80,20)"> @@ -2778,43 +2750,43 @@ xml:space="preserve" style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:start;line-height:100%;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;font-family:Trebuchet 
MS;-inkscape-font-specification:Trebuchet MS" x="625.43005" - y="-27.95343" + y="-247.95343" id="text8403" sodipodi:linespacing="100%">Don'tDon'tsend yes,gotThe difference between top two paths is the way the request hdlr+assigner has been implemented -- use shared state on top path, but only local on the second. NOTE: the request hdlr and assigner are combined into a single straight-line piece No VPs ready @@ -2906,7 +2878,7 @@ Core-controller was a second level beneath the runtime (Language Animator is the runtime).. The sched slots were virtual physical animators, and the masterVP was the Language Animator, which the AnimationMaster Fn and the plugin Fns supplied the behavior of. The coreCtlr switched between levels -- The MasterVP was "outside" the framework of the scheduling slots -- in a way it was beside them (they took turns getting the actual phys animator, so same level) and in another way was above -- it controlled what went into the slots, so above them. In another way, the MasterVP was the Language Animator, while the slots were virtual physical, so they had no logical connection -- the slots were related to the same thing that animated the language animator.. a very strange arrangement. + x="894.00696" + y="-182.42088"> + + send work-push msg and start timer for it. + + + Control Flow of Universal Proto-Runtime + + Expired timer handler + + + + Expired timer + When send work to other core, it may be sleeping. So when timer expires, do escrow, which is atomic check of whether work was received. If not, then check if dest core is sleeping -- if no, then take work back. If yes, do wakeup and set a new timer, which will do this handling again.
diff -r 980d375417a3 -r e014ef8c9e04 0__Papers/VMS/Universal/latex/VMS_universal.tex --- a/0__Papers/VMS/Universal/latex/VMS_universal.tex Fri May 04 09:44:14 2012 -0700 +++ b/0__Papers/VMS/Universal/latex/VMS_universal.tex Wed May 16 06:32:02 2012 -0700 @@ -33,7 +33,7 @@ %+Abstract \begin{abstract} -Software has not been keeping up with parallel hardware, which slows the economy and retards adoption of new hardware. The gap is due in part to the disruption caused by moving to parallel languages, and in part to the prohibitive effort of porting application code across platforms. A leading idea for solving this is domain-specific parallel languages, where custom constructs are made to match features of the problem. However, such languages have a small number of users, which can't support the currently large effort to create such languages and port them across hardware platforms. +Software has not been keeping up with parallel hardware, which slows the economy and retards adoption of new hardware. The gap is due in part to the disruption caused by moving to parallel languages, and in part to the prohibitive effort of porting parallel code across platforms. A leading idea for solving this is domain-specific parallel languages, where custom constructs are made to match features of the problem. However, such languages have a small number of users, which can't support the currently large effort to create such languages and port them across hardware platforms. To simplify creation of domain-specific languages, we propose a "proto" runtime, which provides the cross-language portion of runtime behavior. This includes handling the concurrency issues within the runtime itself. A given language only provides sequential implementations of its constructs. @@ -45,30 +45,30 @@ \section{Introduction} -Current parallel languages, such as pthreads, TBB, OpenMP, and MPI, require programmers to learn new, complex mental models. 
Sequential programmers have to be retrained to start using them, and a new set of programming practices must be adopted for them. Such retraining generates a large disruption in application-software companies that attempt to start using these parallel languages. +Current parallel languages, such as pthreads, TBB, OpenMP, and MPI, require programmers to learn new, complex mental models. Sequential programmers have to be retrained to use the parallel languages, and a new set of programming practices must be adopted for them. Such retraining generates a large disruption in the application-software companies that attempt to start using these parallel languages. + +Further, after learning the new language and adopting the new programming practices, the companies still have to hand-tune each application for each target hardware platform. It also means that customers have to get a new binary when they upgrade hardware. Both effects are costly, and tend to retard adoption of new hardware, despite potential performance gains. +Further, after learning the new language and adopting the new programming practices, the companies still have to hand-tune each application for each target hardware platform. Not only that, but customers have to acquire a new binary when they upgrade hardware. Both effects are costly, and tend to retard adoption of new hardware, despite potential performance gains. + +The net effect is that software lags behind hardware, and the potential advantages across the economy of new hardware designs are lost. What is needed is a way to ease the transition from sequential programming to parallel programming, and to reduce the need for hand-tuning to get efficient performance on new hardware. +The net result is that software lags behind hardware, and the potential advantages across the economy of new hardware designs are lost.
What is needed is a way to ease the transition from sequential programming to parallel programming, and to reduce the amount of hand-tuning done to get efficient performance on new hardware. -Domain-specific languages promise to deliver both easier transition and efficiency across hardware[]. They do this by providing custom constructs that match patterns in the application. Thus, they are natural for the sequential programmer to use. The constructs ``hide'' the parallelism inside themselves, letting the runtime or toolchain handle it, freeing the programmer. +Domain-specific languages promise to deliver both easier transition and efficiency across hardware[]. They do this by providing custom constructs that match patterns in the application. Thus, they are natural for the sequential programmer to use. The custom constructs ``hide'' parallelism inside themselves, letting the runtime or toolchain handle it, freeing the programmer, and putting performance inside the tools. This makes coding easier and reduces hand-tuning. To illustrate the concept, we briefly cover one such language, which is used for describing the behavior of hardware, called HWSim. It has a straightforward means for describing hardware, and then extracts parallelism from the description. The extracted parallelism is then efficiently exploited on a variety of target platforms. -One blockage to wide-spread adoption of such domain-specific languages is the cost of developing them. Currently, toolchains are typically created, with an optimizer and back end for each target platform, which is expensive to create. +One blockage to widespread adoption of such domain-specific languages is the cost of developing them. Currently, a toolchain is created, including an optimizer and back end for each target platform. This is expensive. An alternative approach is so-called embedded languages, in which the new constructs are viewed as library calls made from a base language, such as C.
This relies on the runtime system to provide efficient execution on a given target platform. It avoids the expense of creating a toolchain for each target, by instead creating a runtime for each target that is tuned to the hardware. This is a step forward, but such runtimes are still expensive to create. A technology to simplify runtime creation would be helpful. -In this paper, we present a means to reduce runtime creation, by not only reducing the complexity, but also by making reuse across languages more practical. It essentially breaks a runtime into two pieces: a part that implements the semantic behavior of the custom constructs, and a part that is the same for every language. +In this paper, we present a means to reduce the cost of runtime creation, not only by reducing the complexity, but also by making reuse of runtime code in new languages more practical. It essentially breaks a runtime into two pieces: a part that implements the semantic behavior of the custom constructs, and a part that is the same for every language. - A nice happenstance is that the complicated multi-threaded issues that come up inside current runtime implementations can be collected inside the part that stays the same across languages. This means the language only need supply a sequential implementation of its constructs' behavior. + A nice happenstance is that most of the complicated concurrency issues that come up inside current runtime implementations can be collected inside the part that stays the same across languages. This means the language need only supply a sequential implementation of its constructs' behavior. The cross-language part handles low-level synchronization and communication between cores, exporting a simple interface. Another nice happenstance is that the interface between the two parts of the runtime modularizes the runtime code. This, by itself, speeds development of new runtimes.
In addition, it makes sharing between languages practical, especially for the portion that chooses where to perform work, and in which order. - This portion of the runtime handles data affinity and the shape of the dependency graph, which are responsible for the resulting performance. However, it contains few, if any, language-specific portions, so is practical to share between languages, for a given target platform. This saves a non-trivial amount of development work. + This placement-choosing portion handles data affinity and the shape of the dependency graph, which are largely responsible for the resulting performance. However, it can be written to contain few language-specific portions, making it practical to share between languages, for a given target platform. This saves a non-trivial amount of development work. While runtimes built without our contribution are still free to share such code, they have no equivalent interface between the runtime pieces. This makes isolating this portion of code more time-consuming, and forces more effort to fit code from one runtime into that of a different language. @@ -79,9 +79,17 @@ Organization of paper \section{Background and Related Work} -For performance, the proto-runtime supports multiple levels of runtime hierarchy. In higher levels, work-units are large, leaving time for the decision about where to execute them to use advanced algorithms, which track data affinity and analyze dependency patterns. For lower levels, the work-units are smaller, leaving less time to search for the best location, so they have simpler algorithms. -UPR differs from pthreads, TBB, and other thread packages in that it provides more services to simplify runtime creation, and, more importantly, UPR has a mental model that is specific to runtime creation. pthreads and TBB are programming languages in their own right -- but UPR has no semantics usable in application code, because it's only a \emph{part} of a runtime.
+The current method of creating a runtime for a new domain-specific language is either writing one from scratch, normally on top of pthreads, or else modifying an existing runtime. LibGomp, the runtime for OpenMP, is a popular starting point.
+
+From anecdotal evidence gathered by talking with five groups that have created domain-specific languages, developing the bare functionality of a runtime takes anywhere from one to several months. Perhaps a majority of the time is spent on multi-threading design and debugging. That is just for basic functionality; if they want low overhead, they expect another three to six months of performance tuning.
+
+One group we interviewed used libGomp as the base upon which to build a streaming dataflow language.
+
+UPR is designed to shorten that development time.
+
+
+UPR differs from pthreads, TBB, and other thread packages in that it provides a mental model that is specific to runtime creation. Pthreads and TBB are programming languages in their own right, but UPR is not. It has no semantics usable in application code, because it's only a \emph{part} of a runtime. It is geared exclusively to runtime creation.