# HG changeset patch
# User Sean Halle
# Date 1351510341 25200
# Node ID d005f901212656488e7d223a1e8ac27c6dbd6d5e
# Parent cdd1852fe804b31ae0a95054d63782c7727209d1
FIRST CHANGE AFTER SUBMITTED VERSION -- modifying the paper for a new submission

diff -r cdd1852fe804 -r d005f9012126 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
--- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex Mon Oct 08 23:05:18 2012 -0700
+++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex Mon Oct 29 04:32:21 2012 -0700
@@ -49,7 +49,7 @@
 %MOIRAI: MOdel for Integrated Runtime Analysis through Instrumentation
-\title{Integrated Performance Tuning Using Semantic Information Collected by Instrumenting the Language Runtime}
+\title{Performance Tuning Scheduling Behavior Using Semantic Information Collected by Instrumenting the Language Runtime}
 %\authorinfo{Nina Engelhardt}
 % {TU Berlin}
@@ -71,7 +71,10 @@
 \begin{abstract}
-Performance tuning is an important aspect of parallel programming. Yet when trying to pinpoint the causes of performance loss, often insufficient knowledge of the internal structure of the application and the runtime is available to understand how the observed patterns of performance have come to pass.
+Performance tuning is an important aspect of parallel programming that involves understanding both communication behavior and scheduling behavior. Many good tools exist for identifying hot-spots in code, and idleness due to waiting on synchronization constructs. These help in tuning data layout, and to find constructs that spend a long time blocked, but leave the user guessing as to why they block for so long, with no other work to fill in. The answer often involves finding complex chain-reactions of scheduling decisions. We propose applying a novel model of parallel computation to guide the gathering and display of runtime scheduling decisions, which makes such chain-reactions easy to spot and easy to fix.
+
+ which in turn requires th
+ which requires knowledge of the internal structure of the application and the runtime is available to understand how the observed patterns of performance have come to pass.
 A trend in parallel programming languages is towards models that capture more structural information about the application, in an effort to increase both performance and ease of programming. We propose using this structural information in performance tuning tools to make the causes of performance loss more readily apparent. Our work produces a universal, adaptable set of performance visualizations that integrates this extra application structure, via a new model of parallel computation. The visualizations clearly identify idle cores, and tie the idleness to causal interactions within the runtime and hardware, and from there to the parallelism constructs that constrained the runtime and hardware behavior, thereby eliminating guesswork. This approach can be used to instrument the runtime of any parallel language or programming model without modifying the application. As a case study, we applied it to the SSR message-passing model, and we walk through a tuning session on a large multi-core machine to illustrate the improvements in identifying performance loss and generating hypotheses for the cause.
@@ -306,7 +309,7 @@
 As seen, the model has two parts, a \emph{Unit \&\ Constraint Collection (UCC)}, and a \emph{Scheduling Consequence Graph} (SCG or just consequence graph). The UCC indicates the scheduling choices the application allows, and so shows what the programmer has control over. The consequence graph says which of those were actually taken during the run and the consequences of that set of choices. We give a more precise description of UCC, then consequence graph, in turn.
-However, space is too limited for a complete definition, which is given in a companion paper submitted to a longer format venue.
+However, this paper focuses on their application to performance tuning, so we abbreviate here and focus on a formal definition of the full model in a different paper.
 \subsection{Unit \& Constraint Collection}
 The UCC contains all the units of work that get scheduled during a run, and all constraints the application places on scheduling those units. This is the simple definition, but unfortunately, this information is not always easy to obtain. The complication is that different classes of application exist, with two degrees of freedom that determine how much of the UCC is actually defined in the application vs the input data, or even in the runtime.