### changeset 78:328f337153e3

perf tuning -- played with abstract and cleaned up SCG levels
author Sean Halle Fri, 10 Aug 2012 05:32:19 -0700 1dd96de6e570 4433a26ff153 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 1 files changed, 24 insertions(+), 14 deletions(-) [+]
line diff
     1.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri Aug 10 02:21:44 2012 -0700
1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri Aug 10 05:32:19 2012 -0700
1.3 @@ -68,10 +68,10 @@
1.4
1.5
1.6  \begin{abstract}
1.7 -Performance tuning is an important aspect of parallel programming. Yet when trying to track down the causes of performance loss, a great deal of knowledge about the internal structure of both the application and the runtime is often needed to understand how the observed patterns of performance have come to pass.
1.8 +Performance tuning is an important aspect of parallel programming.  Yet when trying to pinpoint the causes of performance loss,  internal structure of the application and the runtime is often needed in order to understand how the observed patterns of performance have come to pass.
1.9  The trend in parallel programming languages has been towards models that capture more structural information about the application, in an effort to increase both performance and ease of programming. This structural information can be used to improve performance tuning tools by making the causes of performance loss more readily apparent.
1.10 -We propose a universal but adaptable way of integrating more application structure into performance visualizations, relying on a model of parallel computation. The visualizations produced clearly identify idle cores, and tie the idleness to causal interactions within the runtime and hardware, and from there to the parallelism constructs that constrained the runtime and hardware behavior, thereby eliminating guesswork.
1.11 -This is implemented for multi-core hardware, and we walk through a tuning session on a large multi-core machine to illustrate how performance loss is identified and how hypotheses for the cause are generated. We also give a concise description of the implementation and the computation model.
1.12 +We propose using a new computation model  to collect application structure and produce performance visualizations. The visualizations clearly identify idle cores, and tie the idleness to causal interactions within the runtime and hardware, and from there to the parallelism constructs that constrained the runtime and hardware behavior, thereby eliminating guesswork.
1.13 +The approach is used to instrument the runtime of any language without  application modifications. This is implemented for multi-core hardware, and we walk through a tuning session on a large multi-core machine to illustrate how performance loss is identified and how hypotheses for the cause are generated. We also give a concise description of the implementation and the computation model.
1.14  \end{abstract}
1.15
1.16
1.17 @@ -79,7 +79,7 @@
1.18
1.19
1.20  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1.21 -\section{Introduction and Motivation}
1.22 +\section{Introduction}
1.23  \label{sec:intro}
1.24
1.25  Performance visualizations and tuning tools for parallel programs are critical to achieving good performance, and yet current solutions leave something to be desired. All too often, performance tuning consists of staring at abstract views of usage or statistics, trying to \emph{guess} the cause of performance loss. Despite many different views of performance measurements, such as frequency by line of code, message sends and receives, and core usage timelines, the user doesn't know why a function runs in a particular spot on a particular core, nor whether that is desired behavior or erroneous behavior.
1.26 @@ -416,19 +416,29 @@
1.27
1.28  \subsubsection{Levels of Consequence Graph}
1.29
1.30 -With multiple levels of scheduling activities, a scheduling consequence graph can be constructed for each level. How, then, do these tie together?
1.31 +With multiple levels of scheduling activities, a scheduling consequence graph can be constructed for each level.
1.32
1.33 -For a given level, a lower-level runtime is seen as a single processing unit. So, what is the consequence of deciding to schedule a given work unit onto one of those  processing units? When the goal is  improving performance, the consequence of interest is the time taken until completion of the work unit.
1.34 +For a given level, a lower-level runtime is seen as a single processor entity. So, what is the consequence of deciding to schedule a given work unit onto one of those  processor entities? When the goal is  improving performance, the consequence of interest is the time taken until completion of the work unit.
1.35
1.36 -Hence, the question becomes,  when can the upper-level work unit be considered completed? The answer is, when all the lower-level units of which it is comprised are completed, plus any additional work spent dividing up the work and combining results. Once the lower-level consequence graph is established, this  time can be easily determined: it is the critical path across the lower-level consequence graph plus the additional, non-overlapped, time spent creating the lower-level units and combining results into the format of the higher-level unit.
1.37 +Hence, the question becomes,  when can the upper-level work unit be considered completed? The answer is, when all the lower-level units of which it is comprised are completed, plus any additional work spent dividing up the work and combining results. Once the lower-level consequence graph is established, this  time can be easily determined: it is the critical path across the lower-level consequence graph plus the additional, non-overlapped, time spent creating the lower-level units and combining results into the format of the higher-level unit. The divide and recombine are the creation of    the lower level, and so not part of it.
1.38
1.39 -Consider the concrete example of the SSR matrix multiply application, from the story in section \ref{sec:casestudy}. Going up,  the level above the execution under study involves the invocations of entire applications, via OS commands. At that level, a unit is an entire process, and the  work-time of that unit is the execution time of the application. The SSR matrix multiply execution time includes the critical path through the matrix multiply work, plus creation of the various VPs, and collection of the results by the results VP. If the story in section \ref{sec:casestudy} were done in hierarchical fashion, the SCG seen there would be divided between levels, with some pieces moved to the application-level SCG.  In particular, the creation time and results accumulation time would be moved to the application-level SCG and represent overhead added to the lower-level SCG. The lower-level would only contain the work of the multiply.
1.40
1.41 -One thing that is specific to the consequence graph and does not appear in the UCC is overhead. How is overhead dealt with across levels? We must determine what activities in higher levels count as overhead, and what portion of that is overlapped with work in the lower level SCG, and hence does not contribute to the critical path.
1.42 +Consider the concrete example of the SSR matrix multiply application, from the story in section \ref{sec:casestudy}. To start we go up a level, above the process created to execute the matrix multiply. Here, a unit is the execution of an entire application, invoked via OS commands. The  work-time of that unit is the execution time of the application, from the first instruction of the application code until the last. This time equals  the critical path through the matrix multiply work, plus creation of the various VPs, and collection of the results by the results VP.
1.43
1.44 -To complete the discussion, we consider going down yet another level, into the physical processor.  In modern super-scalar multi-cores, each core, is a parallel processor composed of functional blocks such as arithmetic or floating-point blocks. At this level, we break down a single  SSR work unit, as seen in section \ref{sec:casestudy}, into into smaller work units: individual instructions.
1.45 +If the story in section \ref{sec:casestudy} were done in hierarchical fashion, the SCG seen there would be divided between levels, with creation and results accumulation  moved to the higher application-level SCG. The lower-level would only contain the work of the multiply.
1.46
1.47 -Now, what does the SCG look like for the instructions inside an SSR level work-unit? For the SSR unit, the work-time  is the critical path  from the jump out of the runtime into the first instruction of the unit, up until the jump back from the unit's code, into the runtime. This critical path is set by the issue logic in the core, which is in fact a relatively complicated scheduler that generates the SCG. It analyses the register use dependencies between instructions (=constraints between work units) and dispatches (=assigns) them to different functional blocks. The overhead for an instruction is the pipestages spent in fetch, decode, rename, and the issue logic. Most of this is overlapped by the pipeline effect. but  contributes to the critical path during pipeline disruptions like mis-predicted branches.
1.48 +One twist is that the divide and recombine are themselves composed of  smaller work-units.  However, they are not visible to units in the work portion, although there are causalities between them. The divide units end with creation of work-level units, while the end of work-level units causes creation of recombine units.
1.49 +Hence, divide and recombine are  inside of  the higher-level unit while not part of the lower-level SCG. They can be conveniently displayed in a single visualization because of the clean connection between levels provided by the create and end causalities.
1.50 +
1.51 +
1.52 +
1.53 +
1.54 +
1.55 +To complete the discussion, we consider going down another level, into the physical processor.  In modern super-scalar multi-cores, each core is a parallel processor composed of functional blocks such as arithmetic or floating-point blocks, along with logic that issues instructions to those blocks. We first break down a single  SSR work unit of the kind seen in section \ref{sec:casestudy}, into smaller work units, namely individual instructions. Then we break instructions down into pipeline stages.
1.56 +
1.57 +The UCC for the instructions inside a single SSR level work-unit is a straight line of sequential dependencies. However, the pipeline stages inside a core modify the UCC. They analyse the register use dependencies between instructions and generate a new UCC consisting of  data dependencies among functions. Issue logic acts as the scheduler that generates the order of executing these functions on the functional blocks.  The choices made by the issue logic determine the shape of the SCG for this level, and the critical path of that SCG is the work time of the higher SSR level work-unit.
1.58 +
1.59 +Notice that the overhead for an instruction is the pipestages spent in fetch, decode, analysis,  issue, and retirement. Most of this is overlapped by the pipeline effect. but  contributes to the critical path during pipeline disruptions like mis-predicted branches. The analysis portion corresponds to divide overhead seen in the higher levels, and issue is runtime overhead at the instruction level.
1.60
1.61  %To give insight into going from the SSR work-unit level up to the application-level, let us now go from the instruction level up to the SSR unit level, looking up' so to speak. Considering overhead, from inside the core,  the SSR level overhead looks just like more instructions.
1.62
1.63 @@ -461,7 +471,7 @@
1.64  One benefit seen directly from these details is that many things remain the same across languages, because the computation model is invariant across languages.  This includes the basic nature of the visualizations and details of the implementation process.
1.65  The reason is that visualizations are based on units, constraints, and causality, all of which exist in every language (even though some older ones obfuscate units).
1.66
1.67 -Another benefit evident from the details in this section is that the instrumentation is done only once, for a language. All applications written in the language inherit the visualizations, without any change to the application code.
1.68 +Another benefit evident from the details in this section is that the instrumentation is done only once, for a language. All applications written in the language inherit the visualizations, without  change to the application code.
1.69
1.70  \subsection{Meta-units and unit life-line in the computation model}
1.71  \label{subsec:LifeLine}
1.72 @@ -471,9 +481,9 @@
1.73
1.74  Each  unit also has a life-line, which progresses so:  creation of the meta-unit \pointer , state updates that affect constraints on the unit \pointer,   the decision is made to animate the unit  \pointer, movement of the meta-unit plus data to physical resources that do the animation \pointer  , animation of the unit, which does the work \pointer,  communication of state-update, that unit has completed, and hardware is free \pointer ,  constraint updates within runtime, possibly causing new meta-unit creations or freeing other meta-units to be chosen for animation.  This repeats for each unit. Each step is part of the model.
1.75
1.76 - Note a few implications: first, many activities internal to the runtime are part of a unit's life-line, and take place when only the meta-unit exists, before or after the work of the actual unit; second, communication that is internal to the runtime is part of the unit life-line, such as state updates; third, creation may be implied, such as in pthreads, or triggered such as in dataflow, or be by explicit command such as in StarSs, and once created, a meta-unit may languish before the unit it represents is free to be animated.
1.77 + Note a few implications: first, many activities internal to the runtime are part of a unit's life-line, and take place when only +the meta-unit exists, before or after the work of the actual unit; second, communication that is internal to the runtime is part of the unit life-line, such as state updates; third, creation may be implied, such as in pthreads, or triggered such as in dataflow, or be by explicit command such as in StarSs, and once created, a meta-unit may languish before the unit it represents is free to be animated.
1.78
1.79 -Also, note that this explains why the visualizations remain largely the same across languages. The concepts of a meta-unit, a unit, constraints on a unit, and a unit life-line are all valid in every language.  The visualizations are based on these concepts, and so likewise largely remain the same.  In the UCC, only the constraint patterns that represent  the language's constructs change between languages. In the SCG, only which construct a line in the SCG represents changes.
1.80 +Also, note that this explains why the visualizations remain largely the same across languages. The concepts of a meta-unit, a unit, constraints on a unit, and a unit life-line are all valid in every language.  The visualizations are based on these concepts, and so likewise largely remain the same.  In the UCC, only the constraint patterns that represent  the language's constructs change between languages. In the SCG, only whh construct a line in the SCG represents changes.
1.81
1.82  \subsection{Mapping model onto implementation details in runtime}
1.83
`