### changeset 80:2bf63d88116a

perf tuning -- completed levels of SCG, and small tweak of abstract
author Sean Halle Sat, 11 Aug 2012 17:38:18 -0700 4433a26ff153 b5e314f7cdb9 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 1 files changed, 26 insertions(+), 25 deletions(-) [+]
line diff
     1.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri Aug 10 18:44:02 2012 +0200
1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Sat Aug 11 17:38:18 2012 -0700
1.3 @@ -68,9 +68,9 @@
1.4
1.5
1.6  \begin{abstract}
1.7 -Performance tuning is an important aspect of parallel programming. Yet when trying to pinpoint the causes of performance loss, often insufficient knowledge of the internal structure of the application and the runtime is available to understand how the observed patterns of performance have come to pass.
1.8 -A trend in parallel programming languages is towards models that capture more structural information about the application, in an effort to increase both performance and ease of programming. This structural information can be used to improve performance tuning tools by making the causes of performance loss more readily apparent.
1.9 -We propose a universal, adaptable set of performance visualizations that integrates more application structure, via a new model of parallel computation. The visualizations clearly identify idle cores, and tie the idleness to causal interactions within the runtime and hardware, and from there to the parallelism constructs that constrained the runtime and hardware behavior, thereby eliminating guesswork.
1.10 +Performance tuning is an important aspect of parallel programming. Yet when trying to pinpoint the causes of performance loss, often times insufficient knowledge  of the internal structure of the application and the runtime is available to understand how the observed patterns of performance have come to pass.
1.11 +A trend in parallel programming languages is towards models that capture more structural information about the application, in an effort to increase both performance and ease of programming. We propose using this structural information in  performance tuning tools to make the causes of performance loss more readily apparent.
1.12 +Our work produces a universal, adaptable set of performance visualizations that integrates this extra application structure,via a new model of parallel computation. The visualizations clearly identify idle cores, and tie the idleness to causal interactions within the runtime and hardware, and from there to the parallelism constructs that constrained the runtime and hardware behavior, thereby eliminating guesswork.
1.13  This approach can be used to instrument the runtime of any parallel programming model without modifying the application. As a case study, we applied it to a message-passing model, and we walk through a tuning session on a large multi-core machine to illustrate how performance loss is identified and how hypotheses for the cause are generated.
1.14  \end{abstract}
1.15
1.16 @@ -114,7 +114,7 @@
1.17  When a less low-level parallel library is used, much of this problem disappears. For instance, in an application with MPI message passing \cite{MPI}, the information thread 2 spends little time waiting for messages from thread 0 but a lot of time waiting for messages from thread 1'' can be recorded, where in pthreads only thread 2 spends a lot of time waiting for a signal'' would be visible. It is much easier to reach the conclusion that the bottleneck is the slow rate at which thread 1 produces data from the first than from the second.
1.18  Leveraging the MPI library is a well-trodden path, with many tools available \cite{PMPI} \cite{MPICL} (the above-cited Tau and Paradyn also added this feature). A frequent complaint of users of these tools is that in complex programs, the large number of messages sent makes it difficult to see problems. As there is little that distinguishes messages beside size, it is difficult to implement good filtering. Once again, we find that the limitations come from a parallel model that captures insufficient information.
1.19
1.20 -Fortunately, parallel languages are evolving in the same direction, and it is generally thought that semantically rich parallel constructs, when they match the structure of the application well, are key to improved productivity. With languages such as \cite{CnCInHotPar} and StarSs \cite{StarSs}, the information available to the language runtime offers a much clearer picture of what constraints are placed on the execution of tasks and where they come from. Consequently, the application structure can be well reflected in visualizations, as the Temanejo debugger for StarSs \cite{PerfToolStarSs} demonstrates.
1.21 +Fortunately, parallel languages are evolving in the same direction, and it is generally thought that semantically rich parallel constructs, when they match the structure of the application well, are key to improved productivity. With languages such as \cite{CnCInHotPar} and StarSs \cite{StarSs}, the information available to the language runtime offers a much clearer picture of what constraints are placed on the execution of tasks and where they come from. Consequently, the application structure can be well reflected in visualizations, as the Temanejo debugger  \cite{PerfToolStarSs} for StarSs demonstrates.
1.22
1.23  We believe that the quantities best to measure when performance tuning are scheduling decisions and the consequent usage of communication hardware and cores. Hence, the mental model should have, as concepts,  units of scheduled work, and scheduling decisions made on them, then relate those to consequent hardware behavior. The model should also relate all of those to application code, such as code boundaries that identify units, and constraints (dependencies) the application places on scheduling decisions.
1.24
1.25 @@ -203,7 +203,7 @@
1.26    The second kind of activity is runtime overhead, represented by a gray region. This is the overhead spent on that particular work-unit. When desired by the user, it is further broken into pieces representing activities inside the runtime. The options include time spent on: constraints, when determining readiness of the work-unit; deciding which ready unit to assign to which hardware; and time spent switching from virtual processor, to the runtime, and back. In this paper, we show all runtime overhead lumped together, however in other circumstances a breakdown can be key to understanding interaction between runtime and application.
1.27
1.28
1.29 -The other type of visual feature seen in Fig \ref{fig:SCG_expl} is lines. Each represents a construct that influenced scheduling, where the color indicates which construct.   A line represents two things: a constraint, whose satisfaction made the lower unit ready, and a decision by the runtime to start the lower unit on that core.
1.30 +The other type of visual feature seen in Fig \ref{fig:SCG_expl} is lines. Each represents a construct that influenced scheduling, where the color indicates what kind of construct.   A line represents two things: a constraint, whose satisfaction made the lower unit ready, and a decision by the runtime to start the lower unit on that core.
1.31
1.32  In general, lines may also be drawn that represent other kinds of interactions, which affect core usage. For example,  our runtime implementation only allows one core at a time to access shared  scheduling state. Visualization of this can be turned on, as additional lines linking the gray runtime regions of blocks (visualization of such interactions is turned off in this paper).
1.33
1.34 @@ -436,9 +436,9 @@
1.35
1.36  To complete the discussion, we consider going down another level, into the physical processor.  In modern super-scalar multi-cores, each core is a parallel processor composed of functional blocks such as arithmetic or floating-point blocks, along with logic that issues instructions to those blocks. We first break down a single  SSR work unit of the kind seen in section \ref{sec:casestudy}, into smaller work units, namely individual instructions. Then we break instructions down into pipeline stages.
1.37
1.38 -The UCC for the instructions inside a single SSR level work-unit is a straight line of sequential dependencies. However, the pipeline stages inside a core modify the UCC. They analyse the register use dependencies between instructions and generate a new UCC consisting of  data dependencies among functions. Issue logic acts as the scheduler that generates the order of executing these functions on the functional blocks.  The choices made by the issue logic determine the shape of the SCG for this level, and the critical path of that SCG is the work time of the higher SSR level work-unit.
1.39 +The shape of the UCC for the instructions  inside a single SSR level work-unit is a straight line of sequential dependencies. However, the pipeline stages inside a core modify the UCC. They analyse the register use dependencies between instructions and generate a new UCC consisting of  functions constrained by the data dependencies. The dependencies are enforced by the issue logic, which  goes on to assign ready functions to the functional blocks.  The choices made by the issue logic  determine the shape of the SCG for this level, and the critical path of that SCG is the work time of the higher, SSR level, work-unit.
1.40
1.41 -Notice that the overhead for an instruction is the pipestages spent in fetch, decode, analysis,  issue, and retirement. Most of this is overlapped by the pipeline effect. but  contributes to the critical path during pipeline disruptions like mis-predicted branches. The analysis portion corresponds to divide overhead seen in the higher levels, and issue is runtime overhead at the instruction level.
1.42 +The pipestages spent performing fetch, decode, analysis,  issue, and retirement correspond to functions in a runtime.  Fetch corresponds to creation of the instruction-level work-unit. Decode and analysis logic generate the new UCC, as part of the runtime,  Issue logic  performs  the scheduling function of a runtime. Retirement is also part of enforcing the dependencies, and so is  also part of the runtime's scheduler.  Most of the overhead of these pipestages is overlapped by the pipeline effect. but  does still contribute to the critical path during pipeline disruptions like mis-predicted branches. It is counted as non-overlapped runtime overhead.
1.43
1.44  %To give insight into going from the SSR work-unit level up to the application-level, let us now go from the instruction level up to the SSR unit level, looking up' so to speak. Considering overhead, from inside the core,  the SSR level overhead looks just like more instructions.
1.45
1.46 @@ -454,14 +454,14 @@
1.47
1.48  %The last question is how to handle communication consequences, which result from decisions made in all levels? The decisions in higher-level runtimes set the context for decisions in lower-level ones, which links a higher-level choice  to the consequences resulting from the lower-level choices. But the value of a consequence graph comes from linking the size of boxes in it to the decisions made by the scheduler. It's not clear how to divide-up  the time  cores spend waiting for non-overlapped communication, to assign portions to different levels. We have no good answer at the moment and leave it for future work.
1.49
1.50 -In practice, the question of levels does not come up very often, as there is generally only one level of interest being investigated. However, the perspective is occasionally important, e.g. when identifying the different components of the model in a concrete runtime so as to insert instrumentation code in the right places.
1.51 +In practice, the question of levels does not come up very often for multi-core machines, as there is generally only one level of interest being investigated. However, the perspective is occasionally important here, e.g. when identifying the different components of the model in a concrete runtime so as to insert instrumentation code in the right places.
1.52
1.53
1.54  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1.55  \section{Implementation}
1.56  \label{sec:Implementation}
1.57
1.58 -Now that the usage and theory are in hand, we give the details of implementation. We attempt a bottom-up view, to provide an alternate path to understanding the model and visualizations,  as well as providing a path to derive the benefits directly from the details.
1.59 +Now that the usage and theory are in hand, we give the details of implementation. We attempt a bottom-up view, to provide an alternate path to understanding the model and visualizations,  as well to provide a path to derive the benefits directly from the details.
1.60
1.61  This section will show the level of effort needed to implement our approach for a new language. In short, this involves inserting collection points into the runtime, then modifying the post-processing that produces the visualization.
1.62
1.63 @@ -469,40 +469,40 @@
1.64  points inside the runtime that correspond to points in the model. It is only in these spots that instrumentation gets inserted into runtime code.
1.65
1.66  One benefit seen directly from these details is that many things remain the same across languages, because the computation model is invariant across languages.  This includes the basic nature of the visualizations and details of the implementation process.
1.67 -The reason is that visualizations are based on units, constraints, and causality, all of which exist in every language (even though some older ones obfuscate units).
1.68 +The reason is that visualizations are based on units, constraints, and causality, all of which exist in every language (even though some older ones obfuscate units and constraints on them).
1.69
1.70 -Another benefit evident from the details in this section is that the instrumentation is done only once, for a language. All applications written in the language inherit the visualizations, without  change to the application code.
1.71 +Another benefit evident from the details in this section is that the instrumentation is implemented only once, for a language. All applications written in the language inherit the visualizations, without  change to the application code.
1.72
1.73  \subsection{Meta-units and unit life-line in the computation model}
1.74  \label{subsec:LifeLine}
1.75  In preparation for mapping the model onto implementation details, we define a meta-unit and unit life-line. These form the basis for deciding points in the runtime  where data is collected.
1.76
1.77 -Every unit has a meta-unit that represents it in the runtime. A  unit is defined as the trace of application code that exists between two scheduling decisions. Looking at this in more detail, every runtime has some form of internal bookkeeping state for a unit, used to track constraints on it and make decisions about when and where to execute. This exists even if that state is just a pointer to a function that sits in a queue. We call this bookkeeping state for a unit the meta-unit.
1.78 +Every unit has a meta-unit that represents it in the runtime. A  unit is defined as the work that exists after leaving the runtime, up until re-entering it. For example, the trace of instructions on a core, from the point of leaving the runtime up until the next invocation. Looking at this in more detail, every runtime has some form of internal bookkeeping state for a unit, used to track constraints on it and make decisions about when and where to execute. This exists even if that state is just a pointer to a function that sits in a queue. We call this bookkeeping state for a unit the meta-unit.
1.79
1.80  Each  unit also has a life-line, which progresses so:  creation of the meta-unit \pointer , state updates that affect constraints on the unit \pointer,   the decision is made to animate the unit  \pointer, movement of the meta-unit plus data to physical resources that do the animation \pointer  , animation of the unit, which does the work \pointer,  communication of state-update, that unit has completed, and hardware is free \pointer ,  constraint updates within runtime, possibly causing new meta-unit creations or freeing other meta-units to be chosen for animation.  This repeats for each unit. Each step is part of the model.
1.81
1.82 - Note a few implications: first, many activities internal to the runtime are part of a unit's life-line, and take place when only +the meta-unit exists, before or after the work of the actual unit; second, communication that is internal to the runtime is part of the unit life-line, such as state updates; third, creation may be implied, such as in pthreads, or triggered such as in dataflow, or be by explicit command such as in StarSs, and once created, a meta-unit may languish before the unit it represents is free to be animated.
1.83 + Note a few implications: first, many activities internal to the runtime are part of a unit's life-line, and take place when only the meta-unit exists, before or after the work of the actual unit; second, communication that is internal to the runtime is part of the unit life-line, such as state updates; third, creation may be implied, such as in pthreads, or triggered such as in dataflow, or be by explicit command such as in StarSs. Once created, a meta-unit may languish before the unit it represents is free to be animated.
1.84
1.85 -Also, note that this explains why the visualizations remain largely the same across languages. The concepts of a meta-unit, a unit, constraints on a unit, and a unit life-line are all valid in every language.  The visualizations are based on these concepts, and so likewise largely remain the same.  In the UCC, only the constraint patterns that represent  the language's constructs change between languages. In the SCG, only whh construct a line in the SCG represents changes.
1.86 +This explains why the visualizations remain largely the same across languages. The concepts of a meta-unit, a unit, constraints on a unit, and a unit life-line are all valid in every language.  The visualizations are based on these concepts, and so likewise largely remain the same.  In the UCC, only the constraint patterns that represent  the language's constructs change between languages. In the SCG, only which construct a line in the SCG represents changes.
1.87
1.88  \subsection{Mapping model onto implementation details in runtime}
1.89
1.90 -The meta-unit and unit life-line aspects of the computation model map straight-forwardly to the UCC visualization. The constraints in the UCC are those stated in or implied by the application (with the complexities noted in Section \ref{sec:theory}).
1.91 +The model's concepts of meta-unit and unit life-line  map directly to the UCC visualization. The constraints in the UCC visualization are those stated in or implied by the application (with the complexities about UCC modifications and levels noted in Section \ref{sec:theory}).
1.92
1.93 -However, the SCG is not a strict expression of the model, rather it's purpose is practical. It shows usage of the cores, and relates that to the quantities in the model. Hence, the measurements for the SCG all are boundaries, where the core's time switches from one category in the model to a different.
1.94 +However, the SCG is not a strict expression of the model, rather it's purpose is practical. It shows usage of cores, and relates that to the quantities in the model. Hence, the measurements for the SCG all are boundaries, where the core's time switches from one category in the model to a different.
1.95
1.96 -This differs from the model in subtle ways. Most notably, the model declares segments of time where communications take place, while the SCG doesn't measure the communication time directly, rather it captures idleness of the core caused by the non-overlapped portion of that communication.
1.97 +This differs from the model in subtle ways. Most notably, the model declares segments of time where communications take place, while the SCG doesn't measure the communication time directly, rather it captures idleness of the core caused by the non-overlapped portion of that communication. Also, when calculating the critical path, the SCG only counts non-overlapped portions of runtime activity.
1.98
1.99  This difference stems from the SCG's focus on core usage, and assigning each idle period to a cause. The runtime's choice of units to cores is what determined the source and destination of communications, which caused the idling. Hence, idle periods due to non-overlapped communication  are consequences of the  assignment choices made by the scheduler.  This supports the name: scheduling consequence graph.
1.100
1.101
1.102 -What must be collected during the run differs between the two types of visualization. For the UCC it is unit boundaries and the constraints related to each unit. For the SCG, the same units must  be collected, but also the time a core spends on each segment of the unit's life-line.  Also, implementation details of the runtime will cause things such as idling the core during lock acquisition to be counted towards a unit's life segment. What core activities go to which life segments changes from runtime to runtime. For example, our implementation includes idle time due to acquiring the lock on shared runtime state as part of the state-update life-line step.
1.103 +What must be collected during the run differs between the two types of visualization. For the UCC it is unit boundaries and the constraints related to each unit. For the SCG, the same units must  be collected, but also the time a core spends on each segment of the unit's life-line.  Also, implementation details of the runtime will cause things such as idling the core during lock acquisition to be counted towards a unit's life segment. What core activities go to which life segments changes from runtime to runtime. For example, our implementation includes idle time due to acquiring the lock on shared runtime state as part of the  life-line step of state update.
1.104
1.105 -The SCG represents each  cause of a transition from one segment of core usage to another as an arc between boxes. Such a causation is always a causal dependency of some kind, because the SCG only represents physical events, even if it corresponds to a complex construct in the application. These causations are collected and  tied to one of: construct constraint, runtime internal constraint (such as must acquire lock), or hardware constraint (such as only one activity at a time on a core). In this paper, all are collected, but the only causations visualized are constructs  that cross cores, with propendent on one core and its dependent on another.
1.106 +The SCG represents each  cause of a transition from one segment of core usage to another as an arc between regions or boxes. Such a causation is always a causal dependency of some kind, because the SCG only represents physical events, even if it corresponds to a complex construct in the application. These causations are collected and  tied to one of: construct constraint, runtime internal constraint (such as must acquire lock), or hardware constraint (such as only one activity at a time on a core). In this paper, all are collected, but the only causations visualized are constructs  that cross cores, with propendent on one core and its dependent on another.
1.107
1.108  \subsection{Instrumenting our implementation of SSR on top of VMS}
1.109
1.110 -We instrumented a version of SSR implemented on top of a proto-runtime system called VMS.  This proto-runtime embodies most of a runtime implementation, but has replaced two key portions with interfaces. Those portions are the handling of language construct-constraints and the decision of which core to assign a unit to. To implement a language, one simply supplies those two portions of code, via the interface.
1.111 +We instrumented a version of SSR implemented on top of a proto-runtime system called VMS \cite{VMS_LCPC_11}.  This proto-runtime embodies most of a runtime implementation, but has replaced two key portions with interfaces. Those portions are the handling of language construct-constraints and the decision of which core to assign a unit to. To implement a language, one simply supplies those two portions of code, via the interface.
1.112
1.113  VMS also has the advantage for our approach of being written in accordance with the computation model, which makes instrumenting it especially convenient. Each language construct has its own handler into which to insert measurement code, and transitions in unit life-lines also have convenient locations in VMS to insert instrumentation code.
1.114
1.115 @@ -515,21 +515,21 @@
1.116  SSR has both deterministic constraints, which specify the source and destination VP, such as send\_from\_to, and non-de\-ter\-mi\-nis\-tic ones, in which the runtime is what chooses which VPs interact, such as send\_of\_type\_to\ and singleton. Deterministic ones display the same in the UCC and the SCG. However, non-deterministic ones need all possibilities to be determined for the UCC, requiring extra instrumentation code.
1.117
1.118  \subsubsection{Collecting a unit}
1.119 -Code to record a new unit is inserted into VMS at the transition out of the runtime and into application code. Code to record the unit end is inserted into the VMS primitive that switches back to the runtime.
1.120 +Code to record a new unit is inserted into VMS at the transition out of the runtime and into application code. Code to record the unit end is inserted into the VMS primitive that performs switching back to the runtime.
1.121
1.122  \subsubsection{Collecting the constraints} In VMS, each language construct has its own handler. Code is inserted into each handler, to record which unit invoked the construct, and any units freed by it. The SCG links the unit that made a construct call to the units freed by that call.
1.123
1.124  What information needs to be collected for SCG and UCC and how it is done depends on the construct:
1.125  \begin{itemize}
1.126  \item create\_VP:  We place code into the create\_VP\ handler, which records  the calling VP\ + unit, along with the newly created unit, and the VP it is assigned to. Both the SCG and UCC draw arcs between creating unit and created.
1.127 -\item send\_from\_to and receive\_from\_to: \textbf{} Code is placed into both handlers at the point that checks if both the rendez-vous requests are present. When true, it records both the unit+VPs that connected. The UCC  and SCG both represent this by two crossing dependencies.
1.128 -\item Send\_to\_of\_type and receive\_to\_of\_type:   The same code is inserted to record both the unit+VPs that connected. This is enough for the SCG.  But for the UCC, we want to capture all sending and receiving permutations available, so we add code that collects the group of senders and the corresponding group of receivers.
1.129 -\item Singleton:   The singleton unit has a group of predecessor units and a group of successor units. The first predecessor to complete enables the singleton unit, while all successors must wait for its completion. We insert code into the handler, which records the predecessor that enabled the singleton. This is all that the SCG needs. For the UCC, we add code inside the singleton call that collects the calling unit, adding it to the predecessor group, and the unit it jumps to, adding that to the successor group.
1.130 +\item send\_from\_to and receive\_from\_to: \textbf{} Code is placed into both handlers at the point that checks if both the rendez-vous requests are present. When true, it records the two units + VPs that connected. The UCC  and SCG both represent this by two crossing dependencies.
1.131 +\item Send\_to\_of\_type and receive\_to\_of\_type:   The same code is inserted to record  the  unit+VPs that connected. This is enough for the SCG.  But for the UCC, we want to capture all sending and receiving permutations available, so we add code that collects the group of senders and the corresponding group of receivers.
1.132 +\item Singleton:   The singleton unit has a group of predecessor units and a group of successor units. The first predecessor to complete enables the singleton unit, while all successors must wait for its completion. We insert code into the handler, which records the predecessor that enabled the singleton unit. This is all that the SCG needs. For the UCC, we add code inside the singleton call that collects the calling unit, adding it to the predecessor group, and the unit it jumps to, adding that to the successor group.
1.133
1.134  \end{itemize}
1.135
1.136  \subsubsection{Recording time, instructions, and cache misses }
1.137 - Just recording the units and connections between them is not enough. Because the SCG represents core usage, it also needs  the cycles spent on each activity, including internal runtime activities. The size of each interval of core usage is recorded and assigned to a  segment of a particular unit's life-line.
1.138 + Just recording the units and connections between them is not enough for the SCG. Because it represents core usage, the SCG also needs  the cycles spent on each activity to be collected, including internal runtime activities. The size of each interval of core usage is recorded and assigned to a  segment of some  unit's life-line.
1.139
1.140  The UCC also makes use  of the number of instructions in a unit, as an estimate of size of work in the unit, as illustrated by Fig \ref{fig:UCC_expl}. Without knowing the relative size of the units, it is hard to estimate the amount of parallelism \emph{usefully} available in the application.
1.141
1.142 @@ -1263,3 +1263,4 @@
1.143
1.144
1.145
1.146 + trying to pinpoint the causes of performance loss,  internal structure of
1.147 \ No newline at end of file
`