### changeset 105:f04cf8642d6f

Submitted version of Perf Tuning paper for ISPASS
author Sean Halle Thu, 19 Sep 2013 15:32:31 -0700 a42086520679 8cb89eed66b4 0__Papers/Holistic_Model/Perf_Tune__long_version_for_TACO/latex/Holistic_Perf_Tuning.pdf 0__Papers/Holistic_Model/Perf_Tune__long_version_for_TACO/latex/Holistic_Perf_Tuning.tex 2 files changed, 512 insertions(+), 161 deletions(-) [+]
line diff
     1.1 Binary file 0__Papers/Holistic_Model/Perf_Tune__long_version_for_TACO/latex/Holistic_Perf_Tuning.pdf has changed

     2.1 --- a/0__Papers/Holistic_Model/Perf_Tune__long_version_for_TACO/latex/Holistic_Perf_Tuning.tex	Tue Sep 17 06:40:43 2013 -0700
2.2 +++ b/0__Papers/Holistic_Model/Perf_Tune__long_version_for_TACO/latex/Holistic_Perf_Tuning.tex	Thu Sep 19 15:32:31 2013 -0700
2.3 @@ -57,7 +57,7 @@
2.4  \begin{document}
2.5
2.6  %MOIRAI: MOdel for Integrated Runtime Analysis through Instrumentation
2.7 -\title{Augmenting Performance Tuning\\ via Visualization of Cause-and-Effect Chains}
2.8 +\title{Augmenting Performance Tuning\\ with Visualization of Cause-and-Effect Chains}
2.9
2.10  %\authorinfo{Nina Engelhardt}
2.11  %           {TU Berlin}
2.12 @@ -92,7 +92,7 @@
2.13
2.14  %The level of complexity is reflected by the fact that finding the optimal set of scheduling choices is known to be an  NP hard problem [].
2.15
2.16 -As an example of such a chain, a unit of work completes, which sends a signal to the runtime to update the runtime's internal state of the unit and the state of the hardware resource it occupied. This in turn causes the runtime to choose a different unit to own that hardware and sends the meta-information for that unit to the hardware. This in turn triggers communication, to move  the data consumed by the unit  to the hardware. Then the work of the new unit takes place there.
2.17 +As an example of such a chain, a unit of work completes, which sends a signal to the runtime to update the runtime's internal state of the unit and the state of the hardware resource it occupied. This in turn causes the runtime to choose a different unit to own that hardware and send the meta-information for that unit to the hardware. This in turn triggers communication, to move  the data consumed by the unit  to the hardware. Then the work of the new unit takes place there.
2.18
2.19  Any one of these interactions could be individually abnormal, and an unexpected source of performance loss. In particular, there may be congestion in the network that causes the data movement to be especially slow for that new unit, or the runtime may experience contention for access to its internal state (i.e., contention for an internal lock).
2.20
2.21 @@ -121,7 +121,7 @@
2.22
2.23  %The contribution of this paper to be the way the parallel model has been applied, rather than any aspects of the model itself.  The distinction is important because the space is used to convey the value gained by the  process of applying the model. Only enough of the model is stated, in \S \ref{sec:theory} and \S \ref{sec:Implementation}, to understand  where the value comes from and how to  instrument a runtime to gain it.
2.24
2.25 -This paper is organized as follows:  we introduce the basics of the theory behind the tool in \S\ref{sec:basics}, then give the features of our visualization and their meaning in \S\ref{sec:visualization}. We illustrate their use by walking through a tuning session in \S\ref{sec:casestudy}. To prepare for explaining how to implement the approach, we expand on the theory behind it in \S\ref{sec:theory}. \S\ref{sec:Implementation} then gives the details of how to instrument a runtime to collect the data dictated by the theory. We briefly relate the approach to other tuning tools in \S\ref{sec:related} and draw conclusions in \S\ref{sec:conclusion}.
2.26 +This paper is organized as follows:  we introduce the basics of the theory behind the tool in \S\ref{sec:basics}, then give the features of our visualization and their meaning in \S\ref{subsec:visualization_def}. We illustrate their use by walking through a tuning session in \S\ref{sec:casestudy}. To prepare for explaining how to implement the approach, we expand on the theory behind it in \S\ref{sec:theory}. \S\ref{sec:Implementation} then gives the details of how to instrument a runtime to collect the data dictated by the theory. We briefly relate the approach to other tuning tools in \S\ref{sec:related} and draw conclusions in \S\ref{sec:conclusion}.
2.27
2.28
2.29  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2.30 @@ -482,46 +482,18 @@
2.31
2.32
2.33
2.34 -In this paper, we do not suggest how to represent abstract UCCs.  One actually indicates a multi-verse of concrete-UCCs. Which of them materializes  depends on the data and what the scheduler decides. We only represent the concrete UCC that materializes during a run and leave the question of representing abstract ones to future work.
2.35 -
2.36 -\subsection{Applying UCC and SCG to hierarchical hardware}
2.37 -%Whereas the UCC has value in understanding the structure of the application,  the second part of the model shows how that structure  was made use of during a run. We called it the Scheduling Consequence Graph because it links scheduling decisions to their performance consequences. It does so by indicating distinct activities in the fundamental logical structures, and linking them causally to each other. As a result it provides a link from each instance of lost performance, back through each element that contributed to causing the loss, as illustrated in Section \ref{subsec:holes}.
2.38 -
2.39 -
2.40 -%A consequence graph accounts for each bit of core time, arranged into boxes and arcs, with boxes further divided into regions. A box  represents all core time assigned to one work unit. Each region inside represents a segment of time that the core was engaged in a specific type of activity.  An arc links  regions (or boxes)  and represents a causality of some kind.
2.41 -
2.42 -%There is one kind of region for each kind of activity that can source or sink a causal link, and a region may source or sink multiple causalities. Each type of causal link is represented by a corresponding type of arc.
2.43 -
2.44 -%The core activities/\ region-types are: doing application work, waiting idly for communication of work data, managing constraints, deciding assignment of work onto cores, and doing internal runtime activities. The runtime region has sub-activities, but we do not detail them here due to space. However, some will be stated in Section \ref{sec:Implementation} when we talk about instrumenting a runtime.
2.45 -
2.46 -%Arcs are gathered into groups according to the nature of the causality they represent.  The kinds of causal links are: satisfaction of a sequential dependency in the base language; satisfaction of a parallel constraint  (i.e., one unit did something to satisfy a constraint on the other, causing it to be free to be scheduled); a causal link internal to the runtime (for example, the runtime on one core releasing a shared lock, causing the other core to acquire it); and causal links in the hardware (for example, one work-unit finishes on a core, causing another work-unit to start there, modulo a choice by the runtime).
2.47 -
2.48 -%We will now expand on each of those kinds of causal link.
2.49 -
2.50 -%\paragraph{Constraint  causal link} Two entire boxes (units) are linked this way when   action by one unit contributes to satisfaction of a constraint blocking the other unit. This includes sequential dependencies from the base language (which are noted in the tool but normally not displayed).
2.51 -
2.52 -%Sequential dependencies may add superfluous constraints that  eliminate some otherwise allowed choices in the UCC. An example would be a sequential \texttt{for} loop that creates work-units -- no parallelism constructs cause the creations to be done in sequence, but the base C language sequentializes it nonetheless.
2.53 -
2.54 -%\paragraph{Runtime internal causal link} Runtime implementation details may introduce extra" causalities between units. For example, the runtime we instrumented for this paper runs separately on each core and relies upon a global lock for accessing shared runtime information. This lock introduces a causal relationship  when the runtime on one core is attempting to process one unit, but must wait for the runtime on a different core to finish with its unit.
2.55 -
2.56 -% Normally, these are not displayed explicitly, due to clutter, but can be turned on when needed, for instance to determine the cause of a particular pattern of core usage.
2.57 -
2.58 -%\paragraph{Hardware causal link} The physical fact that a given resource can only be used by one work-unit at a time introduces hardware causalities. When multiple units are free to execute, but all cores are busy, then completion of a unit  on one core causes (in part) the next ready unit to run on that core.
2.59 -
2.60 -%These are also not normally displayed, due to clutter, and not all hardware dependencies are directly measured. Future work will focus on using the performance counters and other instrumentation to add more information about communication paths taken as a consequence of the scheduling decisions made. It will start with the current linkage of application-code to runtime decisions, and add consequent usage of communication hardware. This gives an end-to-end linkage between runtime choices and caused behavior on the hardware.
2.61 -
2.62 -
2.63 -%\subsection{Breadth of Applicability of UCC and SCG}
2.64 -
2.65 -The question arises of how to apply the UCC and SCG to heterogeneous machines that have multiple levels of runtime, and thus multiple levels of scheduler. To answer that, let's explore a specific machine. In this case, a cluster in one rack. The rack has a back-plane that boards plug into, while a board has its own memory and four sockets, each socket holds a chip with four cores. This system has corresponding physically separate networks:  a back-plane network connecting the boards; an inter-socket network on each board; and a  cache hierarchy connecting cores within a socket.
2.66 -
2.67 -A runtime is provided for each network, to distribute work among the networked components. The top runtime divides work among the boards, then a board's runtime divides work among the sockets, and a socket's  runtime divides work among the cores.
2.68 -
2.69 -To a given runtime, each runtime below it is treated as an indivisible  machine. So the backplane runtime treats each board as a complete machine, creates work for each board, and schedules it. To send work to a board, it  communicates with the runtime that manages that board, but without knowing the internal details of that runtime.  It just sees the runtime as an opaque ``machine'' that it gives work to. Hence the top runtime schedules onto board-level runtimes, which each schedule onto socket-level runtimes, and so on. This multi-level nature should be reflected in the UCC and SCG.
2.70 -
2.71 -To facilitate this approach, the application  has been written in a language that provides a construct for work-unit division. This construct provides a manipulator to the runtime, giving the runtime the ability to dynamically  divide up the work it is given into smaller work-units. Examples include the DKU pattern \cite{DKUTechRep_09}, TBB's loop constructs \cite{LangTBBRef}, HPF's matrix division facilities \cite{LangHPF}, and so on. Note that with such a manipulator, the UCC is maximally abstract on the unit axis.
2.72 -
2.73 -\subsubsection{Levels of UCC}So what does the \textit{concrete} UCC produced during a run on this machine look like? Well,  scheduling decisions bound a work-unit, and the decisions take place inside a runtime. Hence the  units are defined relative to a runtime. Each runtime, then has its own concrete UCC for the units it schedules.
2.74 +In this paper, we do not suggest how to represent abstract UCCs.  An abstract UCC actually indicates a multi-verse of concrete UCCs; which of them materializes depends on the parameters, the data, and what the scheduler decides. We only represent the concrete UCC that materializes during a run and leave the question of representing abstract ones to future work.
2.75 +
2.76 +\subsection{Hierarchical  UCC and SCG}
2.77 +
2.78 +So far, we have treated computation systems as having only one runtime system, and the hardware as having only one level of communication network.  In reality, most machines run an OS, which is viewed within this theory as a runtime system. Also, modern desktop and server hardware has multiple levels of communication hierarchy within the memory system, not to mention the explicit networks within clusters of servers and high-performance hardware.
2.79 +
2.80 +The theory can likewise be applied hierarchically, by embedding a UCC and SCG inside a unit of a parent UCC/SCG. Naturally, the theory is most useful when the levels of UCC and SCG closely match the levels in the runtime system and communication in the hardware.
2.81 +
2.82 +For example, the theory views the execution of an application within an OS as one work-unit. Inside that unit is an entire UCC and SCG that are generated during the run. If we consider an application that runs on a high-performance machine, then many of the units created during the run may in turn be split into smaller units that are distributed among the cores.
2.83 +
2.84 +
2.85 +\subsubsection{Levels of UCC}Scheduling decisions form the boundary of a work-unit, and those scheduling decisions take place inside a runtime. Hence work units are defined relative to a runtime. Each runtime, then, has its own concrete UCC for the units it schedules.
2.86  \begin{figure}[ht]
2.87    \centering
2.88    \includegraphics[width = 3.2in, height = 1.8in]{../figures/UCC_Levels_2.pdf}
2.89 @@ -529,154 +501,155 @@
2.90    \label{fig:UCC_Levels}
2.91  \end{figure}
2.92
2.93 -As seen in Fig. \ref{fig:UCC_Levels}, this means there are multiple levels of  concrete UCC, which  are related to each other in the same hierarchy as the runtimes. A single unit that is scheduled by one level of runtime has an entire concrete UCC in the level below. That unit is broken into sub-units, each of those is separately scheduled within the lower-level runtime. Hence, there's a whole lower-level UCC inside that unit, as the figure shows. The UCC hierarchy matches the  runtime hierarchy, which matches the hardware hierarchy.
2.94 +As seen in Fig. \ref{fig:UCC_Levels}, this means there are multiple levels of  concrete UCC, which  are related to each other in the same hierarchy as the runtime systems. A single unit that is scheduled by one level of runtime has an entire concrete UCC inside of it, on the runtime level below.  The UCC hierarchy matches the  runtime hierarchy, which ideally matches the hardware hierarchy.
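The nesting of concrete UCCs inside units can be sketched in a few lines of Python. This is a minimal illustration of the levels idea, under our own invented names (`ConcreteUCC`, `Unit`, the board/socket/core labels); it is not part of the paper's tool.

```python
# Sketch: a concrete UCC at one runtime level, where each unit may contain
# the entire concrete UCC of the runtime level below it (board -> socket -> core).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ConcreteUCC:
    units: list = field(default_factory=list)
    deps: list = field(default_factory=list)   # (pred, succ) pairs between units

@dataclass
class Unit:
    name: str
    inner_ucc: Optional[ConcreteUCC] = None    # UCC of the runtime level below

core_ucc = ConcreteUCC(units=[Unit("c0"), Unit("c1")], deps=[("c0", "c1")])
socket_ucc = ConcreteUCC(units=[Unit("s0", inner_ucc=core_ucc)])
board_unit = Unit("b0", inner_ucc=socket_ucc)

def depth(unit):
    """Number of UCC levels nested inside a unit."""
    if unit.inner_ucc is None:
        return 0
    return 1 + max(depth(u) for u in unit.inner_ucc.units)

print(depth(board_unit))  # two UCC levels nested below the board-level unit
```

The hierarchy of `inner_ucc` links mirrors the runtime hierarchy, which is the point of the figure: each scheduled unit is opaque to its own level but expands into a full concrete UCC one level down.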
2.95
2.96  \subsubsection{Levels of SCG}
2.97 -Each concrete UCC has a corresponding SCG. In the  SCG definition given above,  each column represents time on a core. But in a higher-level SCG, a column represents a runtime instead of a core. The consequence thus shifts to that of scheduling a given work unit onto  a particular runtime. Now, if the tuning goal is to  improve performance, then the consequence of interest is the time taken until the lower runtime completes the higher level work unit.
2.98 -
2.99 -The completion is when all the lower-level units are completed, plus any additional work spent dividing up the work before and combining results after. Once the lower-level consequence graph is established, this  time can be easily determined: it is the critical path across the lower-level consequence graph plus the additional, non-overlapped, time spent creating the lower-level units and combining results into the format of the higher-level unit. The divide and recombine are not part of the  lower level UCC/SCG, because the divide creates  the lower concrete UCC.
2.100 -
2.101 -\subsubsection{Applying the multi-level UCC and SCG}
2.102 -
2.103 -Let us apply this multi-level definition  to  the matrix multiply  from Section \ref{sec:casestudy}. We start at a level above the process that was created to execute the application. Here, a unit is the execution of an entire application, invoked via OS commands. The UCC at that level  treats each application execution as a unit. The SCG at that level treats the runtime" region as the time spent by the OS creating a new process and running the loader and linker. The work time of a unit is the execution time of the application, from the first instruction until it exits.
2.104 -
2.105 -The application then contains code that breaks up the work, creates virtual processors to perform the work, and code that collects the results.  Breaking up the work is the calculations that determine the dimensions of each sub-matrix, and pairs up the sub-matrices. Each pair makes one work-unit, and has a virtual processor created to perform the multiply. These work-units comprise the second-level UCC, which is the UCC shown in \S\ref{sec:UCC}. The corresponding second-level SCGs are the ones shown during the walk-through.
2.106 -
2.107 -
2.108 -
2.109 -%The divide and recombine, in this case, are themselves composed of smaller work-units.  However, they are not visible to units in the work portion, although there are causalities between them. The divide units end with creation of work-level units, while the end of work-level units causes creation of recombine units.
2.110 -
2.111 -%Hence, divide and recombine are inside of the higher-level unit while not part of the lower-level SCG. They can be conveniently displayed in a single visualization because of the clean connection between levels provided by the create and end causalities.
2.112 -
2.113 -%To complete the discussion, we consider going down another level, into the physical processor.  In modern super-scalar multi-cores, each core is a parallel processor composed of functional blocks such as arithmetic or floating-point blocks, along with logic that issues instructions to those blocks. We first break down a single SSR work unit of the kind seen in Section \ref{sec:casestudy} into smaller work units, namely individual instructions. Then we break instructions down into pipeline stages.
2.114 -
2.115 -%The shape of the UCC for the instructions inside a single SSR level work-unit is a straight line of sequential dependencies. However, the pipeline stages inside a core modify the UCC. They analyse the register use dependencies between instructions and generate a new UCC consisting of  functions constrained by the data dependencies. The dependencies are enforced by the issue logic, which  goes on to assign ready functions to the functional blocks.  The choices made by the issue logic  determine the shape of the SCG for this level, and the critical path of that SCG is the work time of the higher, SSR level, work-unit.
2.116 -
2.117 -%The stages spent performing fetch, decode, analysis, issue, and retirement correspond to functions in a runtime. Fetch corresponds to creation of the instruction-level work-unit. Decode and analysis logic generate the new UCC, as part of the runtime. Issue logic performs the scheduling function of a runtime. Retirement is also part of enforcing the dependencies, and so is also part of the runtime's scheduler. Most of the overhead of these stages is overlapped by the pipeline effect, but can contribute to the critical path during pipeline disruptions like mis-predicted branches. It is counted as non-overlapped runtime overhead.
2.118 -
2.119 -%To give insight into going from the SSR work-unit level up to the application-level, let us now go from the instruction level up to the SSR unit level, looking up' so to speak. Considering overhead, from inside the core,  the SSR level overhead looks just like more instructions.
2.120 -
2.121 -%Nina: we should talk this part over.. (I should have left more of your wording in place in comments..  I like this way of co-editing.. : )
2.122 -
2.123 -%One level up, the same holds true: The overhead of the unique application-unit would be the time spent in the OS, doing things such as setting up the application's memory space. This time is not visible from inside the application. However, it uses the same processing units (in this case, cores), and to them appears not very different from application work. The noticeable difference, in this case, is that the OS runtime is written using a different programming model instead of SSR.
2.124 -
2.125 -
2.126 -%Consider, in the matrix multiply code, the core usage spent  while  dividing the work and handing it to other cores. This is not work of the application, but overhead spent breaking the single application-unit into multiple sub-units.   Even though it is in the application code, it's purpose is implementing the execution model, which makes it runtime overhead. But which runtime level? It's not part of the SSR language runtime, so not overhead of a unit inside the application,   but rather it's for the application itself, as a unit! So, the core time spent calculating the division  gets counted towards the application-level unit, while the time spent inside the SSR runtime creating the meta-units is counted towards those lower SSR-level units. But both are in the critical path, so both charged as work time of the higher-level unit.
2.127 -
2.128 -
2.129 -% Another way to view this is that  by the process of elimination, the only core-time not accounted for elsewhere is the time spent dividing up a unit into smaller ones, and time spent accumulating the individual results back together. So this is what gets charged to the  higher-level unit.
2.130 -
2.131 -%The last question is how to handle communication consequences, which result from decisions made in all levels? The decisions in higher-level runtimes set the context for decisions in lower-level ones, which links a higher-level choice  to the consequences resulting from the lower-level choices. But the value of a consequence graph comes from linking the size of boxes in it to the decisions made by the scheduler. It's not clear how to divide-up  the time  cores spend waiting for non-overlapped communication, to assign portions to different levels. We have no good answer at the moment and leave it for future work.
2.132 -
2.133 -In practice, the perspective of levels should become more relevant as machine complexity increases. For example, when considering  heterogeneous machines, large servers, and clusters.
2.134 -Viewing the system in this way may lead to insights about runtime design, which can benefit from a hierarchical architecture \cite{HWSupportStanford10}. Incidentally, the multi-level view extends down into the compilation process, and below to the instruction level and the hardware itself.
2.135 -
2.136 -
2.137 +Naturally, a concrete UCC has a corresponding SCG. The definition of the SCG given above counted time on a \textit{core} for each kind of activity. This remains true at the leaf level, but in a higher-level SCG, a column represents a runtime instead of a core. The consequence thus shifts to that of scheduling a given work unit onto a particular runtime. If the goal is to improve performance, then the consequence of interest is the time taken until the lower runtime completes the higher-level unit of work.
2.138 +
2.139 +The regions of time counted within a unit remain the same, being the primitive elements of the theory. Also, the critical path through the SCG must remain equal to the execution time. However, that execution time is counted as the work region of the parent unit; runtime activity on the parent's level is extra. The additional, non-overlapped time spent creating the lower-level units and combining results into the format of the higher-level unit is counted towards the runtime-overhead region. Divide and recombine activity is not part of the lower-level UCC/SCG; rather, it creates the lower concrete UCC.
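This accounting can be made concrete with a small sketch: the parent unit's work region equals the critical path through the lower-level SCG, and the non-overlapped divide/recombine time is charged to the parent's runtime-overhead region. The unit durations, causal edges, and overhead figures below are invented example values, not measurements from the paper.

```python
# Sketch: parent-unit work region = critical path through the lower-level SCG.
# Nodes are lower-level units; edges are causal links in the consequence graph.
from functools import lru_cache

durations = {"A": 4.0, "B": 2.0, "C": 3.0, "D": 1.0}        # per lower-level unit
edges = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}  # causal links

@lru_cache(maxsize=None)
def longest_from(node):
    succs = edges[node]
    return durations[node] + (max(map(longest_from, succs)) if succs else 0.0)

critical_path = max(longest_from(n) for n in durations)  # path A -> C -> D

# Non-overlapped divide and recombine time: charged to the parent's
# runtime-overhead region, not to the lower-level UCC/SCG.
divide_time, recombine_time = 0.5, 0.25
parent_work = critical_path
parent_overhead = divide_time + recombine_time

print(parent_work, parent_overhead)
```

Only the non-overlapped portions count, which is why a simple sum of lower-level unit durations (10.0 here) would overstate the parent's work region.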
2.140 +
2.141 +
2.142
2.143
2.144  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2.145  \section{Implementation}
2.146  \label{sec:Implementation}
2.147
2.148 -Now that the usage and theory are in hand, we give the details of implementation. We attempt a bottom-up view, to provide an alternate path to understanding the model and visualizations,  as well to provide a path to derive the benefits directly from the details.
2.149 -
2.150 -This section will show the level of effort needed to implement our approach for a new language. In short, this involves inserting collection points into the runtime, then modifying the post-processing that produces the visualization.
2.151 -
2.152 -We cast the implementation in terms of the computation model, then identify the
2.153 -points inside the runtime that correspond to points in the model. It is only in these spots that instrumentation gets inserted into runtime code.
2.154 -
2.155 -One benefit seen directly from these details is that many things remain the same across languages, because the computation model is invariant across languages.  This includes the basic nature of the visualizations and details of the implementation process.
2.156 -The reason is that visualizations are based on units, constraints, and causality, all of which exist in every language (even though some older ones obfuscate units and constraints on them).
2.157 -
2.158 -Another benefit evident from the details in this section is that the instrumentation is implemented only once, for a language. All applications written in the language inherit the visualizations, without  change to the application code.
2.159 +Now that the usage and theory are in hand, we give the details of implementation. %We attempt a bottom-up view, to provide an alternate path to understanding the model and visualizations,  as well to provide a path to derive the benefits directly from the details.
2.160 +
2.161 +%This section will show the level of effort needed to implement our approach for a new language. In short, this involves inserting collection points into the runtime, then modifying the post-processing that produces the visualization.
2.162 +
2.163 +Instrumenting a language involves inserting collection points into the runtime,  then modifying the post-processing that produces the visualization. A collection point uses either performance counters or time stamps to measure the boundary  of a region.
2.164 +
2.165 +Each portion of the runtime code is assigned to a particular type of region, and a reading is placed into the runtime code at each transition from code assigned to one region to code assigned to a different one. The reading records the ending region type, the unit the reading is counted against, and any region-specific information.
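A collection point of this kind can be sketched as follows. The function and field names are our own shorthand for illustration; a real runtime would read a hardware performance counter rather than a wall-clock timer, and would buffer readings with far less overhead.

```python
# Sketch: a collection point placed at a region transition in runtime code.
# Each reading records a time stamp, the region type that just ended, the
# unit the reading is counted against, and any region-specific information.
import time

readings = []   # trace buffer, dumped to disk at the end of the run

def collection_point(unit_id, ending_region, **extra):
    """Call where code of one region type hands off to code of another."""
    readings.append({
        "t": time.perf_counter(),   # stand-in for a performance-counter read
        "unit": unit_id,
        "region": ending_region,    # region type that just ended
        **extra,                    # region-specific info, e.g. which core was chosen
    })

# Example: the runtime finishes constraint management for unit 7,
# then finishes choosing a core for it.
collection_point(7, "constraint_mgmt")
collection_point(7, "scheduling_choice", chosen_core=2)
print(len(readings))
```

Because the readings mark region boundaries rather than region bodies, the instrumentation touches only the handful of transition points in the runtime, not the application code.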
2.166 +
2.167 +
2.168 +
2.169 +The readings collected define the units, constraints, and causality, all of which exist in every language (even though some older ones obfuscate units and the constraints on them).
2.170 +Therefore, the basic visualization remains the same across languages.
2.171 +
2.172 +Another benefit is that the instrumentation is implemented only once per language. All applications written in the language inherit the instrumentation, without change to the application code.
2.173
2.174  \subsection{Meta-units and unit life-line in the computation model}
2.175  \label{subsec:LifeLine}
2.176 -In preparation for mapping the model onto implementation details, we define a meta-unit and unit life-line. These form the basis for deciding points in the runtime  where data is collected.
2.177 -
2.178 -Every unit has a meta-unit that represents it in the runtime. A  unit is defined as the work that exists after leaving the runtime, up until re-entering it. For example, the trace of instructions on a core, from the point of leaving the runtime up until the next invocation. Looking at this in more detail, every runtime has some form of internal bookkeeping state for a unit, used to track constraints on it and make decisions about when and where to execute. This exists even if that state is just a pointer to a function that sits in a queue. We call this bookkeeping state for a unit the meta-unit.
2.179 -
2.180 -Each  unit also has a life-line, which progresses so:  creation of the meta-unit~\pointer~state updates that affect constraints on the unit~\pointer~the decision is made to animate the unit~ \pointer~movement of the meta-unit plus data to physical resources that do the animation~\pointer~animation of the unit, which does the work~\pointer~communication of state-update, that unit has completed, and hardware is free~\pointer~constraint updates within runtime, possibly causing new meta-unit creations or freeing other meta-units to be chosen for animation.  This repeats for each unit. Each step is part of the model.
2.181 -
2.182 - Note a few implications: first, many activities internal to the runtime are part of a unit's life-line, and take place when only the meta-unit exists, before or after the work of the actual unit; second, communication that is internal to the runtime is part of the unit life-line, such as state updates; third, creation may be implied, such as in pthreads, or triggered such as in dataflow, or be by explicit command such as in StarSs. Once created, a meta-unit may languish before the unit it represents is free to be animated.
2.183 -
2.184 -This explains why the visualizations remain largely the same across languages. The concepts of a meta-unit, a unit, constraints on a unit, and a unit life-line are all valid in every language.  The visualizations are based on these concepts, and so likewise largely remain the same.  In the UCC, only the constraint patterns that represent  the language's constructs change between languages. In the SCG, only which construct a line in the SCG represents changes.
2.185 -
2.186 -\subsection{Mapping model onto implementation details in runtime}
2.187 -
2.188 -The model's concepts of meta-unit and unit life-line  map directly to the UCC visualization. The constraints in the UCC visualization are those stated in or implied by the application (with the complexities about UCC modifications and levels noted in Section \ref{sec:theory}).
2.189 -
2.190 -However, the SCG is not a strict expression of the model, rather it's purpose is practical. It shows usage of cores, and relates that to the quantities in the model. Hence, the measurements for the SCG all are boundaries of where the core's time switches from one category in the model to a different.
2.191 -
2.192 -This differs from the model in subtle ways. Most notably, the model declares segments of time where communications take place, while the SCG doesn't measure the communication time directly, rather it captures idleness of the core caused by the non-overlapped portion of that communication. Also, when calculating the critical path, the SCG only counts non-overlapped portions of runtime activity.
2.193 -
2.194 -This difference stems from the SCG's focus on core usage, and assigning each idle period to a cause. The runtime's choice of units to cores is what determined the source and destination of communications, which caused the idling. Hence, idle periods due to non-overlapped communication  are consequences of the  assignment choices made by the scheduler.  This supports the name: scheduling consequence graph.
2.195 +In preparation, we define \textit{meta-unit} and unit \textit{life-line}. These help decide where in the runtime code a collection point is inserted.
2.196 +
2.197 +Every unit has a meta-unit that represents it in the runtime. A  unit is defined as the work that exists after leaving the runtime code, up until re-entering it. For example, the trace of instructions on a core, from the point of leaving the runtime up until the next invocation.
2.198 +
2.199 +In fact, every runtime has some form of internal bookkeeping for a unit, which is used to track constraints on it and make decisions about when and where to execute. This exists even if that state is just a pointer to a function that sits in a queue. We call this bookkeeping state for a unit the meta-unit.
2.200 +
2.201 +Each  unit also has a life-line, which is broken into segments. Each segment is one of the regions counted against the unit. These life-line segments are sequential and adjoin each other in the life-line. However, the life-line segments are not sequential and adjoining on a core.  Life-line segments for the same unit often occur on different cores.
2.202 +
2.203 +The segments of a life-line are: 1) work of creating the meta-unit; 2) work of managing constraints, which frees the unit; 3) work of choosing where to animate it; 4) movement of the meta-unit plus data to the physical resources that do the animation; 5) work of animating the unit; 6) movement of the state update when the unit is done; 7) work of constraint management triggered by the state update (if the update doesn't cause a new meta-unit to be freed).
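The seven life-line segments can be illustrated by a small per-unit bookkeeping record (a hypothetical C sketch; the names and layout are illustrative, not taken from the instrumented runtime):

```c
#include <stdint.h>

/* One tag per life-line segment, in the order listed in the text. */
enum lifeline_seg {
    SEG_CREATE_META_UNIT,   /* 1) creating the meta-unit                */
    SEG_MANAGE_CONSTRAINTS, /* 2) constraint management that frees it   */
    SEG_CHOOSE_CORE,        /* 3) deciding where to animate             */
    SEG_MOVE_META_AND_DATA, /* 4) moving meta-unit plus data            */
    SEG_DO_WORK,            /* 5) animating the unit                    */
    SEG_STATE_UPDATE,       /* 6) communicating the state update        */
    SEG_POST_UPDATE_MGMT,   /* 7) constraint mgmt triggered by update   */
    NUM_SEGS
};

/* Per-unit record: cycles charged against each segment of its life-line. */
typedef struct {
    uint64_t unit_id;
    uint64_t seg_cycles[NUM_SEGS];
} unit_record;
```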
2.204 +
2.205 + A few implications to note: first, a unit's life-line includes runtime-internal activities, which take place when only the meta-unit exists, before or after the work of the unit; second, communication that is internal to the runtime, such as state updates, is part of the unit life-line; third, creation of a meta-unit may be implied, as in pthreads, triggered, as in dataflow, or performed by explicit command, as in StarSs. Once created, a meta-unit may languish before the unit it represents is free to be animated.
2.206 +
2.207 +This explains why the visualizations remain largely the same across languages. The concepts of a meta-unit, a unit, constraints on a unit, and a unit life-line are all valid in every language. The only changes in the visualization are that the constraint patterns representing the language's constructs may differ, and, of course, which construct a line of a given color represents.
2.208 +
2.209 +\subsection{Mapping theory onto implementation details in runtime}
2.210 +
2.211 +The task is to pick the points in the runtime code that correspond to points on a unit's life-line. At each point, read the time stamp and instruction count, and record which unit triggered the runtime. Then sort all those readings, so that all readings for a given unit are together, and in order. Finally, take the differences between readings to get the size of each region. Then draw the regions in the SCG.
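The collect-sort-difference pipeline might look like the following (a minimal C sketch with hypothetical names; the real instrumentation reads hardware counters rather than plain integers):

```c
#include <stdint.h>
#include <stdlib.h>

/* One reading: which unit triggered the runtime, and when. */
typedef struct { uint32_t unit_id; uint64_t tstamp; } reading;

/* Order readings by unit, then by time, so each unit's readings group together. */
static int cmp_reading(const void *a, const void *b) {
    const reading *ra = a, *rb = b;
    if (ra->unit_id != rb->unit_id)
        return ra->unit_id < rb->unit_id ? -1 : 1;
    return ra->tstamp < rb->tstamp ? -1 : (ra->tstamp > rb->tstamp);
}

/* Sort, then difference adjacent readings of the same unit to get region
 * sizes.  Returns the number of regions written into sizes[]. */
static size_t region_sizes(reading *rs, size_t n, uint64_t *sizes) {
    size_t m = 0;
    qsort(rs, n, sizeof *rs, cmp_reading);
    for (size_t i = 1; i < n; i++)
        if (rs[i].unit_id == rs[i - 1].unit_id)
            sizes[m++] = rs[i].tstamp - rs[i - 1].tstamp;
    return m;
}
```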
2.212 +
2.213 +The cause-and-effects for language constructs are determined by the semantics of the construct.  But those semantics are implemented inside the runtime, so the location in the code where the cause-and-effect is recorded is clear.  For example, for pthreads, the handler invoked for mutex release records which unit was ended by the release construct.  That unit is passed to the mutex acquire handler, which records the releasing unit as the construct-type cause of the start of the acquiring unit.
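The record-and-pass pattern between the release and acquire handlers can be sketched as follows (hypothetical names; the real handlers would live inside the runtime):

```c
#include <stdint.h>

/* Per-mutex bookkeeping: remembers which unit last released it. */
typedef struct { uint64_t last_releasing_unit; } mutex_meta;

/* A construct-type cause-and-effect link between two units. */
typedef struct { uint64_t cause_unit, effect_unit; } causal_link;

/* Release handler: record which unit the release construct ended. */
static void on_mutex_release(mutex_meta *m, uint64_t releasing_unit) {
    m->last_releasing_unit = releasing_unit;
}

/* Acquire handler: the releasing unit becomes the construct-type cause
 * of the start of the acquiring unit. */
static causal_link on_mutex_acquire(const mutex_meta *m,
                                    uint64_t acquiring_unit) {
    causal_link l = { m->last_releasing_unit, acquiring_unit };
    return l;
}
```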
2.214 +
2.215 +The cause-and-effects for runtime internals are determined by the nature of the runtime implementation.  For example, the runtime on a multi-socket multi-core machine that uses shared constraint state has to protect that state. It does so by using atomic instructions. The time to acquire exclusive access to the state may be quite high in some cases, so two points are added to measure the time taken.  At these points, we record the unit that triggered the constraint management, for which access is being acquired. Likewise, we record the time and unit at the point in the code where access to the shared state is released.  When the time stamps from these recordings are sorted, the unit that released is linked to the unit that acquired in a cause-and-effect relationship.  This is visualized as a link between runtime regions of the two units.
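The two added measurement points, and the sorting-based link from releaser to acquirer, could be sketched like this (cycle-counter values are assumed; all names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* The two measurement points around acquiring the shared state. */
typedef struct { uint64_t unit_id, t_request, t_acquired; } lock_probe;

/* Idle cycles the core spent waiting for exclusive access. */
static uint64_t lock_wait_cycles(const lock_probe *p) {
    return p->t_acquired - p->t_request;
}

/* Link the acquirer to the unit whose release most recently preceded the
 * acquisition.  rel_times[] is assumed sorted by release time stamp. */
static uint64_t releasing_unit_for(const uint64_t rel_units[],
                                   const uint64_t rel_times[], size_t n,
                                   uint64_t t_acquired) {
    uint64_t who = 0;
    for (size_t i = 0; i < n && rel_times[i] <= t_acquired; i++)
        who = rel_units[i];
    return who;
}
```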
2.216 +
2.217 +
2.218 +One effect to note is that some of the time spans within a unit's life-line are communication, which can be overlapped with core time. The SCG, though, only records core time. So the regions in an SCG box that represent communication only display the portion of the life-line segment that was not overlapped.
2.219 +
2.220 +%The model's concepts of meta-unit and unit life-line  map directly to the UCC visualization. The constraints in the UCC visualization are those stated in or implied by the application (with the complexities about UCC modifications and levels noted in Section \ref{sec:theory}).
2.221 +
2.222 +%However, the SCG is not a strict expression of the model, rather it's purpose is practical. It shows usage of cores, and relates that to the quantities in the model. Hence, the measurements for the SCG all are boundaries of where the core's time switches from one category in the model to a different.
2.223 +
2.224 +%This differs from the model in subtle ways. Most notably, the model declares segments of time where communications take place, while the SCG doesn't measure the communication time directly, rather it captures idleness of the core caused by the non-overlapped portion of that communication. Also, when calculating the critical path, the SCG only counts non-overlapped portions of runtime activity.
2.225 +
2.226 +%This difference stems from the SCG's focus on core usage, and assigning each idle period to a cause. The runtime's choice of units to cores is what determined the source and destination of communications, which caused the idling. Hence, idle periods due to non-overlapped communication  are consequences of the  assignment choices made by the scheduler.  This supports the name: scheduling consequence graph.
2.227
2.228
2.229 -What must be collected during the run differs between the two types of visualization. For the UCC it is unit boundaries and the constraints related to each unit. For the SCG, the same units must  be collected, but also the time a core spends on each segment of the unit's life-line.  Also, implementation details of the runtime will cause things such as idling the core during lock acquisition to be counted towards a unit's life segment. What core activities go to which life segments changes from runtime to runtime. For example, our implementation includes idle time due to acquiring the lock on shared runtime state as part of the  life-line step of state update.
2.230 -
2.231 -The SCG represents each  cause of a transition from one segment of core usage to another as an arc between regions or boxes. Such a causation is always a causal dependency of some kind, because the SCG only represents physical events, even if it corresponds to a complex construct in the application. These causations are collected and  tied to one of: construct constraint, runtime internal constraint (such as must acquire lock), or hardware constraint (such as only one activity at a time on a core). In this paper, all are collected, but the only causations visualized are constructs  that cross cores, with propendent on one core and its dependent on another.
2.232 -
2.233 -\subsection{Instrumenting our implementation of SSR on top of VMS}
2.234 -
2.235 -We instrumented a version of SSR implemented on top of a proto-runtime system called VMS \cite{VMS_LCPC_11}.  This proto-runtime embodies most of a runtime implementation, but has replaced two key portions with interfaces. Those portions are the handling of language construct-constraints and the decision of which core to assign a unit to. To implement a language, one simply supplies those two portions of code, via the interface.
2.236 -
2.237 -VMS also has the advantage for our approach of being written in accordance with the computation model, which makes instrumenting it especially convenient. Each language construct has its own handler into which to insert measurement code, and transitions in unit life-lines also have convenient locations in VMS to insert instrumentation code.
2.238 -
2.239 -\subsubsection{SSR background}
2.240 -A distinction important to understanding SSR and other parallel languages is being task-based versus virtual processor (VP) based. Task-based languages include dataflow, CnC, and StarSs.  These tasks don't suspend and resume, but rather execute to completion. Hence, such a task is the same as our definition of unit. They have no state that persists across calls to the runtime. In contrast, a virtual processor does suspend and resume and so has state that persists across runtime calls. Examples include pthreads, OpenMP thread-based constructs, UPC, and so on.
2.241 -
2.242 -SSR is based on virtual processors. They execute sequential code that occasionally calls a parallel construct, which suspends the VP and switches to the runtime. This means that each VP contains a sequence of units, with each unit the trace-segment between two SSR library calls.
2.243 +%What must be collected during the run differs between the two types of visualization. For the UCC it is unit boundaries and the constraints related to each unit. For the SCG, the same units must  be collected, but also the time a core spends on each segment of the unit's life-line.  Also, implementation details of the runtime will cause things such as idling the core during lock acquisition to be counted towards a unit's life segment. What core activities go to which life segments changes from runtime to runtime. For example, our implementation includes idle time due to acquiring the lock on shared runtime state as part of the  life-line step of state update.
2.244 +
2.245 +%The SCG represents each  cause of a transition from one segment of core usage to another as an arc between regions or boxes. Such a causation is always a causal dependency of some kind, because the SCG only represents physical events, even if it corresponds to a complex construct in the application. These causations are collected and  tied to one of: construct constraint, runtime internal constraint (such as must acquire lock), or hardware constraint (such as only one activity at a time on a core). In this paper, all are collected, but the only causations visualized are constructs  that cross cores, with propendent on one core and its dependent on another.
2.246 +
2.247 +\subsection{Instrumenting our implementation of SSR}
2.248 +
2.249 +We instrumented a version of SSR that was implemented using the proto-runtime toolkit \cite{VMS_LCPC_11}.  A proto-runtime embodies most of a runtime implementation, but replaces two key portions with interfaces. Those portions are the handling of language constructs and the decision of which core to assign a unit to. To implement a language, one simply supplies those two portions of code, using the interface.
2.250 +
2.251 +The proto-runtime simplifies instrumentation because its structure closely matches the theory. Each language construct has its own handler into which to insert measurement code, and transitions in unit life-lines have convenient, well-defined locations in the proto-runtime at which to insert instrumentation code.
2.252 +
2.253 +
2.254 +\subsubsection{Identifying units within SSR execution}
2.255 +SSR is a virtual-processor (VP) based programming model, in the same category as pthreads and OpenMP, which has an implication for how units are chosen.  A VP based model differs from task-based programming models like dataflow, StarSs, and Cilk.   A task, by definition, executes to completion once it starts; hence, a task is one-to-one with a unit.  But a virtual processor (thread) suspends and resumes, and thus has multiple units over its lifetime.
2.256 +
2.257 +%the same as our definition of unit. They have no state that persists across calls to the runtime. In contrast, a virtual processor does suspend and resume and so has state that persists across runtime calls. Examples include pthreads, OpenMP thread-based constructs, UPC, and so on.
2.258 +
2.259 +We have to identify the start and end of a unit in SSR, and assign a unique ID to each.  We do that by making each invocation of an SSR construct act as the boundary between units: it marks the end of the preceding unit and the beginning of a new unit, which follows.
2.260 +
2.261 +We collect the transition points on a unit's life-line inside the proto-runtime code. Each unit has its own data structure instance, into which the recordings are placed.
2.262 +
2.263 +%For example, in some VP, some work is performed, followed by a send construct.  The send suspends the VP and switches to the runtime, where the send handler is invoked. It tries to match the send with a receive. At some point a match succeeds and the VP resumes.
2.264 +%We define the segment of application code that ran in the VP before the send as one unit, and the segment of code after the send construct as a different unit. Hence, inside the handler for the send, the preceding unit is ended and a new one is created and assigned a new, unique, ID.
2.265 +
2.266 +%The processing of trying to match the send with a receive is the work of the constraint management.  If it succeeds in matching, then the work is assigned to the unit that is freed, which logically could be either the unit that follows the send, or the unit that follows the receive matched to the send.  For simplicity we choose to assign it to the send, because it's on the same core.
2.267 +
2.268 +%In the send handler, we record the unit that the send ended, and the unit that the resume from the send will start. We also create a cause-and-effect link between them that records their sequential dependency, which is due to the sequential nature of a VP.  A time stamp is taken just before start of the handler, and just after its end.  The difference is the size of the constraint management region.
2.269 +
2.270 +%They execute sequential code that occasionally calls a parallel construct, which suspends the VP and switches to the runtime. This means that each VP contains a sequence of units, with each unit the trace-segment between two SSR library calls.
2.271
2.272
2.273 -SSR has both deterministic constraints, which specify the source and destination VP, such as send\_from\_to, and non-de\-ter\-mi\-nis\-tic ones, in which the runtime is what chooses which VPs interact, such as send\_of\_type\_to\ and singleton. Deterministic ones display the same in the UCC and the SCG. However, non-deterministic ones need all possibilities to be determined for the UCC, requiring extra instrumentation code.
2.274 -
2.275  \subsubsection{Collecting a unit}
2.276 -Code to record a new unit is inserted into VMS at the transition out of the runtime and into application code. Code to record the unit end is inserted into the VMS primitive that performs switching back to the runtime.
2.277 -
2.278 -\subsubsection{Collecting the constraints} In VMS, each language construct has its own handler. Code is inserted into each handler, to record which unit invoked the construct, and any units freed by it. The SCG links the unit that made a construct call to the units freed by that call.
2.279 -
2.280 -What information needs to be collected for SCG and UCC and how it is done depends on the construct:
2.281 +Every SSR construct invokes the proto-runtime, which in turn hands off to the construct's handler. So we record the end of the preceding unit inside proto-runtime, just before invocation of the handler. Likewise, we record the time stamp for the start of a new unit  in the proto-runtime code just before  switching back to the VP.
2.282 +
2.283 +\subsubsection{Collecting the constraints}
2.284 +
2.285 +
2.286 +
2.287 + Code is inserted into each of SSR's construct handlers to record which unit invoked the construct, and any units freed by it.
2.288 +What information needs to be collected depends on the construct:
2.289  \begin{itemize}
2.290  \setlength{\itemsep}{0em}
2.291 -\item create\_VP:  We place code into the create\_VP\ handler, which records  the calling VP\ + unit, along with the newly created unit, and the VP it is assigned to. Both the SCG and UCC draw arcs between creating unit and created.
2.292 -\item send\_from\_to and receive\_from\_to: \textbf{} Code is placed into both handlers at the point that checks if both the rendez-vous requests are present. When true, it records the two units + VPs that connected. The UCC  and SCG both represent this by two crossing dependencies.
2.293 +\item create\_VP:  Code is inserted into the handler to record the calling VP\ + the unit just ended. It also creates a new unit, and records the VP it is assigned to. %Both the SCG and UCC draw arcs between creating unit and created.
2.294 +
2.295 +\item send\_from\_to and receive\_from\_to: \textbf{} Code is placed into both handlers at the point that checks if both the rendez-vous requests are present. When true, it records the two units + VPs that connected. %The UCC  and SCG both represent this by two crossing dependencies.
2.296 +
2.297 +\item send\_to\_of\_type and receive\_to\_of\_type:   The same code is inserted to record  the  unit+VPs that connected. This is enough for the SCG.  But for the UCC, we want to capture all sending and receiving permutations available, so we add code that collects the group of senders and the corresponding group of receivers.
2.298 +
2.299  \item Singleton:   The singleton unit has a group of predecessor units and a group of successor units. The first predecessor to complete enables the singleton unit, while all successors must wait for its completion. We insert code into the handler, which records the predecessor that enabled the singleton unit. This is all that the SCG needs. For the UCC, we add code inside the singleton call that collects the calling unit, adding it to the predecessor group, and the unit it jumps to, adding that to the successor group.
2.300
2.301  \end{itemize}
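As an illustration of the per-construct collection, a rendez-vous handler might record both the matched pair (enough for the SCG) and the candidate groups (needed by the UCC), roughly as follows (a hypothetical sketch; fixed-size groups are used only for brevity):

```c
#include <stddef.h>
#include <stdint.h>

/* What the typed send/receive handlers accumulate per channel. */
typedef struct {
    uint64_t send_unit, recv_unit;      /* pair that connected (SCG) */
    uint64_t senders[8], receivers[8];  /* candidate groups (UCC)    */
    size_t   n_senders, n_receivers;
} typed_channel_log;

/* Called at the point where both rendez-vous requests are present. */
static void record_match(typed_channel_log *chan,
                         uint64_t send_unit, uint64_t recv_unit) {
    chan->send_unit = send_unit;
    chan->recv_unit = recv_unit;
    /* every observed participant joins its group, for the UCC */
    chan->senders[chan->n_senders++]     = send_unit;
    chan->receivers[chan->n_receivers++] = recv_unit;
}
```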
2.302
2.303  \subsubsection{Recording time, instructions, and cache misses }
2.304 - Just recording the units and connections between them is not enough for the SCG. Because it represents core usage, the SCG also needs  the cycles spent on each activity to be collected, including internal runtime activities. The size of each interval of core usage is recorded and assigned to a  segment of some  unit's life-line.
2.305 -
2.306 -The UCC also makes use  of the number of instructions in a unit, as an estimate of size of work in the unit, as illustrated by Fig. \ref{fig:UCC_expl}. Without knowing the relative size of the units, it is hard to estimate the amount of parallelism \emph{usefully} available in the application.
2.307 -
2.308 -To measure the instructions, cycles, and communication (cache misses), we use hardware performance counters. Readings are inserted into the runtime code to capture core time spent on each segment of the life-line of a unit:
2.309 + Just recording the units and connections between them is not enough for the SCG. Because it represents core usage, the SCG also needs to collect the cycles spent on each activity, including internal runtime activities. The size of each interval of core usage is recorded and assigned to a  segment of some  unit's life-line.
2.310 +
2.311 +The UCC  makes use  of the number of instructions in a unit, as an estimate of size of work in the unit, as illustrated by Fig. \ref{fig:UCC_expl}. Without knowing the relative size of the units, it is hard to estimate the amount of parallelism \emph{usefully} available in the application.
2.312 +
2.313 +To measure the instructions, cycles, and communication (cache misses), we use hardware performance counters. Readings are inserted into the runtime code to capture core time spent on each segment of the life-line of a unit. We number the segments, then map points in the runtime to the segments they measure. The segments are:
2.314  \begin{enumerate}
2.315 -\item Create meta-unit: This is the time spent inside the create\_VP construct's handler function.
2.316 -\item Update constraints: This is the time spent inside the handler functions that implement the constructs.
2.317 -\item Decision to animate: This is the time spent inside the language-supplied assigner function.
2.318 -\item Move meta-unit to core: This is via shared variables, recorded as part of 3.
2.319 -\item Move work data to core: This is via cache misses, recorded as part of 6.
2.320 -\item Do the work of the unit: This is the cycles between the switch-to-unit  and the following switch-to-runtime.
2.321 -\item Communicate state update: This is the time between leaving the application code and starting the construct handler (which includes lock acquisition).
2.322 -\item Resulting constraint updates: This is the time spent inside the construct handler, and is the same as 2.
2.323 +\item Create meta-unit: The time spent inside the create\_VP construct's handler.
2.324 +\item Update constraints: The time spent inside the handler functions.
2.325 +\item Decision to animate: The time spent inside the language-supplied assigner function.
2.326 +\item Move meta-unit to core: Rolled into 3, as part of the decision.
2.327 +\item Move work data to core: This happens via cache misses, so we roll it together with 6, the work itself.
2.328 +\item Do the work of the unit: The cycles from the switch-to-VP that starts the unit's work, until the following switch-to-runtime that ends it.
2.329 +\item Communicate state update: The time from switch-to-runtime until the start of the construct handler (which includes lock acquisition).
2.330 +\item Resulting constraint updates: The time spent inside the construct handler; this is the same measurement as 2, but may be assigned differently.
2.331  \end{enumerate}
2.332
2.333  To cover each of those segments of a unit's life-line, code to read  the performance counters is inserted at:
2.334
2.335  \begin{itemize}
2.336  \setlength{\itemsep}{0em}
2.337 -\item Construct handler: To measure 2 and 8, reading is done before and after VMS calls the language-supplied construct handler function.\item
2.338 -Assigner: To measure 3 and 4, reading is done before and after VMS calls  the language-supplied assigner function.
2.339 -\item Work: To measure 5 and 6, reading is done by reading inside the VMS switch-to-unit operation, and the switch-to-runtime operation.
2.340 +\item Construct handler: To measure 2 and 8, reading is done before and after the proto-runtime calls the  construct handler function.
2.341 +
2.342 +\item
2.343 +Assigner: To measure 3 and 4, reading is done before and after the proto-runtime calls the language-supplied assigner function.
2.344 +\item Work: To measure 5 and 6,  read inside the switch-to-VP operation, and inside the switch-to-runtime operation.
2.345
2.346  \item
2.347 -Dual-use:  1 is measured by using the construct handler reads for the create\_VP construct handler. For 7, the switch-to-runtime read is subtracted from  the read at the start of the construct handler function.
2.348 +Dual-use:  1 is measured by using the read for the create\_VP construct handler.  For 7, subtract the switch-to-runtime read  from  the read at the start of the construct handler function.
2.349  \end{itemize}
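The reading points listed above combine into segment sizes by simple subtraction, for example (hypothetical field names for the four counter reads):

```c
#include <stdint.h>

/* The four counter reads taken around one pass through the runtime. */
typedef struct {
    uint64_t t_switch_to_runtime; /* leaving application code          */
    uint64_t t_handler_start;     /* proto-runtime invokes the handler */
    uint64_t t_handler_end;       /* handler returns                   */
    uint64_t t_switch_to_vp;      /* work of the next unit begins      */
} runtime_reads;

/* Segment 7: state-update communication, including lock acquisition. */
static uint64_t seg_state_update(const runtime_reads *r) {
    return r->t_handler_start - r->t_switch_to_runtime;
}

/* Segments 2 / 8: constraint management inside the construct handler. */
static uint64_t seg_constraint_mgmt(const runtime_reads *r) {
    return r->t_handler_end - r->t_handler_start;
}
```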
2.350
2.351
2.352 -For clarity, all but work are grouped as overhead in the visualization, but they could be displayed separately if needed.
2.353 +
2.354
2.355   All the measurements are output into a trace file, which is then evaluated after the run to build the visualizations.
2.356
2.357 @@ -687,7 +660,7 @@
2.358  \subsubsection{UCC}
2.359  For the UCC, units can be either unweighted or weighted. Weighted units appear as rectangles with height proportional to the weight, unweighted units appear as circles. We weight the units with the number of instructions in the work. This is not entirely execution-independent, but removes some of the influence of scheduling and data, such as cache misses.
2.360
2.361 -A critical path algorithm calculates vertical position of a unit, as its position within the critical path. The nodes are spread horizontally such that none overlap.
2.362 +A critical path algorithm calculates the vertical position of a unit, which is its position along the critical path. The nodes are spread horizontally such that none overlap.
2.363  Simple constraints (dependencies) are painted as arcs. Complicated constraints are for now displayed as an additional node bearing information on the constraint, with incoming arcs from all units whose execution status affects the constraint, and outgoing arcs going to the constrained units.
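The vertical placement can be sketched as a longest-path layering over the dependency DAG (a minimal sketch, assuming units are indexed in a topological order; the paper's actual algorithm is not shown):

```c
#include <stddef.h>

/* preds[i] lists the predecessor indices of unit i, terminated by -1.
 * A unit's level is 1 + the maximum level among its predecessors, i.e.
 * its position along the longest (critical) path reaching it. */
static void assign_levels(const int *const preds[], size_t n, int level[]) {
    for (size_t i = 0; i < n; i++) {
        level[i] = 0;
        for (const int *p = preds[i]; p && *p >= 0; p++)
            if (level[*p] + 1 > level[i])
                level[i] = level[*p] + 1;
    }
}
```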
2.364
2.365
2.366 @@ -695,7 +668,7 @@
2.367
2.368  For the SCG, all nodes are weighted with the number of cycles spent on the unit in total (work + overhead). For display, the nodes  are  split into overhead and work. The same critical path algorithm as for the UCC is used to place nodes vertically, but this time horizontal placement is determined by the core on which the unit was executed  (hardware dependencies ensure no overlap).
2.369
2.370 -Constraints can then be overlaid, color coded by type. By default, in SSR, we display creation, direct and typed message sending (but not the crossing dependency from the receiver back to the sender), and singleton outgoing dependencies; but each type can be individually hidden or shown.
2.371 +Constraints can then be overlaid, color-coded by type. By default, in SSR, we display creation, direct and typed message sending (but not the crossing dependency from the receiver back to the sender), and singleton outgoing dependencies; but each type can be individually hidden or shown.
2.372
2.373  All this information is taken purely from the runtime, leading to a rich, configurable visualization without needing to add anything to the application.
2.374
2.375 @@ -787,13 +760,136 @@
2.376
2.377
2.378
2.379 -
2.380 -
2.381 -
2.382  \bibliography{bib_for_papers_12_Jy_15}
2.383
2.384  \end{document}
2.385
2.386 +
2.387 +
2.388 +
2.389 +
2.390 +
2.391 +
2.392 +
2.393 +
2.394 +
2.395 +
2.396 +
2.397 +
2.398 +====================================================================
2.399 +=
2.400 +=
2.401 +====================================================================
2.402 +
2.403 +
2.404 +\subsection{Applying UCC and SCG to hierarchical hardware}
2.405 +
2.406 +
2.407 +\subsubsection{more on SCG}
2.408 +
2.409 +Whereas the UCC has value in understanding the structure of the application,  the second part of the model shows how that structure  was made use of during a run. We called it the Scheduling Consequence Graph because it links scheduling decisions to their performance consequences. It does so by indicating distinct activities in the fundamental logical structures, and linking them causally to each other. As a result it provides a link from each instance of lost performance, back through each element that contributed to causing the loss, as illustrated in Section \ref{subsec:holes}.
2.410 +
2.411 +
2.412 +A consequence graph accounts for each bit of core time, arranged into boxes and arcs, with boxes further divided into regions. A box  represents all core time assigned to one work unit. Each region inside represents a segment of time that the core was engaged in a specific type of activity.  An arc links  regions (or boxes)  and represents a causality of some kind.
2.413 +
2.414 +There is one kind of region for each kind of activity that can source or sink a causal link, and a region may source or sink multiple causalities. Each type of causal link is represented by a corresponding type of arc.
2.415 +
2.416 +The core activities / region-types are: doing application work, waiting idly for communication of work data, managing constraints, deciding assignment of work onto cores, and doing internal runtime activities. The runtime region has sub-activities, but we do not detail them here due to space. However, some will be stated in Section \ref{sec:Implementation} when we talk about instrumenting a runtime.
2.417 +
2.418 +Arcs are gathered into groups according to the nature of the causality they represent.  The kinds of causal links are: satisfaction of a sequential dependency in the base language; satisfaction of a parallel constraint  (i.e., one unit did something to satisfy a constraint on the other, causing it to be free to be scheduled); a causal link internal to the runtime (for example, the runtime on one core releasing a shared lock, causing the other core to acquire it); and causal links in the hardware (for example, one work-unit finishes on a core, causing another work-unit to start there, modulo a choice by the runtime).
2.419 +
2.420 +We will now expand on each of those kinds of causal link.
2.421 +
2.422 +\paragraph{Constraint  causal link} Two entire boxes (units) are linked this way when   action by one unit contributes to satisfaction of a constraint blocking the other unit. This includes sequential dependencies from the base language (which are noted in the tool but normally not displayed).
2.423 +
2.424 +Sequential dependencies may add superfluous constraints that  eliminate some otherwise allowed choices in the UCC. An example would be a sequential \texttt{for} loop that creates work-units -- no parallelism constructs cause the creations to be done in sequence, but the base C language sequentializes it nonetheless.
2.425 +
2.426 +\paragraph{Runtime internal causal link} Runtime implementation details may introduce ``extra'' causalities between units. For example, the runtime we instrumented for this paper runs separately on each core and relies upon a global lock for accessing shared runtime information. This lock introduces a causal relationship when the runtime on one core is attempting to process one unit, but must wait for the runtime on a different core to finish with its unit.
2.427 +
2.428 + Normally, these are not displayed explicitly, due to clutter, but can be turned on when needed, for instance to determine the cause of a particular pattern of core usage.
2.429 +
2.430 +\paragraph{Hardware causal link} The physical fact that a given resource can only be used by one work-unit at a time introduces hardware causalities. When multiple units are free to execute, but all cores are busy, then completion of a unit  on one core causes (in part) the next ready unit to run on that core.
2.431 +
2.432 +These are also not normally displayed, due to clutter, and not all hardware dependencies are directly measured. Future work will focus on using the performance counters and other instrumentation to add more information about communication paths taken as a consequence of the scheduling decisions made. It will start with the current linkage of application-code to runtime decisions, and add consequent usage of communication hardware. This gives an end-to-end linkage between runtime choices and caused behavior on the hardware.
2.433 +
2.434 +
2.435 +
2.436 +
2.437 +
2.438 +
2.439 +
2.440 +
2.441 +
2.442 +\subsection{Breadth of Applicability of UCC and SCG}
2.443 +
2.444 +The question arises of how to apply the UCC and SCG to heterogeneous machines that have multiple levels of runtime, and thus multiple levels of scheduler. To answer that, let's explore a specific machine: in this case, a cluster in one rack. The rack has a back-plane that boards plug into, while a board has its own memory and four sockets; each socket holds a chip with four cores. This system has corresponding physically separate networks: a back-plane network connecting the boards; an inter-socket network on each board; and a cache hierarchy connecting cores within a socket.
2.445 +
2.446 +A runtime is provided for each network, to distribute work among the networked components. The top runtime divides work among the boards, then a board's runtime divides work among the sockets, and a socket's  runtime divides work among the cores.
2.447 +
2.448 +To a given runtime, each runtime below it is treated as an indivisible machine. So the backplane runtime treats each board as a complete machine, creates work for each board, and schedules it. To send work to a board, it communicates with the runtime that manages that board, but without knowing the internal details of that runtime. It just sees the runtime as an opaque ``machine'' that it gives work to. Hence the top runtime schedules onto board-level runtimes, which each schedule onto socket-level runtimes, and so on. This multi-level nature should be reflected in the UCC and SCG.
2.449 +
2.450 +To facilitate this approach, the application  has been written in a language that provides a construct for work-unit division. This construct provides a manipulator to the runtime, giving the runtime the ability to dynamically  divide up the work it is given into smaller work-units. Examples include the DKU pattern \cite{DKUTechRep_09}, TBB's loop constructs \cite{LangTBBRef}, HPF's matrix division facilities \cite{LangHPF}, and so on. Note that with such a manipulator, the UCC is maximally abstract on the unit axis.
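The delegation described above can be sketched in a few lines. This is an illustrative model only, not any real runtime's API: the `WorkUnit.divide` method plays the role of the work-division manipulator, and each `Runtime` hands sub-units to the level below without inspecting its internals.

```python
# Hypothetical sketch of hierarchical runtimes, each treating the level
# below it as an opaque machine. All names are illustrative.

class WorkUnit:
    def __init__(self, size):
        self.size = size

    def divide(self, n):
        """The work-division manipulator: lets a runtime split the
        unit it was given into n smaller work-units."""
        part = self.size // n
        sizes = [part] * n
        sizes[-1] += self.size - part * n   # remainder goes to the last unit
        return [WorkUnit(s) for s in sizes]

class Runtime:
    def __init__(self, name, children):
        self.name = name
        self.children = children            # lower-level runtimes; [] means a core

    def schedule(self, unit):
        if not self.children:               # a core: just "execute" the unit
            return [(self.name, unit.size)]
        placed = []
        # divide among the children without seeing their internals
        for child, sub in zip(self.children, unit.divide(len(self.children))):
            placed += child.schedule(sub)
        return placed

# one board with two sockets of two cores each
def cores(sock):
    return [Runtime(f"{sock}.core{i}", []) for i in range(2)]

sockets = [Runtime(f"sock{j}", cores(f"sock{j}")) for j in range(2)]
board = Runtime("board0", sockets)
print(board.schedule(WorkUnit(100)))   # each core ends up with a piece of the work
```

The board-level runtime only ever calls `schedule` on its sockets; whether a socket runs the work itself or divides it further is invisible to the board, which is exactly the opacity the text describes.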
2.451 +
2.452 +
2.453 +
2.454 +
2.455 +
2.456 +\subsubsection{Levels of UCC} So what does the \textit{concrete} UCC produced during a run on this machine look like? Scheduling decisions bound a work-unit, and those decisions take place inside a runtime. Hence the units are defined relative to a runtime. Each runtime, then, has its own concrete UCC for the units it schedules.
2.457 +\begin{figure}[ht]
2.458 +  \centering
2.459 +  \includegraphics[width = 3.2in, height = 1.8in]{../figures/UCC_Levels_2.pdf}
2.460 +  \caption{Representation of multiple levels of  UCC.}
2.461 +  \label{fig:UCC_Levels}
2.462 +\end{figure}
2.463 +
2.464 +As seen in Fig. \ref{fig:UCC_Levels}, this means there are multiple levels of  concrete UCC, which  are related to each other in the same hierarchy as the runtimes. A single unit that is scheduled by one level of runtime has an entire concrete UCC in the level below. That unit is broken into sub-units, each of those is separately scheduled within the lower-level runtime. Hence, there's a whole lower-level UCC inside that unit, as the figure shows. The UCC hierarchy matches the  runtime hierarchy, which matches the hardware hierarchy.
2.465 +
2.466 +\subsubsection{Levels of SCG}
2.467 +Each concrete UCC has a corresponding SCG. In the SCG definition given above, each column represents time on a core. But in a higher-level SCG, a column represents a runtime instead of a core. The consequence of interest thus becomes that of scheduling a given work-unit onto a particular runtime. If the tuning goal is to improve performance, then the consequence of interest is the time taken until the lower runtime completes the higher-level work-unit.
2.468 +
2.469 +Completion occurs when all the lower-level units have completed, plus any additional time spent dividing the work beforehand and combining the results afterward. Once the lower-level consequence graph is established, this time is easily determined: it is the critical path across the lower-level consequence graph plus the additional, non-overlapped, time spent creating the lower-level units and combining their results into the format of the higher-level unit. The divide and recombine are not part of the lower-level UCC/SCG, because the divide is what creates the lower concrete UCC.
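The completion-time rule above can be made concrete with a small calculation. This is a hedged sketch with made-up durations and dependencies, not data from the paper's runs: the lower-level consequence graph is a DAG, and the higher-level unit's time is its critical path plus the non-overlapped divide and recombine times.

```python
# Illustrative only: a higher-level unit's completion time is the critical
# path through its lower-level consequence graph, plus the non-overlapped
# divide and recombine times. Times and graph structure are invented.

def critical_path(durations, deps):
    """durations: unit -> work time; deps: unit -> list of predecessors.
    Returns the length of the longest dependency chain (DAG assumed)."""
    memo = {}
    def finish(u):
        if u not in memo:
            memo[u] = durations[u] + max(
                (finish(p) for p in deps.get(u, [])), default=0)
        return memo[u]
    return max(finish(u) for u in durations)

# lower-level consequence graph for one higher-level unit
durations = {"a": 4, "b": 2, "c": 3, "d": 1}
deps = {"c": ["a", "b"], "d": ["c"]}      # c waits on a and b; d waits on c
divide_time, recombine_time = 1, 2        # non-overlapped, at the higher level

total = divide_time + critical_path(durations, deps) + recombine_time
print(total)  # 1 + (4 + 3 + 1) + 2 = 11
```

Note that unit `b` contributes nothing to the total: it is off the critical path, which is exactly why the critical path, rather than the sum of work times, determines the higher-level unit's work time.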
2.470 +
2.471 +\subsubsection{Applying the multi-level UCC and SCG}
2.472 +
2.473 +Let us apply this multi-level definition to the matrix multiply from Section \ref{sec:casestudy}. We start at a level above the process that was created to execute the application. Here, a unit is the execution of an entire application, invoked via OS commands. The UCC at that level treats each application execution as a unit. The SCG at that level treats the ``runtime'' region as the time spent by the OS creating a new process and running the loader and linker. The work time of a unit is the execution time of the application, from the first instruction until it exits.
2.474 +
2.475 +The application then contains code that breaks up the work, creates virtual processors to perform the work, and collects the results. Breaking up the work consists of the calculations that determine the dimensions of each sub-matrix and pair up the sub-matrices. Each pair constitutes one work-unit and has a virtual processor created to perform its multiply. These work-units comprise the second-level UCC, which is the UCC shown in \S\ref{sec:UCC}. The corresponding second-level SCGs are the ones shown during the walk-through.
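The divide step just described can be sketched as follows. This is a hypothetical illustration of pairing sub-matrices into work-units; the block size and function name are assumptions, not taken from the paper's code.

```python
# Hedged sketch of the divide step: split the result matrix C = A*B into
# blocks, and pair the strip of A with the strip of B needed for each
# block. Each pair is one work-unit of the second-level UCC.

def make_work_units(n, block):
    """For an n x n multiply, each work-unit pairs a row-strip of A
    with a column-strip of B, producing one block of C."""
    units = []
    for r0 in range(0, n, block):
        for c0 in range(0, n, block):
            a_rows = (r0, min(r0 + block, n))   # rows of A for this block
            b_cols = (c0, min(c0 + block, n))   # columns of B for this block
            units.append((a_rows, b_cols))
    return units

print(len(make_work_units(8, 4)))  # 4 work-units, one per 4x4 block of C
```

In the paper's terms, each tuple returned here would get its own virtual processor, and the loop itself is the divide overhead charged to the higher-level (application) unit rather than to the lower-level SCG.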
2.476 +
2.477 +
2.478 +
2.479 +%The divide and recombine, in this case, are themselves composed of smaller work-units.  However, they are not visible to units in the work portion, although there are causalities between them. The divide units end with creation of work-level units, while the end of work-level units causes creation of recombine units.
2.480 +
2.481 +%Hence, divide and recombine are inside of the higher-level unit while not part of the lower-level SCG. They can be conveniently displayed in a single visualization because of the clean connection between levels provided by the create and end causalities.
2.482 +
2.483 +%To complete the discussion, we consider going down another level, into the physical processor.  In modern super-scalar multi-cores, each core is a parallel processor composed of functional blocks such as arithmetic or floating-point blocks, along with logic that issues instructions to those blocks. We first break down a single SSR work unit of the kind seen in Section \ref{sec:casestudy} into smaller work units, namely individual instructions. Then we break instructions down into pipeline stages.
2.484 +
2.485 +%The shape of the UCC for the instructions inside a single SSR level work-unit is a straight line of sequential dependencies. However, the pipeline stages inside a core modify the UCC. They analyse the register use dependencies between instructions and generate a new UCC consisting of  functions constrained by the data dependencies. The dependencies are enforced by the issue logic, which  goes on to assign ready functions to the functional blocks.  The choices made by the issue logic  determine the shape of the SCG for this level, and the critical path of that SCG is the work time of the higher, SSR level, work-unit.
2.486 +
2.487 +%The stages spent performing fetch, decode, analysis, issue, and retirement correspond to functions in a runtime. Fetch corresponds to creation of the instruction-level work-unit. Decode and analysis logic generate the new UCC, as part of the runtime. Issue logic performs the scheduling function of a runtime. Retirement is also part of enforcing the dependencies, and so is also part of the runtime's scheduler. Most of the overhead of these stages is overlapped by the pipeline effect, but can contribute to the critical path during pipeline disruptions like mis-predicted branches. It is counted as non-overlapped runtime overhead.
2.488 +
2.489 +%To give insight into going from the SSR work-unit level up to the application-level, let us now go from the instruction level up to the SSR unit level, looking up' so to speak. Considering overhead, from inside the core,  the SSR level overhead looks just like more instructions.
2.490 +
2.491 +%Nina: we should talk this part over.. (I should have left more of your wording in place in comments..  I like this way of co-editing.. : )
2.492 +
2.493 +%One level up, the same holds true: The overhead of the unique application-unit would be the time spent in the OS, doing things such as setting up the application's memory space. This time is not visible from inside the application. However, it uses the same processing units (in this case, cores), and to them appears not very different from application work. The noticeable difference, in this case, is that the OS runtime is written using a different programming model instead of SSR.
2.494 +
2.495 +
2.496 +%Consider, in the matrix multiply code, the core usage spent  while  dividing the work and handing it to other cores. This is not work of the application, but overhead spent breaking the single application-unit into multiple sub-units.   Even though it is in the application code, it's purpose is implementing the execution model, which makes it runtime overhead. But which runtime level? It's not part of the SSR language runtime, so not overhead of a unit inside the application,   but rather it's for the application itself, as a unit! So, the core time spent calculating the division  gets counted towards the application-level unit, while the time spent inside the SSR runtime creating the meta-units is counted towards those lower SSR-level units. But both are in the critical path, so both charged as work time of the higher-level unit.
2.497 +
2.498 +
2.499 +% Another way to view this is that  by the process of elimination, the only core-time not accounted for elsewhere is the time spent dividing up a unit into smaller ones, and time spent accumulating the individual results back together. So this is what gets charged to the  higher-level unit.
2.500 +
2.501 +%The last question is how to handle communication consequences, which result from decisions made in all levels? The decisions in higher-level runtimes set the context for decisions in lower-level ones, which links a higher-level choice  to the consequences resulting from the lower-level choices. But the value of a consequence graph comes from linking the size of boxes in it to the decisions made by the scheduler. It's not clear how to divide-up  the time  cores spend waiting for non-overlapped communication, to assign portions to different levels. We have no good answer at the moment and leave it for future work.
2.502 +
2.503 +In practice, the perspective of levels should become more relevant as machine complexity increases, for example with heterogeneous machines, large servers, and clusters.
2.504 +Viewing the system this way may also lead to insights about runtime design, which can benefit from a hierarchical architecture \cite{HWSupportStanford10}. Incidentally, the multi-level view extends down into the compilation process, and below that to the instruction level and the hardware itself.
2.505 +
2.506 +
2.507 +
2.508 +
2.509 +===============
2.510 +
2.511 +
2.512  ===============
2.513
2.514  (This is not true:) plus each point that state update directly affects    the life-state of the unit. Such state updates include life-state of other units involved in constraints with the target unit (because they may cause a change in target unit's life-state), updates of hardware state that directly cause a change in target unit's life state (such as hardware becoming free coupled to the scheduler choosing the target unit), and lastly any internal runtime state changes related to the unit (such as idling until a lock is acquired). The language provides construct-based constraints, while the runtime and hardware provide additional ones.
2.515 @@ -818,6 +914,7 @@
2.516
2.517  The main elements of the approach are: collection points, the output from them, and the post-processing that draws the visualizations.
2.518
2.519 +
2.520  \subsubsection{Collection points}
2.521
2.522
2.523 @@ -1465,4 +1562,258 @@
2.524
2.525   trying to pinpoint the causes of performance loss,  internal structure of
2.526
2.527 - involves  tuning both sequential behavior, inside units of work, and also
2.528 \ No newline at end of file
2.529 + involves  tuning both sequential behavior, inside units of work, and also  of causes of idle cores, and form hypotheses for how