### changeset 77:1dd96de6e570

perf tuning updates to levels of SCG
author Sean Halle Fri, 10 Aug 2012 02:21:44 -0700 0c973449ccdd 328f337153e3 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 1 files changed, 10 insertions(+), 8 deletions(-)
line diff
1.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Thu Aug 09 17:04:10 2012 -0700
1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri Aug 10 02:21:44 2012 -0700
1.3 @@ -422,17 +422,19 @@
1.4
1.5  Hence, the question becomes,  when can the upper-level work unit be considered completed? The answer is, when all the lower-level units of which it is comprised are completed, plus any additional work spent dividing up the work and combining results. Once the lower-level consequence graph is established, this  time can be easily determined: it is the critical path across the lower-level consequence graph plus the additional, non-overlapped, time spent creating the lower-level units and combining results into the format of the higher-level unit.
1.6
1.7 -Consider the concrete example of the SSR matrix multiply application, from the story in section \ref{sec:casestudy}. Going up,  the level above the execution under study involves the invocations of entire applications, via OS commands. At that level, a unit is an entire process, and the  work-time of that unit is the execution time of the application. The SSR matrix multiply execution time includes the critical path through the matrix multiply, plus creation of the various VPs, and collection of the results by the results VP. If the story in section \ref{sec:casestudy} were done in hierarchical fashion, the creation
1.8 +Consider the concrete example of the SSR matrix multiply application, from the story in section \ref{sec:casestudy}. Going up, the level above the execution under study involves the invocations of entire applications, via OS commands. At that level, a unit is an entire process, and the work-time of that unit is the execution time of the application. The SSR matrix multiply execution time includes the critical path through the matrix multiply work, plus creation of the various VPs, and collection of the results by the results VP. If the story in section \ref{sec:casestudy} were done in hierarchical fashion, the SCG seen there would be divided between levels, with some pieces moved to the application-level SCG. In particular, the creation time and results accumulation time would be moved to the application-level SCG and represent overhead added to the lower-level SCG. The lower level would contain only the work of the multiply.
1.9
1.10 -One thing that is specific to the consequence graph and does not appear in the UCC is overhead. How is overhead dealt with across levels? Looking again at our SSR matrix-multiply, it is visible that the overhead recorded in the SSR-level consequence graph, when it contributes to the critical path, is counted as part of the work-time of the application unit. Only the overhead of the runtime level under investigation is distinguished in the SCG.
1.11 +One thing that is specific to the consequence graph and does not appear in the UCC is overhead. How is overhead dealt with across levels? We must determine which activities in higher levels count as overhead, and what portion of that overhead is overlapped with work in the lower-level SCG and hence does not contribute to the critical path.
1.12
1.13 -That this is congruent with how overhead is often intuitively treated becomes apparent when we consider the next lower level. Each core, indeed, is a parallel processing unit composed of several sub-units, namely functional units such as arithmetic or floating-point units. What is shown as one SSR work unit in section \ref{sec:casestudy} is further broken down into smaller work units: individual instructions. Here too, the execution time of a higher-level work unit is the critical path from the start of the first instruction in the unit to the end of the last instruction.
1.14 -The issue logic in a modern out-of-order processor is in fact a relatively complicated scheduler that analyses the register use dependencies between instructions (=constraints between work units) and dispatches (=assigns) them to different functional units. Overhead, at this level, would be the issue logic, and it contributes to the critical path whenever dispatch is the bottleneck. Yet nobody would think to measure differently the execution time of a calculation depending on whether the limiting factor was the number of arithmetic units available or the size of the instruction buffer.
1.15 +To complete the discussion, we consider going down yet another level, into the physical processor. In modern super-scalar multi-cores, each core is a parallel processor composed of functional blocks such as arithmetic or floating-point blocks. At this level, we break down a single SSR work unit, as seen in section \ref{sec:casestudy}, into smaller work units: individual instructions.
1.16
1.17 -Let us now consider how this mapping between higher and lower level appears when viewed from the lower level, `looking up' so to speak.
1.18 -From inside the core, what is considered overhead at the SSR level are in fact just more instructions. They appear just like a different unit would.
1.19 -One level up, the same holds true:
1.20 -The overhead of the unique application-unit would be the time spent in the OS, doing things such as setting up the application's memory space. This time is not visible from inside the application. However, it uses the same processing units (in this case, cores), and to them appears not very different from application work. The noticeable difference, in this case, is that the OS runtime is written using a different programming model instead of SSR.
1.21 +Now, what does the SCG look like for the instructions inside an SSR-level work-unit? For the SSR unit, the work-time is the critical path from the jump out of the runtime into the first instruction of the unit, up until the jump back from the unit's code into the runtime. This critical path is set by the issue logic in the core, which is in fact a relatively complicated scheduler that generates the SCG. It analyses the register use dependencies between instructions (=constraints between work units) and dispatches (=assigns) them to different functional blocks. The overhead for an instruction is the pipeline stages spent in fetch, decode, rename, and the issue logic. Most of this is overlapped by the pipeline effect, but it contributes to the critical path during pipeline disruptions such as mispredicted branches.
1.22 +
1.23 +%To give insight into going from the SSR work-unit level up to the application-level, let us now go from the instruction level up to the SSR unit level, `looking up' so to speak. Considering overhead, from inside the core, the SSR level overhead looks just like more instructions.
1.24 +
1.25 +%Nina: we should talk this part over.. (I should have left more of your wording in place in comments..  I like this way of co-editing.. : )
1.26 +
1.27 +%One level up, the same holds true: The overhead of the unique application-unit would be the time spent in the OS, doing things such as setting up the application's memory space. This time is not visible from inside the application. However, it uses the same processing units (in this case, cores), and to them appears not very different from application work. The noticeable difference, in this case, is that the OS runtime is written using a different programming model instead of SSR.
1.28
1.29
1.30  %Consider, in the matrix multiply code, the core usage spent while dividing the work and handing it to other cores. This is not work of the application, but overhead spent breaking the single application-unit into multiple sub-units. Even though it is in the application code, its purpose is implementing the execution model, which makes it runtime overhead. But which runtime level? It's not part of the SSR language runtime, so not overhead of a unit inside the application, but rather it's for the application itself, as a unit! So, the core time spent calculating the division gets counted towards the application-level unit, while the time spent inside the SSR runtime creating the meta-units is counted towards those lower SSR-level units. But both are in the critical path, so both are charged as work time of the higher-level unit.
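
Reviewer's aside (not part of the changeset): the rule stated at the top of this hunk says that the work-time of an upper-level unit is the critical path across the lower-level consequence graph, plus the non-overlapped time spent creating the lower-level units and combining results. The sketch below is one possible reading of that rule, not code from the repository. The function names, the unit names, the durations, and the single `overlapped` figure are all hypothetical, and it assumes the lower-level graph is a DAG whose units all have known durations.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(durations, deps):
    """Longest path through a DAG of lower-level work units.

    durations: unit name -> execution time (hypothetical values)
    deps:      unit name -> list of units that must finish first
    """
    graph = {unit: deps.get(unit, ()) for unit in durations}
    finish = {}
    # static_order() yields every unit after all of its predecessors.
    for unit in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[unit]), default=0.0)
        finish[unit] = start + durations[unit]
    return max(finish.values(), default=0.0)


def upper_level_work_time(durations, deps, create_time, combine_time,
                          overlapped=0.0):
    """Critical path plus the non-overlapped create/combine time.

    `overlapped` is an assumed figure for how much of the create/combine
    time is hidden behind lower-level work.
    """
    non_overlapped = max(create_time + combine_time - overlapped, 0.0)
    return critical_path(durations, deps) + non_overlapped


# Hypothetical lower-level graph: a -> c -> d, and b -> d.
durations = {"a": 3.0, "b": 2.0, "c": 4.0, "d": 1.0}
deps = {"c": ["a"], "d": ["b", "c"]}

lower = critical_path(durations, deps)
upper = upper_level_work_time(durations, deps, create_time=2.0,
                              combine_time=1.0, overlapped=0.5)
print(lower, upper)
```

In this toy graph the longest chain is a, c, d, so only the 2.5 time units of create/combine work that are not hidden behind lower-level work get added on top of it.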