VMS/0__Writings/kshalle

changeset 76:0c973449ccdd

perf-tune Finished modifying UCC concreteness subsection
author Sean Halle <seanhalle@yahoo.com>
date Thu, 09 Aug 2012 17:04:10 -0700
parents f0347dbf1702
children 1dd96de6e570
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
diffstat 1 files changed, 5 insertions(+), 5 deletions(-) [+]
line diff
     1.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Thu Aug 09 16:46:09 2012 -0700
     1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Thu Aug 09 17:04:10 2012 -0700
     1.3 @@ -343,15 +343,15 @@
     1.4  Figure \ref{fig:UCC_Concreteness} shows the two axes and the four sets of additional information required to determine the units and constraints in the final concrete UCC. The position an application-derived UCC lands on the grid indicates how far it is from being fully concrete.  The horizontal   indicates what inputs are still needed to determine the units, and vertical the constraints.  0 indicates that the units (constraints) are fully determined by the application code alone; 1 means parameter values also must be known; 2 means input data values also play a role, and 3 means the units (constraints) can only become known  after runtime scheduling decisions have been made. 
     1.5  
     1.6  The closer the application-derived UCC is to the origin, the less additional information is needed to obtain a  concrete UCC descendant of it. The UCC labeled A in the figure is fully concrete just from the source code alone (representing for example, matrix multiply with fixed size matrices and fixed division). The UCC labeled B requires the input data and parameters to be specified before its units are concrete, but just parameters to make its constraints fully concrete (as per ray-tracing, with bounce depth specified as a parameter). The UCC labeled C only has variability in its constraints, which require input data (for example, H.264 motion vectors).
     1.7 -But even the least concrete UCC, out at the end of the diagonal (D in the figure), becomes concrete during a run of the application.
     1.8 +But even the least concrete UCC, out at the end of the diagonal (D in the figure), generates a concrete descendant during a run of the application.
     1.9   
    1.10 -Notice, though, that even a fully concrete UCC still has degrees of freedom, in which units to run on which hardware and in the order of execution. These decisions fix interactions within the hardware, to yield the communication patterns and consequent performance seen during the run. 
    1.11 +Notice, though, that even a fully concrete UCC still has degrees of freedom, in deciding which units to run on which hardware and in what order of execution. These decisions fix interactions within the hardware, to yield the communication patterns and consequent performance seen during the run. 
    1.12  
    1.13 -An added twist is that an application has a life-line, spanning from code all the way through the run, and its representation  may change at the different stages of life. It starts as pristine source, then moves  into specialization where code is translated into different representations than the original, and finally the specialized code is run. The UCC often changes between these points in  the life-line.
    1.14 +An added twist is that an application has a life-line, spanning from code all the way through the run, and its representation  may change at the different stages of life. It starts as pristine source, then moves  into specialization where code is translated into different representations than the original, and finally the specialized code is run. A different version of UCC may be generated at each of these points in  the life-line.
    1.15  
    1.16  For example, specialization may perform a static scheduling, which fixes the units, moving the UCC towards the origin. Alternatively, the toolchain may inject manipulator code for the runtime to use, which lets it divide units during the run when it needs more. The injection of manipulator code makes the UCC less concrete, moving it further from the origin.
    1.17  
    1.18 -The UCC still indicates what is inside the application's control vs under the runtime's control, even for applications that land far out on the diagonal. It thus indicates what can be done statically: the further out on the diagonal a UCC is, the less scheduling can be done statically in the toolchain.
    1.19 +The progression of UCCs generated during the life-line indicate what entity has control over the units and constraints that appear in the final concrete UCC. Viewing the progression gives insight into what is inside the application programmer's control vs under control of each tool in the toolchain or  the runtime. For example, the original application-derived UCC  indicates what can be done statically: the further out on the diagonal that UCC is, the less scheduling can be done statically in the toolchain.
    1.20  
    1.21  In this paper, we do not suggest how to represent UCCs far out on the diagonal. One of those actually indicates a multi-verse of concrete-UCCs. Which of them materializes  depends on the data and what the scheduler decides. We only represent the concrete UCC that materializes during a run and leave the question of representing less concrete ones to future work.
    1.22    
    1.23 @@ -420,7 +420,7 @@
    1.24  
    1.25  For a given level, a lower-level runtime is seen as a single processing unit. So, what is the consequence of deciding to schedule a given work unit onto one of those  processing units? When the goal is  improving performance, the consequence of interest is the time taken until completion of the work unit.
    1.26  
    1.27 -Hence, the question becomes,  when can the upper-level work unit be considered completed? The answer is, when all the lower-level units of which it is comprised are completed, plus any additional work spent dividing up the work and combining results. Once the lower-level consequence graph is established, this  time can be easily determined: it is the critical path across the lower-level consequence graph plus the additional time spent creating the lower-level units and combining results into the format of the higher-level unit.
    1.28 +Hence, the question becomes,  when can the upper-level work unit be considered completed? The answer is, when all the lower-level units of which it is comprised are completed, plus any additional work spent dividing up the work and combining results. Once the lower-level consequence graph is established, this  time can be easily determined: it is the critical path across the lower-level consequence graph plus the additional, non-overlapped, time spent creating the lower-level units and combining results into the format of the higher-level unit.
    1.29  
    1.30  Consider the concrete example of the SSR matrix multiply application, from the story in section \ref{sec:casestudy}. Going up,  the level above the execution under study involves the invocations of entire applications, via OS commands. At that level, a unit is an entire process, and the  work-time of that unit is the execution time of the application. The SSR matrix multiply execution time includes the critical path through the matrix multiply, plus creation of the various VPs, and collection of the results by the results VP. If the story in section \ref{sec:casestudy} were done in hierarchical fashion, the creation
    1.31