
changeset 48:f184ed659caa
perf tune: add UCC/SCG recording implementation
author: Nina Engelhardt <nengel@mailbox.tu-berlin.de>
date: Wed, 30 May 2012 17:31:19 +0200
parents: 364de5b006db
children: 9a695032203b
files: 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
diffstat: 2 files changed, 21 insertions(+), 1 deletions(-)

0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf 0
0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 22
     2.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Thu May 24 12:35:57 2012 -0700
     2.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Wed May 30 17:31:19 2012 +0200
     2.3 @@ -640,6 +640,26 @@
     2.4  
     2.5  The last question is how to handle communication consequences. This is tricky because decisions in higher-level runtimes set the context for decisions in lower-level ones. This means a higher-level choice is linked to the consequences from lower-level choices. The value of a consequence graph is linking the size of boxes in it to the decisions made by the scheduler, as represented by the shape. It's not clear how to divide, among the levels, the time that cores spend waiting for non-overlapped communication. We have no good answer at the moment and leave it for future work.
     2.6  
     2.7 +\section{Implementation}
     2.8 +%%how are graphs generated, what is needed
     2.9 +
    2.10 +The visualization relies on data collected from the runtime during execution. There are two kinds of information that need to be recorded: identification of units and constraints, and execution metric measurements.
    2.11 +The first can be obtained from the language runtime at the places where constraints are checked and modified and units are created. The second has to be recorded as the unit progresses through different stages of execution.
    2.12 +
    2.13 +As units are defined by scheduling decisions, the creation of a unit is easiest to register at the point where the unit is assigned to a processing element. This ensures that all units that are executed are recorded, and all units that are recorded are really executed. There is no significant variation in this between languages, and the units are the same for the concrete UCC and the SCG. If the language captures sufficient information to reconstruct an abstract UCC (possibly only for simple types of UCC), this information can also be captured, allowing more general analysis.
    2.14 +Language constructs specify constraints on units. The connections between units can be very complex depending on the language, so the instrumentation needs to be tailored to the constructs.
    2.15 +In SSR, we have several constructs, all of which simultaneously mark boundaries between tasks:
    2.16 +\begin{description}
    2.17 +\item[Create VP] The creation of a new VP creates a simple dependency: the first task in the new VP may only execute after the creating task has finished.
    2.18 +\item[Simple send and receive] Send to and receive from a specific VP is rendez-vous based, so that the units following the communication in both VPs can only execute after the units preceding the rendez-vous point in both VPs have finished. This can easily be represented by two crossing dependencies. These are deterministic, so the record is the same for the UCC or the SCG.
    2.19 +\item[Typed send and receive] Typed send/receive is also rendez-vous based, but contrary to simple send/receive, the pairing of sender and receiver is not deterministic. For the SCG, which represents a specific run, the actual communications observed can be recorded in the same way as simple send/receive, but for the UCC, we want to capture all sending and receiving permutations available. In this case, since the construct specifies no further constraints beyond the type of the message, we simply record for each type a group of senders and a group of receivers.
    2.20 +\item[Singleton] For the singleton construct, the units and constraints are actually deterministic, only the assignment of the unit to a VP is decided dynamically.
    2.21 +\end{description}
    2.22 +Additionally, the virtual processor structure means that tasks inside a VP are sequentially dependent, i.e., whatever a VP executes after being suspended can only happen after everything it executed before that suspension.
    2.23 +
    2.24 +
    2.25 +
    2.26 +
    2.27  \section{Conclusion}
    2.28  \label{conclusion}
    2.29  We have shown how to apply a computation model to instrument a language runtime for collecting measurements that connect: to each other, to application structure, to scheduling decisions, and to hardware. A simple visualization of the data has features that indicate lost performance, and features that visually link the lost performance to the cause, no matter if the cause is application structure, language runtime implementation, or hardware feature.  It is this linkage, due to the computation model, that sets this approach apart from others. 
    2.30 @@ -898,4 +918,4 @@
    2.31  %%----------------------------------------------------------------------
    2.32  
    2.33   destroying virtual processors, and three kinds of send-receive pairs. The 
    2.34 - graph. It depicts all the scheduling operations performed by the runtime, 
    2.35 \ No newline at end of file
    2.36 + graph. It depicts all the scheduling operations performed by the runtime,
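
The recording scheme described in the added Implementation section can be illustrated with a minimal sketch. It is not the VMS or SSR code: the class Recorder, its method names, and the representation of a unit as a (vp, sequence number) pair are all hypothetical, chosen only to show how units, the sequential dependency inside a VP, the create-VP dependency, the two crossing dependencies of a rendez-vous, and the per-type sender and receiver groups might be stored.

```python
from collections import defaultdict
from dataclasses import dataclass, field


def _new_group():
    # One group of senders and one group of receivers per message type.
    return {"senders": set(), "receivers": set()}


@dataclass
class Recorder:
    """Hypothetical recorder; a unit is identified as (vp, sequence number)."""
    units: list = field(default_factory=list)
    deps: set = field(default_factory=set)  # edges (earlier_unit, later_unit)
    typed_groups: dict = field(default_factory=lambda: defaultdict(_new_group))
    _last: dict = field(default_factory=dict)  # most recent unit of each VP

    def assign_unit(self, vp):
        # Called where a unit is assigned to a processing element.
        seq = self._last[vp][1] + 1 if vp in self._last else 0
        unit = (vp, seq)
        if vp in self._last:
            # Units inside one VP are sequentially dependent.
            self.deps.add((self._last[vp], unit))
        self._last[vp] = unit
        self.units.append(unit)
        return unit

    def create_vp(self, creating_unit, new_vp):
        # The first unit of the new VP runs only after the creating unit.
        first = self.assign_unit(new_vp)
        self.deps.add((creating_unit, first))
        # The construct also ends the creator's unit, so it starts a new one.
        return self.assign_unit(creating_unit[0]), first

    def rendezvous(self, sender_pre, receiver_pre):
        # Two crossing dependencies: each following unit waits for the
        # preceding unit of the other VP.
        sender_post = self.assign_unit(sender_pre[0])
        receiver_post = self.assign_unit(receiver_pre[0])
        self.deps.add((sender_pre, receiver_post))
        self.deps.add((receiver_pre, sender_post))
        return sender_post, receiver_post

    def typed_endpoint(self, msg_type, unit, role):
        # UCC view of typed send/receive: record group membership only.
        self.typed_groups[msg_type][role].add(unit)


rec = Recorder()
a0 = rec.assign_unit("A")
a1, b0 = rec.create_vp(a0, "B")
a2, b1 = rec.rendezvous(a1, b0)
rec.typed_endpoint("result", a2, "senders")
```

Under these assumptions, an observed typed communication would be recorded for the SCG with rendezvous, exactly like a simple send/receive, while typed_endpoint keeps only the per-type groups that the UCC needs.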