changeset 35:663b333e36c0

perf tuning - add to case study
author Nina Engelhardt <nengel@mailbox.tu-berlin.de>
date Tue, 24 Apr 2012 12:38:21 +0200
parents dfd7c969dbae
children 5357bdecea5a
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
diffstat 2 files changed, 15 insertions(+), 6 deletions(-)
line diff
     1.1 Binary file 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf has changed
     2.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Mon Apr 23 10:49:03 2012 -0700
     2.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Tue Apr 24 12:38:21 2012 +0200
     2.3 @@ -450,17 +450,19 @@
     2.4  
     2.5  We wish to use a standard program that the reader has likely already experienced attempting to performance tune, or else wondered why its performance wasn't better. The best example is likely matrix multiply, with which the reader should be familiar, allowing concentration on the tool without distraction about the application. We run it on a machine with 4 sockets by 10 cores each, for a total of 40 physical cores.
     2.6  
     2.7 -The application code includes a function that automatically divides the work into a number of tasks, based on the number of cores and a tuning parameter. It distributes the tasks across the cores in a round-robin fashion. The answers produced by the tasks are collected by a ``receiver" task and accumulated into a result matrix.
     2.8 +The application code includes a function that automatically divides the work into a number of tasks, based on the number of cores and a tuning parameter. It distributes the tasks across the cores in a round-robin fashion, and then waits for completion of the calculation before initiating shutdown. The answers produced by the tasks are collected by a ``receiver'' function and accumulated into the result matrix, whereupon the receiver notifies the setup function of completion.
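To make this structure concrete, the sketch below shows the general shape of the divide/distribute/collect pattern just described. It is an illustration only: the SSR_* call names, the helper functions, and the types are hypothetical placeholders standing in for the constructs discussed in the next paragraph, not the actual SSR API.

    /* Illustrative sketch of the setup and receiver structure described above.
     * All SSR_* calls, types, and helpers are hypothetical placeholders, not
     * the actual SSR API. */
    #define NUM_CORES 40

    void setup_and_divide(Matrix *a, Matrix *b, Matrix *result, int pieces_per_core) {
        int num_tasks = NUM_CORES * pieces_per_core;        /* tuning parameter */
        for (int t = 0; t < num_tasks; t++) {
            Work *w = make_work(a, b, result, t, num_tasks);
            SSR_create_vp(multiply_task, w, t % NUM_CORES); /* round-robin placement */
        }
        SSR_create_vp(receiver_task, result, 0);
        SSR_receive_from_to(RECEIVER_VP, SETUP_VP);         /* wait for completion notice */
        shutdown_runtime();
    }

    void receiver_task(void *arg) {
        Matrix *result = arg;
        for (int got = 0; got < num_expected_results(); got++) {
            SubResult *r = SSR_receive_of_type_to(RESULT_MSG, RECEIVER_VP);
            accumulate(result, r);                          /* add sub-result into result matrix */
        }
        SSR_send_from_to(DONE_MSG, RECEIVER_VP, SETUP_VP);  /* notify setup of completion */
    }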
     2.9  
    2.10 -
    2.11 -The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors. It has commands for  creating and destroying virtual processors,  and three kinds of send-receive pairs. The first, send\_from\_to specifies a specific sender and specific receiver. The second, send\_of\_type\_to, specifies a specific receiver, but the sender is anonymous, which increases flexibility while maintaining some control over scope. The third kind, send\_of\_type, only specifies the type, and so acts as a global communication channel. The language also includes performance constructs such as the ability to force which core a virtual processor is assigned to.
    2.12 +The language used is SSR, which is based on rendezvous-style send and receive operations made between virtual processors. It has commands for creating and destroying virtual processors, and three kinds of send-receive pairs. The first, \emph{send\_from\_to}, specifies a specific sender and a specific receiver. We use it to notify the setup/shutdown processor that the result processor has received all sub-results and output the final result. The second, \emph{send\_of\_type\_to}, specifies a specific receiver, but the sender is anonymous, which increases flexibility while maintaining some control over scope. This construct is used by the processors multiplying sub-matrices to send their sub-result to the results processor. The third kind, \emph{send\_of\_type}, only specifies the type, and so acts as a global communication channel; it is not used in our application. The language also includes a \emph{singleton} construct that designates a piece of code to be executed only once, which we use to encapsulate a transposition of the second operand sub-matrix to column-major representation in order to fit the access order to the cache lines. Each sub-matrix is shared between several virtual processors, but the transposition only needs to be performed once. Also available are miscellaneous performance constructs such as the ability to force which core a virtual processor is assigned to.
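The fragment below sketches how these constructs might combine inside one multiplication task: the singleton guards the one-time transposition, and the anonymous-sender send delivers the sub-result. As above, SSR_singleton, SSR_send_of_type_to, and the helper names are assumed placeholders rather than the real SSR interface.

    /* Hypothetical sketch of a single multiplication task; SSR_singleton,
     * SSR_send_of_type_to, and all helpers are placeholder names. */
    void multiply_task(void *arg) {
        Work *w = arg;

        /* The second operand sub-matrix is shared by several tasks, but the
         * transposition to column-major order need only happen once; the
         * singleton construct ensures exactly one task performs it. */
        SSR_singleton(&w->b_transpose_done, transpose_to_column_major, w->b);

        SubResult *r = multiply_submatrices(w->a, w->b);

        /* Anonymous sender, specific receiver: deliver the sub-result to the
         * results processor. */
        SSR_send_of_type_to(RESULT_MSG, r, RECEIVER_VP);
    }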
    2.13  
    2.14  A note on terminology: the word ``task''  has acquired multiple  meanings in the literature, making it a vague term. We often use, instead, the term  ``work-unit'', which is defined precisely as the trace-segment performed on a core, between two successive scheduling events, plus the set of datums consumed by that trace segment. The word task often maps well onto this precise definition, and we use both words, but mean work-unit when we say task.
    2.15  
    2.16  
    2.17 -After functional debugging, the first run produces the visualization seen in Fig X. This is what we refer to as a consequence graph. It depicts all the scheduling operations performed by the runtime, along with the consequent usage of the cores. 
    2.18 +After functional debugging, the first run produces the visualization seen in Fig X. This is what we refer to as a scheduling consequence graph. It depicts all the scheduling operations performed by the runtime, along with the consequent usage of the cores. 
    2.19  
    2.20 -A blue vertical block represents the time the core spends doing the actual work of one work-unit. Just above each is the runtime overhead spent on that work-unit, broken into pieces representing acquisition of the lock on the shared semantic state, time spent performing the semantics of the parallelism construct, time spent deciding which ready task to execute next, and time spent switching from virtual processor, to the runtime, and back. 
    2.21 +Each column is associated with one core. 
    2.22 +A blue vertical block represents the time the core spends doing the actual work of one work-unit, its height being proportional to the number of cycles taken for execution. Just above each, in light blue, is the runtime overhead spent on that work-unit, which can be broken into pieces representing time spent acquiring the lock on the shared semantic state, time spent performing the semantics of the parallelism construct, time spent deciding which ready task to execute next, and time spent switching from the virtual processor to the runtime and back. 
    2.23 +
    2.24 +
    2.25  
    2.26  A second visualization, seen in Figure X, depicts the constraints on the scheduling decisions the runtime is allowed to make. 
    2.27  
    2.28 @@ -468,7 +470,14 @@
    2.29  
    2.30  Many different orderings could also have been validly chosen. Which scheduler choices are valid is determined by three kinds of constraints: the application code constraints, hardware constraints, and runtime implementation imposed constraints. 
    2.31  
    2.32 -Returning to  Fig X, the lines in red, cyan, and green represent application-code constraints that each tie two work-units together. The color represents the  kind of constraint imposed by one kind of  parallelism construct. Red is point-to-point send\_from\_to, while green is many-to-one send\_of\_type\_to, and cyan is the singleton construct.\  
    2.33 +Returning to Fig X, the lines in red, orange, and green represent application-code constraints that each tie two work-units together. The color represents the kind of constraint imposed by one kind of parallelism construct. Red is creation of a virtual processor, green is many-to-one \emph{send\_of\_type\_to}, and orange is the \emph{singleton} construct. For better visibility, only constraints that cross cores are shown.
    2.34 +
    2.35 +As in the previous figure, each column represents one core, and the vertical axis represents the progress of the program execution, starting from the top. This shows at a glance the total runtime (height), the level of parallelism during the run (the number of filled columns at a given height), and processor utilization (blue-covered area vs. white). All figures are at the same scale, so they can be compared directly.
    2.36 +
    2.37 +The first thing to notice, then, is that of the 40 available cores, only 13 were being used. Because the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection revealed that a closing bracket in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed for several months, despite this application being used as a test program for development of the language runtime, including performance analysis.
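The paper does not reproduce the offending code; the fragment below is a purely hypothetical illustration of how a misplaced closing bracket in a distribution loop of this general shape can silently leave most cores without work.

    /* Hypothetical illustration only -- not the application's actual code.
     * Intended distribution: one task per core in every round. */
    for (int round = 0; round < pieces_per_core; round++) {
        for (int core = 0; core < NUM_CORES; core++) {
            SSR_create_vp(multiply_task, next_work(), core);
        }
    }

    /* Buggy variant: the inner loop's closing brace has slipped up one line,
     * so the creation call executes once per round instead of once per core,
     * and most of the 40 cores never receive any work. */
    for (int round = 0; round < pieces_per_core; round++) {
        int core;
        for (core = 0; core < NUM_CORES; core++) {
        }
        SSR_create_vp(multiply_task, next_work(), core % NUM_CORES);
    }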
    2.38 +
    2.39 +The second run (Fig b) corresponds much more closely to the expected execution behaviour. However, there remains a noticeable section at the beginning where only 3 cores have work and the other 37 remain idle. Focusing on core 0, we can see that the task-creation VP (the short tasks with outgoing red edges) creates work in order of cores, starting with core 0. Each core's scheduler operates a simple round-robin among the VPs assigned to it, so the creator VP is quickly switched out in favour of the work units it has just created, delaying the creation of work for the remaining cores. 
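A minimal sketch of the per-core round-robin behaviour described above, assuming a simple FIFO ready queue per core; the runtime's actual scheduler is not shown in the paper, and the names here are hypothetical.

    /* Minimal sketch of per-core round-robin scheduling; hypothetical names.
     * Once the creator VP has created the first work units on its own core,
     * those VPs get the next turns, so the creator only resumes creating work
     * for the remaining cores after the local work units have run. */
    typedef struct VP { struct VP *next; /* ... execution context ... */ } VP;
    typedef struct { VP *head, *tail; } ReadyQueue;

    /* Take the oldest ready VP on this core and rotate it to the back. */
    VP *pick_next(ReadyQueue *q) {
        VP *vp = q->head;
        q->head = vp->next;
        if (q->head == NULL) q->tail = NULL;
        vp->next = NULL;
        if (q->tail != NULL) q->tail->next = vp; else q->head = vp;
        q->tail = vp;
        return vp;
    }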
    2.40 +
    2.41  
    2.42  %\begin{figure}[ht]
    2.43  % \begin{minipage}[b]{0.5\linewidth}