changeset 38:d66564c88e9a

Perf tune: add description of singleton problem
author Nina Engelhardt <nengel@mailbox.tu-berlin.de>
date Fri, 27 Apr 2012 18:47:48 +0200
parents a64326ad1b5b
children 95c7bc4d8cc9
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
diffstat 2 files changed, 6 insertions(+), 1 deletions(-) [+]
line diff
     1.1 Binary file 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf has changed
     2.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Wed Apr 25 20:17:38 2012 +0200
     2.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri Apr 27 18:47:48 2012 +0200
     2.3 @@ -479,7 +479,7 @@
     2.4  
     2.5  The columns represent each core, and the vertical axis represents the progress of the program execution, starting from the top. This gives at a glance the total runtime (height), level of parallelism during the run (number of filled columns at a specific height), and processor utilization (blue-covered area vs. white). All figures are at the same scale, so they can be compared directly.
     2.6  
     2.7 -The first thing to notice, then, is that of the 40 available cores, only 13 were being used. As the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection revealed that a closing bracket in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed despite using this application as test program for development of the language runtime, including analyzing performance, for several months.
     2.8 +The first thing to notice, then, is that the first picture is slimmer than expected: of the 40 available cores, only 13 were being used. As the application places work on cores explicitly, this had to be a bug in the dividing code. A cursory inspection revealed that a closing bracket in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed for several months, even though this application was used as a test program for development of the language runtime, including for performance analysis.
     2.9  
    2.10  The second run (Fig \ref{story:b}) already corresponds much more closely to the expected execution behaviour. However, there remains a noticeable section at the beginning where only 3 cores have work while the other 37 remain idle. Focusing on core 0, we can see that the task creation VP (short tasks with outgoing red edges) creates work in order of cores, starting with core 0. The core's scheduler operates a simple round-robin between the VPs assigned to its core, so the creator VP quickly gets switched out in favour of the newly created work unit. The work tasks take a large amount of time to complete, during which task creation is suspended.
    2.11  
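
For illustration, the following is a minimal sketch of such a per-core round-robin scheduler (the types and names are invented for this sketch, not taken from the runtime): a VP that suspends is re-queued at the back of its core's FIFO, so a newly created long work VP on the creator's own core delays the next creation until it completes.

#include <stdio.h>

/* Illustrative sketch, not the runtime's code: each core keeps a FIFO of the
 * VPs assigned to it; when the running VP suspends it is re-queued at the
 * back, and the VP that has waited longest runs next. */
typedef struct VP { struct VP *next; const char *name; } VP;
typedef struct { VP *head, *tail; } CoreQueue;

static void enqueue(CoreQueue *q, VP *vp)
{
    vp->next = NULL;
    if (q->tail) q->tail->next = vp; else q->head = vp;
    q->tail = vp;
}

/* Called when the current VP suspends: re-queue it and pick the next one. */
static VP *round_robin_next(CoreQueue *q, VP *suspended)
{
    if (suspended) enqueue(q, suspended);
    VP *next = q->head;
    if (next) {
        q->head = next->next;
        if (!q->head) q->tail = NULL;
    }
    return next;
}

int main(void)
{
    CoreQueue core0 = { NULL, NULL };
    VP creator = { NULL, "creation VP" };
    VP work0   = { NULL, "work VP 0"  };
    enqueue(&core0, &creator);

    VP *running = round_robin_next(&core0, NULL); /* creation VP runs first   */
    enqueue(&core0, &work0);                      /* it creates a work unit   */
    running = round_robin_next(&core0, running);  /* ...and gets switched out */
    printf("now running: %s; %s waits until it finishes\n",
           running->name, creator.name);
    return 0;
}
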
    2.12 @@ -492,6 +492,11 @@
    2.13  At this point we wanted to find out whether taking the road not chosen, dedicating a core to the creation VP, would improve performance further.
    2.14  Going back to version b of the code and implementing this solution instead led to fig. \ref{story:f}. The delay between the two sets has disappeared, resulting in a 4\% shorter execution time.
    2.15  
    2.16 +As core 0 is now idle once the initial creation phase has finished, we also moved the receive VP there (fig. \ref{story:g}). This yields only a minimal improvement at this work unit size, but it allows result collection to overlap with other work, which becomes an advantage when the work is cut into more pieces and collection therefore takes longer (fig. \ref{story:h}).
    2.17 +
    2.18 +Overall it is also noticeable that, as the work units become smaller, execution becomes more irregular. The variability in task length is most likely due to cache misses or page faults, but verifying this hypothesis would require collecting more data (for instance, the ``cycles stalled due to cache misses'' counter available on most modern Intel chips could be tracked and displayed for each unit).
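
For illustration only, the sketch below shows one way such a per-work-unit counter could be collected on Linux, using the perf_event_open interface with the generic stalled-cycles and cache-miss events; the exact Intel event would have to be programmed as a raw event, and the instrumentation points inside the runtime are assumed rather than taken from it.

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* glibc provides no wrapper for perf_event_open(), so call it directly. */
static int perf_open(uint32_t type, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;            /* start stopped; enable around a work unit */
    attr.exclude_kernel = 1;
    return (int)syscall(SYS_perf_event_open, &attr,
                        0 /* this thread */, -1 /* any CPU */, -1, 0);
}

int main(void)
{
    /* Generic hardware events stand in for the Intel-specific counter. */
    int fd_stall = perf_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_STALLED_CYCLES_BACKEND);
    int fd_miss  = perf_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES);
    if (fd_stall < 0 || fd_miss < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd_stall, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(fd_miss,  PERF_EVENT_IOC_ENABLE, 0);

    /* ... one work unit would run here ... */

    ioctl(fd_stall, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd_miss,  PERF_EVENT_IOC_DISABLE, 0);

    uint64_t stalled = 0, misses = 0;
    read(fd_stall, &stalled, sizeof(stalled));
    read(fd_miss,  &misses,  sizeof(misses));
    printf("work unit: %llu backend stall cycles, %llu cache misses\n",
           (unsigned long long)stalled, (unsigned long long)misses);
    close(fd_stall);
    close(fd_miss);
    return 0;
}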
    2.19 +
    2.20 +In some places, ``holes'' are noticeable. Inspecting these holes more closely, we can see that the stalling tasks are waiting for the completion of a singleton. However, the operations enclosed in the singleton take only a short time, and they start much later than the idle periods. Once again, the simple round-robin scheduler is at fault: when the first VP reaches the singleton code portion, it sends a request to acquire the singleton. This request succeeds, but as sending a request suspends the requesting VP, a new task is scheduled in its place. If this happens to be a long task, the VP holding the singleton remains suspended for a long time, and all other VPs that need this singleton have to wait, because it has already been reserved. Because several VPs assigned to the same core share the same matrix pieces so as to increase cache locality, this can result in all VPs on a core being stalled, leading to the observed idle times. This is a property of the language runtime, which the application programmer cannot change, but making the work units smaller helps minimize these effects.
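
The interaction can be reproduced with the following self-contained toy model (hypothetical; it is not the runtime's scheduler or singleton API): one core round-robins over four VPs, three of which only need a short singleton body while the fourth is a long work task. The printed trace shows the waiting VPs going idle long before the short singleton body actually runs.

#include <stdio.h>

/* Toy model of the stall: one core, round-robin over its VPs; requesting the
 * singleton reserves it but suspends the requester, and any other VP that
 * reaches the reserved singleton cannot make progress. */
typedef enum { NEEDS_SINGLETON, HOLDS_SINGLETON, LONG_WORK, DONE } State;
enum { NVP = 4, LONG_UNITS = 50 };

int main(void)
{
    /* VP3 is a long work task; the others only need the short singleton body. */
    State st[NVP] = { NEEDS_SINGLETON, NEEDS_SINGLETON, NEEDS_SINGLETON, LONG_WORK };
    int owner = -1;                      /* -1 means the singleton is free     */
    int t = 0, done = 0;

    while (done < NVP) {
        for (int vp = 0; vp < NVP; vp++, t++) {     /* one round-robin pass    */
            switch (st[vp]) {
            case NEEDS_SINGLETON:
                if (owner < 0) {                    /* request succeeds, but the
                                                       requester is suspended  */
                    owner = vp;
                    st[vp] = HOLDS_SINGLETON;
                    printf("t=%2d  VP%d reserves the singleton and suspends\n", t, vp);
                } else {                            /* already reserved: stall */
                    printf("t=%2d  VP%d idle, waiting for singleton held by VP%d\n",
                           t, vp, owner);
                }
                break;
            case HOLDS_SINGLETON:                   /* resumed much later, even
                                                       though the body is short */
                printf("t=%2d  VP%d runs the short singleton body and releases it\n", t, vp);
                owner = -1;
                st[vp] = DONE;
                done++;
                break;
            case LONG_WORK:                         /* a long task scheduled in
                                                       between keeps the core busy */
                printf("t=%2d  VP%d runs a long work task (%d units)\n", t, vp, LONG_UNITS);
                t += LONG_UNITS;
                st[vp] = DONE;
                done++;
                break;
            case DONE:
                break;
            }
        }
    }
    return 0;
}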
    2.21  
    2.22  %\begin{figure}[ht]
    2.23  % \begin{minipage}[b]{0.5\linewidth}