changeset 37:a64326ad1b5b

more case study
author Nina Engelhardt <nengel@mailbox.tu-berlin.de>
date Wed, 25 Apr 2012 20:17:38 +0200
parents 5357bdecea5a
children d66564c88e9a
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 0__Papers/Holistic_Model/Perf_Tune/latex/figures.tex
diffstat 3 files changed, 110 insertions(+), 51 deletions(-) [+]
line diff
     1.1 Binary file 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf has changed
     2.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Wed Apr 25 06:12:16 2012 -0700
     2.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Wed Apr 25 20:17:38 2012 +0200
     2.3 @@ -26,7 +26,7 @@
     2.4  \documentclass[conference]{IEEEtran}
     2.5  %
     2.6  %\usepackage{makeidx,geometry,amssymb,graphicx,calc,ifthen}
     2.7 -\usepackage{amssymb,graphicx,calc,ifthen,subfigure,dblfloatfix,fixltx2e}
     2.8 +\usepackage{amssymb,graphicx,calc,ifthen,subfig,dblfloatfix,fixltx2e}
     2.9  %
    2.10  
    2.11  % *** CITATION PACKAGES ***
    2.12 @@ -457,7 +457,7 @@
    2.13  
    2.14  The application code includes a function that automatically divides the work into a number of tasks, based on the number of cores and a tuning parameter. It distributes the tasks across the cores in a round-robin fashion, and then waits for completion of the calculation before initiating shutdown. The answers produced by the tasks are collected by a ``receiver" function and accumulated into the result matrix, whereupon it notifies the setup function of completion.
    2.15  
    2.16 -The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors. It has comands for creating and destroying virtutal processors, and three kinds of send-receive pairs. The first, \emph{send\_from\_to} specifies a specific sender and specific receiver. We use it to notify the setup/shutdown processor that the result processor has received all sub-results and output the final result. The second, \emph{send\_of\_type\_to}, specifies a specific receiver, but the sender is anonymous, which increases flexibility while maintaining some control over scope. This construct is used by the processors multiplying sub-matrices to send their sub-result to the results processor. The third kind, \emph{send\_of\_type}, only specifies the type, and so acts as a global communication channel; this is not used our application. The language also includes a \emph{singleton} construct that designates a piece of code as to be executed only once, which we use to encapsulate a transposition of the second operand sub-matrix to column-major representation in order to fit access order to the cache lines. Each sub-matrix is shared between several sub-processors, but the transposition only needs to be performed once. Also available are some miscellaneous performance constructs such as the ability to force which core a virtual processor is assigned to.
    2.17 +The language used is SSR, which is based on rendez-vous style send and receive operations made between virtual processors (VPs). It has comands for creating and destroying virtutal processors, and three kinds of send-receive pairs. The first, \emph{send\_from\_to} specifies a specific sender and specific receiver. We use it to notify the VP dedicated to dividing the work (``creator VP'') that the VP collecting results (``results VP'') has received all sub-results and output the final result. The second, \emph{send\_of\_type\_to}, specifies a specific receiver, but the sender is anonymous, which increases flexibility while maintaining some control over scope. This construct is used by the processors multiplying sub-matrices to send their sub-result to the results processor. The third kind, \emph{send\_of\_type}, only specifies the type, and so acts as a global communication channel; this is not used our application. The language also includes a \emph{singleton} construct that designates a piece of code as to be executed only once, which we use to encapsulate a transposition of the second operand sub-matrix to column-major representation in order to fit access order to the cache lines. Each sub-matrix is shared between several sub-processors, but the transposition only needs to be performed once. Also available are some miscellaneous performance constructs such as the ability to force which core a virtual processor is assigned to.
    2.18  
    2.19  A note on terminology: the word ``task''  has acquired multiple  meanings in the literature, making it a vague term. We often use, instead, the term  ``work-unit'', which is defined precisely as the trace-segment performed on a core, between two successive scheduling events, plus the set of datums consumed by that trace segment. The word task often maps well onto this precise definition, and we use both words, but mean work-unit when we say task.
    2.20  
    2.21 @@ -475,13 +475,22 @@
    2.22  
    2.23  Many different orderings could also have been validly chosen. Which scheduler choices are valid is determined by three kinds of constraints: the application code constraints, hardware constraints, and runtime implementation imposed constraints. 
    2.24  
    2.25 -Returning to Fig X, the lines in red, orange, and green represent application-code constraints that each tie two work-units together. The color represents the  kind of constraint imposed by one kind of  parallelism construct. Red is creation of a virtual processor, green is many-to-one send\_of\_type\_to, and orange is the singleton construct. For better visibility, only constraints that cross cores are shown.
    2.26 +Returning to Fig \ref{story}, the lines in red, orange, and green represent application-code constraints that each tie two work-units together. The color represents the  kind of constraint imposed by one kind of  parallelism construct. Red is creation of a virtual processor, green is many-to-one send\_of\_type\_to, and orange is the singleton construct. For better visibility, only constraints that cross cores are shown.
    2.27  
    2.28  The columns represent each core, and the vertical axis represents the progress of the program execution, starting from the top. This gives at a glance the total runtime (height), level of parallelism during the run (number of filled columns at a specific height), and processor utilization (blue-covered area vs. white). All figures are at the same scale, so they can be compared directly.
    2.29  
    2.30  The first thing to notice, then, is that of the 40 available cores, only 13 were being used. As the application places work on cores explicitly, this must be a bug in the dividing code. A cursory inspection revealed that a closing bracket in the distribution loop had been misplaced. This may be a very simple bug, but it went unnoticed despite using this application as test program for development of the language runtime, including analyzing performance, for several months.
    2.31  
    2.32 -The second run (Fig b) already corresponds much more to the expected execution behaviour. However, there remains a noticeable section at the beginning where only 3 cores have work and the other 37 remain idle. Focusing on core 0, we can see that the task creation VP (short tasks with red edges outgoing) creates work in order of cores, starting with core 0. The core's scheduler operates a simple round-robin between VPs assigned to its core, so the creator VP gets switched out for the newly created work unit quickly. 
    2.33 +The second run (Fig \ref{story:b}) already corresponds much more to the expected execution behaviour. However, there remains a noticeable section at the beginning where only 3 cores have work and the other 37 remain idle. Focusing on core 0, we can see that the task creation VP (short tasks with red edges outgoing) creates work in order of cores, starting with core 0. The core's scheduler operates a simple round-robin between VPs assigned to its core, so the creator VP gets switched out for the newly created work unit quickly. The work tasks take a large amount of time to complete, during which task creation is suspended.
    2.34 +
    2.35 +Two solutions came to mind: distribute work to all other cores first so that they would be busy when the creator VP gets interrupted, or dedicate a core to the creator VP. The first solution has the advantage of preserving performance of the application even when run on a machine with a single-digit number of cores, so we tried it first. This gave us Fig \ref{story:c}. 
    2.36 +
    2.37 +The section with many cores idling at the beginning has disappeared. A small idle period can still be observed between the first and the second set of work tasks, because the work tasks have roughly the same length (some of them are slightly longer because they perform a copy-transpose singleton, and small variations can be caused by cache misses etc.), so the work on core 0 holding up the creator VP, being last to be distributed, is also last to finish. It is also noticeable that in the second set of work units to be distributed, not enough work pieces remain to fill all cores, so that 16 out of 40 remain idle.
    2.38 +
    2.39 +To try to fill that space, we tried to modify the size of the work units. However, as figures \ref{story:d} and \ref{story:e} show, this did not help, as almost immediately the time spent creating the increased number of units becomes the bottleneck, and the time lost between sets grows larger than the time lost on the cores not receiving any work.
    2.40 +
    2.41 +At this point we wanted to try out if taking the road not chosen, dedicating a core to the create VP, would improve performance more.
    2.42 +Going back to version b of the code and implementing this solution instead lead to fig. \ref{story:f}. The delay between the two sets has disappeared, leading to a 4\% shorter execution time.
    2.43  
    2.44  
    2.45  %\begin{figure}[ht]
    2.46 @@ -500,44 +509,94 @@
    2.47  % \end{minipage}
    2.48  %\end{figure}
    2.49  
    2.50 -\begin{figure*}
    2.51 -\centering
    2.52 -\mbox
    2.53 - { \subfigure[description of left graph]
    2.54 -    {\includegraphics[width=0.2in]{../figures/192.pdf} 
    2.55 -    }\quad
    2.56 -   \subfigure[description of right graph]
    2.57 -    {\includegraphics[width=1in]{../figures/194.pdf} 
    2.58 -    }
    2.59 -   \subfigure[description of right graph]
    2.60 -    {\includegraphics[width=1in]{../figures/196.pdf} 
    2.61 -    }
    2.62 -   \subfigure[description of right graph]
    2.63 -    {\includegraphics[width=1in]{../figures/199.pdf} 
    2.64 -    }
    2.65 -   \subfigure[description of right graph]
    2.66 -    {\includegraphics[width=1in]{../figures/201.pdf} 
    2.67 -    }
    2.68 -   \subfigure[description of right graph]
    2.69 -    {\includegraphics[width=1in]{../figures/204.pdf} 
    2.70 -    }
    2.71 -   \subfigure[description of right graph]
    2.72 -    {\includegraphics[width=1in]{../figures/208.pdf} 
    2.73 -    }
    2.74 - }
    2.75 -\mbox
    2.76 - {  \subfigure[description of left graph]
    2.77 -    {\includegraphics[width=1in]{../figures/209.pdf} 
    2.78 -    }\quad
    2.79 -   \subfigure[description of right graph]
    2.80 -    {\includegraphics[width=1in]{../figures/210.pdf} 
    2.81 -    }
    2.82 -   \subfigure[description of right graph]
    2.83 -    {\includegraphics[width=1in]{../figures/212.pdf} 
    2.84 -    }
    2.85 - }
    2.86 -\caption{Text pertaining to all graphs ...} \label{fig12}
    2.87 -\end{figure*}
    2.88 +
    2.89 +\begin{figure}[b]
    2.90 +  \begin{minipage}[b]{0.2\textwidth}
    2.91 + 	\subfloat[Original]
    2.92 +    {\quad\quad \includegraphics[scale=0.015]{../figures/192.pdf} \quad
    2.93 +    \label{story:a}}\quad
    2.94 +  \end{minipage}
    2.95 +  \begin{minipage}[b]{0.7\textwidth}
    2.96 +  % \subfloat[]
    2.97 +  %  {\includegraphics[scale=0.015]{../figures/194.pdf} 
    2.98 +  %  }\quad
    2.99 +   \subfloat[Fix load balancer]
   2.100 +    {\includegraphics[scale=0.015]{../figures/196.pdf} 
   2.101 +    \label{story:b}}\quad
   2.102 +   \subfloat[Solution~1 (portable)]
   2.103 +    {\includegraphics[scale=0.015]{../figures/199.pdf} 
   2.104 +    \label{story:c}}\quad
   2.105 +  % \subfloat[S1+divide factor 0.2]
   2.106 +  %  {\includegraphics[scale=0.015]{../figures/208.pdf} 
   2.107 +  %  }\quad
   2.108 +  % \subfloat[S1+divide factor 0.3]
   2.109 +  %  {\includegraphics[scale=0.015]{../figures/209.pdf} 
   2.110 +  %  }\quad
   2.111 +   \subfloat[S1+divide factor 0.5]
   2.112 +    {\includegraphics[scale=0.015]{../figures/212.pdf} 
   2.113 +    \label{story:d}}\quad
   2.114 +   \subfloat[S1+divide factor 0.4]
   2.115 +    {\includegraphics[scale=0.015]{../figures/210.pdf} 
   2.116 +    \label{story:e}}\quad\\
   2.117 +   \subfloat[Solution~2]
   2.118 +    {\includegraphics[scale=0.015]{../figures/201.pdf} 
   2.119 +    \label{story:f}}\quad
   2.120 +   \subfloat[S2+move receive pr]
   2.121 +    {\includegraphics[scale=0.015]{../figures/204.pdf} 
   2.122 +    \label{story:g}}\quad
   2.123 +   \subfloat[S2+divide factor 0.4]
   2.124 +    {\includegraphics[scale=0.015]{../figures/216.pdf} 
   2.125 +    \label{story:h}}\quad
   2.126 +  % \subfloat[S2+divide factor 0.3]
   2.127 +  %  {\includegraphics[scale=0.015]{../figures/217.pdf} 
   2.128 +  %  }\hspace{2in}
   2.129 +  % \subfloat[scheduling issue with singleton]
   2.130 +  %  {\includegraphics[scale=0.015]{../figures/194.pdf} 
   2.131 +  %  }\quad
   2.132 +  \end{minipage}
   2.133 +\caption{Performance tuning with Scheduling Consequence Graphs}
   2.134 +\label{story}
   2.135 +\end{figure}
   2.136 +
   2.137 +
   2.138 +%\begin{figure*}
   2.139 +%\centering
   2.140 +%\mbox
   2.141 +% { \subfigure[description of left graph]
   2.142 +%    {\includegraphics[width=0.2in]{../figures/192.pdf} 
   2.143 +%    }\quad
   2.144 +%   \subfigure[description of right graph]
   2.145 +%    {\includegraphics[width=1in]{../figures/194.pdf} 
   2.146 +%    }
   2.147 +%   \subfigure[description of right graph]
   2.148 +%    {\includegraphics[width=1in]{../figures/196.pdf} 
   2.149 +%    }
   2.150 +%   \subfigure[description of right graph]
   2.151 +%    {\includegraphics[width=1in]{../figures/199.pdf} 
   2.152 +%    }
   2.153 +%   \subfigure[description of right graph]
   2.154 +%    {\includegraphics[width=1in]{../figures/201.pdf} 
   2.155 +%    }
   2.156 +%   \subfigure[description of right graph]
   2.157 +%    {\includegraphics[width=1in]{../figures/204.pdf} 
   2.158 +%    }
   2.159 +%   \subfigure[description of right graph]
   2.160 +%    {\includegraphics[width=1in]{../figures/208.pdf} 
   2.161 +%    }
   2.162 +% }
   2.163 +%\mbox
   2.164 +% {  \subfigure[description of left graph]
   2.165 +%    {\includegraphics[width=1in]{../figures/209.pdf} 
   2.166 +%    }\quad
   2.167 +%   \subfigure[description of right graph]
   2.168 +%    {\includegraphics[width=1in]{../figures/210.pdf} 
   2.169 +%    }
   2.170 +%   \subfigure[description of right graph]
   2.171 +%    {\includegraphics[width=1in]{../figures/212.pdf} 
   2.172 +%    }
   2.173 +% }
   2.174 +%\caption{Text pertaining to all graphs ...} \label{fig12}
   2.175 +%\end{figure*}
   2.176  
   2.177  
   2.178  \section{Setup}
     3.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/figures.tex	Wed Apr 25 06:12:16 2012 -0700
     3.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/figures.tex	Wed Apr 25 20:17:38 2012 +0200
     3.3 @@ -44,15 +44,15 @@
     3.4    % \subfloat[S1+divide factor 0.2]
     3.5    %  {\includegraphics[scale=0.015]{../figures/208.pdf} 
     3.6    %  }\quad
     3.7 -   \subfloat[S1+divide factor 0.3]
     3.8 -    {\includegraphics[scale=0.015]{../figures/209.pdf} 
     3.9 -    }\quad
    3.10 +  % \subfloat[S1+divide factor 0.3]
    3.11 +  %  {\includegraphics[scale=0.015]{../figures/209.pdf} 
    3.12 +  %  }\quad
    3.13 +   \subfloat[S1+divide factor 0.5]
    3.14 +    {\includegraphics[scale=0.015]{../figures/212.pdf} 
    3.15 +    }\quad\\
    3.16     \subfloat[S1+divide factor 0.4]
    3.17      {\includegraphics[scale=0.015]{../figures/210.pdf} 
    3.18      }\quad
    3.19 -   \subfloat[S1+divide factor 0.5]
    3.20 -    {\includegraphics[scale=0.015]{../figures/212.pdf} 
    3.21 -    }\quad\\
    3.22     \subfloat[Solution~2]
    3.23      {\includegraphics[scale=0.015]{../figures/201.pdf} 
    3.24      }\quad
    3.25 @@ -62,9 +62,9 @@
    3.26     \subfloat[S2+divide factor 0.4]
    3.27      {\includegraphics[scale=0.015]{../figures/216.pdf} 
    3.28      }\quad
    3.29 -   \subfloat[S2+divide factor 0.3]
    3.30 -    {\includegraphics[scale=0.015]{../figures/217.pdf} 
    3.31 -    }\hspace{2in}
    3.32 +  % \subfloat[S2+divide factor 0.3]
    3.33 +  %  {\includegraphics[scale=0.015]{../figures/217.pdf} 
    3.34 +  %  }\hspace{2in}
    3.35    % \subfloat[scheduling issue with singleton]
    3.36    %  {\includegraphics[scale=0.015]{../figures/194.pdf} 
    3.37    %  }\quad