VMS/0__Writings/kshalle

changeset 79:4433a26ff153

perf tune: abstract + add execution time numbers to graphs
author Nina Engelhardt <nengel@mailbox.tu-berlin.de>
date Fri, 10 Aug 2012 18:44:02 +0200
parents 328f337153e3
children 2bf63d88116a
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
diffstat 2 files changed, 14 insertions(+), 14 deletions(-)
line diff
     1.1 Binary file 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf has changed
     2.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri Aug 10 05:32:19 2012 -0700
     2.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri Aug 10 18:44:02 2012 +0200
     2.3 @@ -68,10 +68,10 @@
     2.4  
     2.5  
     2.6  \begin{abstract}
     2.7 -Performance tuning is an important aspect of parallel programming.  Yet when trying to pinpoint the causes of performance loss,  internal structure of the application and the runtime is often needed in order to understand how the observed patterns of performance have come to pass.
     2.8 -The trend in parallel programming languages has been towards models that capture more structural information about the application, in an effort to increase both performance and ease of programming. This structural information can be used to improve performance tuning tools by making the causes of performance loss more readily apparent.
     2.9 -We propose using a new computation model  to collect application structure and produce performance visualizations. The visualizations clearly identify idle cores, and tie the idleness to causal interactions within the runtime and hardware, and from there to the parallelism constructs that constrained the runtime and hardware behavior, thereby eliminating guesswork.
    2.10 -The approach is used to instrument the runtime of any language without  application modifications. This is implemented for multi-core hardware, and we walk through a tuning session on a large multi-core machine to illustrate how performance loss is identified and how hypotheses for the cause are generated. We also give a concise description of the implementation and the computation model. 
    2.11 +Performance tuning is an important aspect of parallel programming. Yet when trying to pinpoint the causes of performance loss, the programmer often lacks sufficient knowledge of the internal structure of the application and the runtime to understand how the observed patterns of performance have come to pass.
    2.12 +A trend in parallel programming languages is towards models that capture more structural information about the application, in an effort to increase both performance and ease of programming. This structural information can be used to improve performance tuning tools by making the causes of performance loss more readily apparent.
    2.13 +We propose a universal, adaptable set of performance visualizations that integrates more application structure, via a new model of parallel computation. The visualizations clearly identify idle cores, and tie the idleness to causal interactions within the runtime and hardware, and from there to the parallelism constructs that constrained the runtime and hardware behavior, thereby eliminating guesswork.
    2.14 +This approach can be used to instrument the runtime of any parallel programming model without modifying the application. As a case study, we apply it to a message-passing model and walk through a tuning session on a large multi-core machine to illustrate how performance loss is identified and how hypotheses for the cause are generated.
    2.15  \end{abstract}
    2.16  
    2.17  
    2.18 @@ -141,7 +141,7 @@
    2.19  
    2.20  It then creates a results VP that receives a partial-result from each piece and accumulates the results. The divider VP then waits for the results VP to indicate completion, after which the language runtime shuts down.
    2.21  
    2.22 -\subsection{The language}
    2.23 +\subsection{The Language}
    2.24 The language used is SSR, which is based on rendezvous-style send and receive operations between virtual processors (VPs), more commonly known as `threads'. It has commands for creating and destroying VPs, and three kinds of paired send-receive operations.
    2.25  
    2.26 The first, \emph{send\_from\_to}, specifies both sender and receiver VPs. It is used by the results VP to tell the divider VP that the work is complete. The second, \emph{send\_of\_type\_to}, specifies only the receiver, leaving the sender anonymous, which increases flexibility while maintaining some control over scope. It is used by the worker VPs doing the pieces to send their partial-result to the results VP. The third kind, \emph{send\_of\_type}, specifies only the type, and so acts as a global communication channel; it is not used in our application.
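For concreteness, the sketch below shows how these three operations might be used in the divide-and-accumulate application described above. It is an illustrative C sketch only: the function names and signatures are modelled on the construct names in the text, the receive counterparts are assumed, and the actual SSR API, types, and bookkeeping differ.

/* Illustrative declarations modelled on the construct names in the text;
 * the real SSR API and types differ. */
typedef struct VirtProcr VirtProcr;                 /* a virtual processor (VP)      */

void *send_from_to   (void *msg, VirtProcr *from, VirtProcr *to); /* sender and receiver named    */
void *send_of_type_to(void *msg, int type, VirtProcr *to);        /* receiver named, sender anon. */
void *recv_from_to   (VirtProcr *from, VirtProcr *self);          /* assumed receive counterparts */
void *recv_of_type_to(int type, VirtProcr *self);

enum { PARTIAL_RESULT = 1 };

/* A worker VP sends its partial result to the results VP; the sender stays anonymous. */
void worker_done(VirtProcr *self, VirtProcr *resultsVP, double *partial)
{
    (void)self;                                     /* the sender's identity is not transmitted */
    send_of_type_to(partial, PARTIAL_RESULT, resultsVP);
}

/* The results VP accumulates one partial result per piece, then tells the
 * divider VP, by name, that the work is complete. */
void results_loop(VirtProcr *self, VirtProcr *dividerVP, int numPieces)
{
    double total = 0.0;
    for (int i = 0; i < numPieces; i++)
        total += *(double *)recv_of_type_to(PARTIAL_RESULT, self);
    send_from_to(&total, self, dividerVP);          /* rendezvous: the divider is waiting */
}

/* The divider VP waits for that completion message before the runtime shuts down. */
void divider_wait(VirtProcr *self, VirtProcr *resultsVP)
{
    double *final_result = recv_from_to(resultsVP, self);
    (void)final_result;                             /* report or check the accumulated result */
}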
    2.27 @@ -220,7 +220,7 @@
    2.28  
    2.29  \begin{figure*}[t!]
    2.30    \begin{minipage}[b]{0.2\textwidth}
    2.31 -        \subfloat[Original]
    2.32 +        \subfloat[35.8 Gcycles\\Original]
    2.33      {\quad\quad \includegraphics[scale=0.015]{../figures/222.pdf} \quad
    2.34      \label{fig:story:a}}\quad
    2.35    \end{minipage}
    2.36 @@ -228,10 +228,10 @@
    2.37    % \subfloat[]
    2.38    %  {\includegraphics[scale=0.015]{../figures/194.pdf} 
    2.39    %  }\quad
    2.40 -   \subfloat[After fixing the load balancer]
    2.41 +   \subfloat[13.0 Gcycles\\After fixing the load balancer]
    2.42      {\includegraphics[scale=0.015]{../figures/2.pdf} 
    2.43      \label{story:b}}\quad
    2.44 -   \subfloat[After changing so as to put  work on core 1 first (solution 1)]
    2.45 +   \subfloat[11.0 Gcycles\\After changing so as to put work on core 1 first (Solution~1)]
    2.46      {\includegraphics[scale=0.015]{../figures/5.pdf} 
    2.47      \label{story:c}}\quad
    2.48    % \subfloat[S1+divide factor 0.2]
    2.49 @@ -240,19 +240,19 @@
    2.50    % \subfloat[S1+divide factor 0.3]
    2.51    %  {\includegraphics[scale=0.015]{../figures/209.pdf} 
    2.52    %  }\quad
    2.53 -   \subfloat[plus changing the divide factor from 0.6 to  0.5]
    2.54 +   \subfloat[10.9 Gcycles\\plus changing the divide factor from 0.6 to 0.5]
    2.55      {\includegraphics[scale=0.015]{../figures/6.pdf} 
    2.56      \label{story:d}}\quad
    2.57 -   \subfloat[ further changing the divide factor to 0.4]
    2.58 +   \subfloat[15.6 Gcycles\\further changing the divide factor to 0.4]
    2.59      {\includegraphics[scale=0.015]{../figures/7.pdf} 
    2.60      \label{story:e}}\quad\\
    2.61 -   \subfloat[Going back to put divider VP onto its own core (Solution~2)]
    2.62 +   \subfloat[10.4 Gcycles\\Going back to put divider VP onto its own core (Solution~2)]
    2.63      {\includegraphics[scale=0.015]{../figures/12.pdf} 
    2.64      \label{story:f}}\quad
    2.65 -   \subfloat[plus moving the receive VP to same core as divider VP]
    2.66 +   \subfloat[10.3 Gcycles\\plus moving the receive VP to the same core as the divider VP]
    2.67      {\includegraphics[scale=0.015]{../figures/10.pdf} 
    2.68      \label{story:g}}\quad
    2.69 -   \subfloat[plus changing the divide factor to 0.4]
    2.70 +   \subfloat[9.7 Gcycles\\plus changing the divide factor to 0.4]
    2.71      {\includegraphics[scale=0.015]{../figures/15.pdf} 
    2.72      \label{story:h}}\quad
    2.73    % \subfloat[S2+divide factor 0.3]
    2.74 @@ -279,7 +279,7 @@
    2.75  \subsubsection{Second Run}
    2.76  After fixing this, the next run (Fig \ref{story:b}) corresponds much more closely to the expected execution behaviour. However, there remains a noticeable section at the beginning where only 3 cores have work and the other 37 remain idle.
    2.77  
    2.78 -Zooming in on those  cores, we see that creation code starts running on core 0, within the creation VP, and then the next block on the core is work! Creation stops, starving the other cores. Looking at the creation code, we see that the creation VP assigns the first work VP to its own core, so that work is now waiting in the queue to execute there. When it creates the second work VP, that creation call switches core 0 to the runtime. When done with creation, the runtime takes the next VP from the queue, which is that waiting work VP. Hence core 0 does the work next instead of continuing with creation  (the merits of work stealing or other scheduling strategies are independent from this illustration of how to use this approach to performance tune).
    2.79 +Zooming in on those cores, we see that creation code starts running on core 0, within the creation VP, and then the next block on the core is work. Creation stops, starving the other cores. Looking at the creation code, we see that the creation VP assigns the first work VP to its own core, so that work is now waiting in the queue to execute there. When it creates the second work VP, that creation call switches core 0 to the runtime. When done with creation, the runtime takes the next VP from the queue, which is that waiting work VP. Hence core 0 does the work next instead of continuing with creation (the merits of work stealing or other scheduling strategies are independent of this illustration of how to use this approach for performance tuning).
    2.80  
    2.81  The hypothesis was generated by looking at the code linked to each block and noting the visual pattern that creation code stopped running on core 0. Work code started running instead, and only after it finished did creation code start again. Hence, visual cues led directly to the hypothesis. 
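A minimal sketch of this hypothesis follows, under stated assumptions: the loop, the round-robin core assignment, and create_work_VP_on_core are illustrative, not the runtime's actual code. The point is simply which core the creation VP hands the first work VP to.

/* Hypothetical creation loop inside the creation VP. Assigning piece 0 to
 * core 0 -- the creation VP's own core -- means that as soon as the create
 * call for piece 1 switches core 0 into the runtime, the runtime finds work
 * VP 0 waiting in core 0's queue and runs it, stalling further creation. */
void create_work_VP_on_core(void *piece, int core);   /* illustrative creation call */

void create_pieces(void **pieces, int numPieces, int numCores)
{
    for (int i = 0; i < numPieces; i++) {
        int core = i % numCores;                 /* original: first piece lands on core 0 */
     /* int core = 1 + i % (numCores - 1); */    /* Solution 1: put work on core 1 first  */
        create_work_VP_on_core(pieces[i], core);
    }
}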
    2.82