## VMS/0__Writings/kshalle

### changeset 79:4433a26ff153

perf tune: abstract + add execution time numbers to graphs
author Nina Engelhardt
date Fri, 10 Aug 2012 18:44:02 +0200
328f337153e3 2bf63d88116a
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
2 files changed, 14 insertions(+), 14 deletions(-)
line diff
     1.1 Binary file 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf has changed

     2.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri Aug 10 05:32:19 2012 -0700
2.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri Aug 10 18:44:02 2012 +0200
2.3 @@ -68,10 +68,10 @@
2.4
2.5
2.6  \begin{abstract}
2.7 -Performance tuning is an important aspect of parallel programming.  Yet when trying to pinpoint the causes of performance loss,  internal structure of the application and the runtime is often needed in order to understand how the observed patterns of performance have come to pass.
2.8 -The trend in parallel programming languages has been towards models that capture more structural information about the application, in an effort to increase both performance and ease of programming. This structural information can be used to improve performance tuning tools by making the causes of performance loss more readily apparent.
2.9 -We propose using a new computation model  to collect application structure and produce performance visualizations. The visualizations clearly identify idle cores, and tie the idleness to causal interactions within the runtime and hardware, and from there to the parallelism constructs that constrained the runtime and hardware behavior, thereby eliminating guesswork.
2.10 -The approach is used to instrument the runtime of any language without  application modifications. This is implemented for multi-core hardware, and we walk through a tuning session on a large multi-core machine to illustrate how performance loss is identified and how hypotheses for the cause are generated. We also give a concise description of the implementation and the computation model.
2.11 +Performance tuning is an important aspect of parallel programming. Yet when trying to pinpoint the causes of performance loss, often insufficient knowledge of the internal structure of the application and the runtime is available to understand how the observed patterns of performance have come to pass.
2.12 +A trend in parallel programming languages is towards models that capture more structural information about the application, in an effort to increase both performance and ease of programming. This structural information can be used to improve performance tuning tools by making the causes of performance loss more readily apparent.
2.13 +We propose a universal, adaptable set of performance visualizations that integrates more application structure, via a new model of parallel computation. The visualizations clearly identify idle cores, and tie the idleness to causal interactions within the runtime and hardware, and from there to the parallelism constructs that constrained the runtime and hardware behavior, thereby eliminating guesswork.
2.14 +This approach can be used to instrument the runtime of any parallel programming model without modifying the application. As a case study, we applied it to a message-passing model, and we walk through a tuning session on a large multi-core machine to illustrate how performance loss is identified and how hypotheses for the cause are generated.
2.15  \end{abstract}
2.16
2.17
2.18 @@ -141,7 +141,7 @@
2.19
2.20  It then creates a results VP that receives a partial-result from each piece and accumulates the results. The  divider VP  then waits for the results VP to indicate completion, after which the language runtime shuts down.
2.21
2.22 -\subsection{The language}
2.23 +\subsection{The Language}
2.24  The language used is SSR, which is based on rendezvous-style send and receive operations made between virtual processors (VPs), which are more commonly known as threads. It has commands for creating and destroying VPs, and three kinds of send-receive paired operations.
2.25
2.26  The first, \emph{send\_from\_to} specifies both sender and receiver VPs. It is used by the results VP to tell the divider VP that the work is complete. The second, \emph{send\_of\_type\_to}, specifies only a specific receiver, leaving the sender  anonymous, which increases flexibility while maintaining some control over scope. This  is used by the worker VPs doing the pieces to send their partial-result to the results processor. The third kind, \emph{send\_of\_type}, only specifies the type, and so acts as a global communication channel; this is not used in our application.
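The three SSR operations described above can be sketched in Python, with buffered per-VP mailboxes approximating rendezvous send/receive. This is a hypothetical illustration of the pattern from the paper (divider VP splits work, worker VPs send partial results via `send_of_type_to`, and the results VP signals the divider via `send_from_to`); all class and function names are invented here, not SSR's actual C API.

```python
# Illustrative sketch only: queue-based mailboxes stand in for SSR's
# rendezvous semantics, and all names below are hypothetical.
import threading
import queue

class VP:
    """A 'virtual processor' (thread-like entity) with a message mailbox."""
    def __init__(self, name):
        self.name = name
        self.mailbox = queue.Queue()

    def send_of_type_to(self, receiver, msg_type, payload):
        # Specific receiver, anonymous sender (used by worker VPs).
        receiver.mailbox.put((msg_type, payload))

    def send_from_to(self, receiver, payload):
        # Both sender and receiver specified (results VP -> divider VP).
        receiver.mailbox.put((self.name, payload))

    def receive(self):
        # Blocks until a message arrives, like a rendezvous receive.
        return self.mailbox.get()

def run(data, n_pieces):
    """Divide data into pieces, sum each piece, accumulate the partial sums."""
    divider, results = VP("divider"), VP("results")
    pieces = [data[i::n_pieces] for i in range(n_pieces)]

    def worker(piece):
        w = VP("worker")
        w.send_of_type_to(results, "partial", sum(piece))

    def accumulate():
        total = sum(results.receive()[1] for _ in range(n_pieces))
        results.send_from_to(divider, total)  # signal completion

    threads = [threading.Thread(target=worker, args=(p,)) for p in pieces]
    threads.append(threading.Thread(target=accumulate))
    for t in threads:
        t.start()
    sender, total = divider.receive()  # divider waits on results VP
    for t in threads:
        t.join()
    return sender, total
```

A real SSR runtime would suspend the sending VP until the matching receive executes; the buffered queues here only approximate that behavior, which is sufficient to show the communication structure.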
2.27 @@ -220,7 +220,7 @@
2.28
2.29  \begin{figure*}[t!]
2.30    \begin{minipage}[b]{0.2\textwidth}
2.31 -        \subfloat[Original]
2.32 +        \subfloat[35.8 Gcycles\\Original]
2.35    \end{minipage}
2.36 @@ -228,10 +228,10 @@
2.37    % \subfloat[]
2.38    %  {\includegraphics[scale=0.015]{../figures/194.pdf}
2.40 -   \subfloat[After fixing the load balancer]
2.41 +   \subfloat[13.0 Gcycles\\After fixing the load balancer]
2.42      {\includegraphics[scale=0.015]{../figures/2.pdf}
2.44 -   \subfloat[After changing so as to put  work on core 1 first (solution 1)]
2.45 +   \subfloat[11.0 Gcycles\\After changing so as to put work on core 1 first (solution 1)]
2.46      {\includegraphics[scale=0.015]{../figures/5.pdf}
2.48    % \subfloat[S1+divide factor 0.2]
2.49 @@ -240,19 +240,19 @@
2.50    % \subfloat[S1+divide factor 0.3]
2.51    %  {\includegraphics[scale=0.015]{../figures/209.pdf}
2.53 -   \subfloat[plus changing the divide factor from 0.6 to  0.5]
2.54 +   \subfloat[10.9 Gcycles\\plus changing the divide factor from 0.6 to  0.5]
2.55      {\includegraphics[scale=0.015]{../figures/6.pdf}
2.57 -   \subfloat[ further changing the divide factor to 0.4]
2.58 +   \subfloat[15.6 Gcycles\\further changing the divide factor to 0.4]
2.59      {\includegraphics[scale=0.015]{../figures/7.pdf}
2.61 -   \subfloat[Going back to put divider VP onto its own core (Solution~2)]
2.62 +   \subfloat[10.4 Gcycles\\Going back to put divider VP onto its own core (Solution~2)]
2.63      {\includegraphics[scale=0.015]{../figures/12.pdf}
2.65 -   \subfloat[plus moving the receive VP to same core as divider VP]
2.66 +   \subfloat[10.3 Gcycles\\plus moving the receive VP to same core as divider VP]
2.67      {\includegraphics[scale=0.015]{../figures/10.pdf}
2.69 -   \subfloat[plus changing the divide factor to 0.4]
2.70 +   \subfloat[9.7 Gcycles\\plus changing the divide factor to 0.4]
2.71      {\includegraphics[scale=0.015]{../figures/15.pdf}