### changeset 88:bb5df2b662dd

perf tuning -- merged differing ideas on conclusion
author Sean Halle
date Wed, 15 Aug 2012 09:42:57 -0700
parents/children bc83d94128d0 196871d9eaef
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
diffstat 1 files changed, 14 insertions(+), 6 deletions(-)
--- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Wed Aug 15 16:58:30 2012 +0200
+++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Wed Aug 15 09:42:57 2012 -0700
@@ -467,7 +467,7 @@
 
 The model's concepts of meta-unit and unit life-line  map directly to the UCC visualization. The constraints in the UCC visualization are those stated in or implied by the application (with the complexities about UCC modifications and levels noted in Section \ref{sec:theory}).
 
-However, the SCG is not a strict expression of the model, rather it's purpose is practical. It shows usage of cores, and relates that to the quantities in the model. Hence, the measurements for the SCG all are boundaries, where the core's time switches from one category in the model to a different.
+However, the SCG is not a strict expression of the model, rather it's purpose is practical. It shows usage of cores, and relates that to the quantities in the model. Hence, the measurements for the SCG all are boundaries of where the core's time switches from one category in the model to a different.
 
 This differs from the model in subtle ways. Most notably, the model declares segments of time where communications take place, while the SCG doesn't measure the communication time directly, rather it captures idleness of the core caused by the non-overlapped portion of that communication. Also, when calculating the critical path, the SCG only counts non-overlapped portions of runtime activity.
 
@@ -562,13 +562,21 @@
 \section{Conclusion}
 \label{sec:conclusion}
 
-We have shown how to apply a generalized model of parallel computation to build adaptable performance visualizations, relying only on information collected through instrumenting the language runtime, with no modification to the application.
-The approach is demonstrated through the case study of instrumenting the SSR message-passing language runtime and using it to tune a simple parallel matrix multiply.
+We have shown how to apply a new, and general, model of parallel computation to build  performance visualizations that simplify identifying instances of performance loss and linking them to details of application code responsible. They rely only on information collected through instrumenting the language runtime, with no modification to the application.
 
-The resulting visualizations show that the focus on the parallelism-relevant concepts of work units and constraints on their execution allows a clearer view of parallelism-specific issues.
-By integrating visual display of constraints stemming from application structure, language runtime implementation, or hardware features, the various possible causes for performance loss are covered. A flexible filtering system for different types of constraints avoids overcharging the display.
+By integrating visual display of constraints due to application structure, language runtime implementation, and hardware features, all relevant causes for performance loss are covered. The semantic information collected allows filtering   for the relevant types of constraints, to avoid overcharging the display.
 
-As the approach relies on information available to the runtime, we expect that even better results will be observed for ``high-level'' parallel languages that more closely match application concepts instead of hardware concepts.
+We demonstrated, with a case study, how this  improves usability and eliminates  guesswork, by providing a direct path to details in the application code where changes should be made. These benefits derive from the computation model, which focuses on the aspects of parallelism relevant to performance in a way that makes generation of the correct hypothesis for performance loss straight forward.
+
+%I'd like to avoid weaknesses of our approach, in the conclusion.. and this wasn't discussed much in the body.
+%As the approach relies on information available to the runtime, we expect that even better results will be observed for ``high-level'' parallel languages that more closely match application concepts instead of hardware concepts.
+
+
+
+
+
+
+
 
 \bibliography{bib_for_papers_12_Jy_15}
 