changeset 19:8ecba6eccef8
Perf tuning -- intro that Nina's happy with and Merten approves
| author | Some Random Person <seanhalle@yahoo.com> |
|---|---|
| date | Thu, 12 Apr 2012 07:46:22 -0700 |
| parents | 53991637cae5 |
| children | 1de9173d4226 |
| files | 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex |
| diffstat | 1 files changed, 35 insertions(+), 16 deletions(-) |
line diff
1.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex Thu Apr 12 06:53:51 2012 -0700
1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex Thu Apr 12 07:46:22 2012 -0700
1.3 @@ -64,32 +64,25 @@
1.4
1.5  \section{Introduction and Motivation}
1.6
1.7 -(Where reader is, they can all say "yeah, I agree")
1.8 -Visualizations have been around, and performance tuning tools have been around, but they leave something to be desired. They are a bit fragmented, focusing on one specific view of the application, like statistics of line of code or function call or message sends and receives. They have core time-lines, showing which function runs by time, but the user doesn't know why a function runs in that particular spot on that particular core, nor whether that is desired behavior vs erroneous behavior.
1.9 +Visualizations have been around, and performance tuning tools have been around, but they leave something to be desired. They are a bit fragmented, focusing on one specific view of the application, like statistics of line of code or function call or message sends and receives. They have core time-lines, showing which function runs by time, but the user doesn't know why a function runs in that particular spot on that particular core, nor whether that is desired behavior vs erroneous behavior.
1.10 +
1.11 +
1.12
1.13  There are a variety of tools, but they have in common that they are composed of several disjointed parts that fail to tell the user what they want to know. This leaves the user to guess. It's like the allegory of the five blind people and the elephant: one touches the ear and says it's round and flat, another touches the leg and says it's a tree, and so on. They each are correct, but the views don't connect to tell them the whole picture.
1.14
1.15 -The user usually knows what the application is doing semantically, but parallel performance is all about scheduling and other runtime behavior. The choices about which task or virtual processor is assigned to which core at what point in time is the heart of performance. The causes of that behavior is a big part of the missing information the user wants to know.
1.16 +It is the parallel aspects of code and runtime decisions that the tools fall short on.
1.17 +The user usually knows what the application is doing semantically, but parallel performance is all about scheduling and other runtime behavior. The choices about which task or virtual processor is assigned to which core at what point in time is the heart of parallel performance. The causes of that behavior is a big part of the missing information the user wants to know.
1.18
1.19 -=============
1.20
1.21 - the or have cache misses over time, but never have a coherent view of how application code connects to what happens where and when.
1.22 +To fix this, a mental framework is needed that the views all fit into, to connect them together. The framework should be in terms of scheduling decisions, including the units decisions are made on, and the various sources of constraints on those decisions.
1.23
1.24 -Can have a timeline view, and next to that execution by function and what percent by function, next to that histogram of cache misses. But no coherent view of how to connect these things. There is information missing that connects the views. The user still has to guess about what the cause might be.
1.25 +The views should indicate the units, and visually indicate the constraints, showing which are imposed by the application, which by the hardware, and which by the runtime implementation's details.
1.26
1.27 -The problem of the other vis is they don't give shape of application..
1.28 - the fundamental parallelism related structure.
1.29 +The views should also connect the units to specific segments of code that compose the units, and connect each constraint on scheduling choice to the source of the constraint, within the code, hardware, or runtime. They should also separate resource usage into categories of: application work, scheduling/runtime overhead, and parameter choices that affect unit creation and constraints.
1.30
1.31 -===================
1.32
1.33 -To fix this, want is a mental framework that the views all fit into, so that they connect to each other when one looks at the information.
1.34
1.35 - Scheduling is a fundamental part of parallel execution. The views must include both constraints on scheduling and the actual scheduling choices. The parts that affect what scheduling choices are possible must connect to the parts that show which ones were taken.
1.36 -
1.37 -to have more theoretical underpinning and several views that connect to each other. The user needs more information, with some mental framework to
1.38 -
1.39 -a lot of the time it's lists of measurements, or bar graphs, things like that -- over the whole application or by function -- forcing guessing of how it connects to -- if it tells you that this line creates a lot of level 2 cache misses, that doesn't tell you what the application is doing to cause this.. but when have whole UCC along with it, have context for the measurements -- puts the line of code into a framework -- it's necessary but not useful by itself -- it needs to be connected -- the unit information is more interesting than the line of code information -- line of code has only sequential meaning, missing scheduling connection -- need the scheduling behavior added -- need to know the unit of work that's causing problem, not the line of code -- unit provides a parallelism context, line of code does not.. unit provides an execution order and execution location, with implied communication -- line of code does not.
1.40 -
1.41 +\section{}
1.42  Performance tuning, as does functional debugging, has steps that are iterated: Use measurements to discover discrepancies from desired behavior, use structure info together with that to form hypothesis for cause of discrepancy, use strucuture info together with hypothesis of cause to create plan to fix, then implement and re-execute and gather new measurements, repeat until satisfied.
1.43
1.44  Expl of what is meant by "structure" info -- example where meas of runtime system showed that overhead of task creation took longer than task execution. Hypothesis was trivial: cause of lost performance is runtime overhead of creation is larger than work in a scheduled unit. The plan to fix is to change the number of work-units created, by changing the parameter in the divider code. Implementing this and re-executing showed that this source of performance loss was fixed by the change.
1.45 @@ -108,6 +101,7 @@
1.46  <maybe some stuff about features and benefits of our approach: no app instrumentation, it's all inside language runtime, very low overhead, integrated with VMS-based functional debugging, and so on>
1.47
1.48
1.49 +
1.50  \section{Setup}
1.51
1.52  Preview of what will see in setup
1.53 @@ -302,6 +296,31 @@
1.54
1.55  \end{document}
1.56
1.57 +?
1.58 +
1.59 +=============
1.60 +
1.61 + the or have cache misses over time, but never have a coherent view of how application code connects to what happens where and when.
1.62 +
1.63 +Can have a timeline view, and next to that execution by function and what percent by function, next to that histogram of cache misses. But no coherent view of how to connect these things. There is information missing that connects the views. The user still has to guess about what the cause might be.
1.64 +
1.65 +The problem of the other vis is they don't give shape of application..
1.66 + the fundamental parallelism related structure.
1.67 +
1.68 +===================
1.69 +
1.70 +
1.71 +
1.72 +as well as which choices on them are allowed, and which were actually taken. Finally,
1.73 +
1.74 +the application-imposed constraints on scheduling them, the hardware-imposed constraints on scheduling them, and the runtime-implementation imposed constraints on scheduling them.
1.75 +
1.76 + Scheduling is a fundamental part of parallel execution. The views must include both constraints on scheduling and the actual scheduling choices. The parts that affect what scheduling choices are possible must connect to the parts that show which ones were taken.
1.77 +
1.78 +to have more theoretical underpinning and several views that connect to each other. The user needs more information, with some mental framework to
1.79 +
1.80 +a lot of the time it's lists of measurements, or bar graphs, things like that -- over the whole application or by function -- forcing guessing of how it connects to -- if it tells you that this line creates a lot of level 2 cache misses, that doesn't tell you what the application is doing to cause this.. but when have whole UCC along with it, have context for the measurements -- puts the line of code into a framework -- it's necessary but not useful by itself -- it needs to be connected -- the unit information is more interesting than the line of code information -- line of code has only sequential meaning, missing scheduling connection -- need the scheduling behavior added -- need to know the unit of work that's causing problem, not the line of code -- unit provides a parallelism context, line of code does not.. unit provides an execution order and execution location, with implied communication -- line of code does not.
1.81 +
1.82  %%
1.83  %% EOF ieeepes_skel.tex
1.84  %%----------------------------------------------------------------------
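The granularity example in the intro above (runtime overhead of task creation exceeding the work in each scheduled unit, fixed by changing the divider parameter that controls how many work-units are created) can be sketched with a toy cost model. This is only an illustration: the model, the function name `total_time_us`, and every number in it are assumptions for the sketch, not measurements from the paper's runtime.

```python
# Toy cost model for task granularity (all numbers are hypothetical).
# Each scheduled unit pays a fixed creation/scheduling overhead, and the
# units are spread evenly across the cores.

def total_time_us(n_tasks, total_work_us, overhead_per_task_us, n_cores):
    """Estimated wall time in microseconds for n_tasks equal work-units."""
    per_task_work = total_work_us / n_tasks
    tasks_per_core = -(-n_tasks // n_cores)  # ceiling division
    return tasks_per_core * (overhead_per_task_us + per_task_work)

# 1 second of total work, 4 cores, 50 us of overhead per created task.
WORK, CORES, OVERHEAD = 1_000_000, 4, 50.0

fine = total_time_us(100_000, WORK, OVERHEAD, CORES)  # 10 us work/task: overhead dominates
coarse = total_time_us(400, WORK, OVERHEAD, CORES)    # 2500 us work/task: work dominates

print(f"fine-grained:   {fine:.0f} us")
print(f"coarse-grained: {coarse:.0f} us")
```

Under these assumed numbers the fine-grained split is several times slower, which mirrors the tuning loop in the example: the measurement reveals overhead larger than per-unit work, and the fix is raising per-unit work via the divider parameter rather than touching the application logic.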
