changeset 20:1de9173d4226
Perf tuning -- working on background and related work section
| author | Some Random Person <seanhalle@yahoo.com> |
|---|---|
| date | Thu, 12 Apr 2012 08:25:57 -0700 |
| parents | 8ecba6eccef8 |
| children | dd038db1f191 |
| files | 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex |
| diffstat | 1 files changed, 25 insertions(+), 5 deletions(-) [+] |
line diff
--- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Thu Apr 12 07:46:22 2012 -0700
+++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Thu Apr 12 08:25:57 2012 -0700
@@ -66,7 +66,6 @@
 
 Visualizations have been around, and performance tuning tools have been around, but they leave something to be desired. They are a bit fragmented, focusing on one specific view of the application, like statistics of line of code or function call or message sends and receives. They have core time-lines, showing which function runs by time, but the user doesn't know why a function runs in that particular spot on that particular core, nor whether that is desired behavior vs erroneous behavior.
 
-
 
 There are a variety of tools, but they have in common that they are composed of several disjointed parts that fail to tell the user what they want to know. This leaves the user to guess. It's like the allegory of the five blind people and the elephant: one touches the ear and says it's round and flat, another touches the leg and says it's a tree, and so on. They each are correct, but the views don't connect to tell them the whole picture.
 
@@ -78,12 +77,34 @@
 
 The views should indicate the units, and visually indicate the constraints, showing which are imposed by the application, which by the hardware, and which by the runtime implementation's details.
 
-The views should also connect the units to specific segments of code that compose the units, and connect each constraint on scheduling choice to the source of the constraint, within the code, hardware, or runtime. They should also separate resource usage into categories of: application work, scheduling/runtime overhead, and parameter choices that affect unit creation and constraints.
+The views should also connect the units to specific segments of code that compose the units, and connect each constraint on scheduling choice to the precise source of the constraint, within the code, hardware, or runtime. They should also separate resource usage into categories: application work, non-overlapped communication (which results from scheduling decisions), and scheduling/runtime overhead. They should also integrate parameter choices within the code that affect unit creation and constraints.
 
 
 
-\section{}
-Performance tuning, as does functional debugging, has steps that are iterated: Use measurements to discover discrepancies from desired behavior, use structure info together with that to form hypothesis for cause of discrepancy, use strucuture info together with hypothesis of cause to create plan to fix, then implement and re-execute and gather new measurements, repeat until satisfied.
+\section{Background and Related Work}
+Performance tuning, like functional debugging, iterates a set of steps until the person tuning is satisfied. First, take measurements and display them, in order to discover discrepancies from desired behavior. Next, connect the details of each discrepancy with structure information to form a hypothesis for its cause. The cause should suggest a plan to fix the problem. Then implement the plan, re-execute, gather new measurements, and repeat until satisfied.
+
+?
+
+Talk about other tools:
+
+Most of the older, more established tools come from the threads world: they conceive of the application as a processor that does things, without knowing what those things are. Tau has a model, but it is one of cores, memories, and contexts -- not of scheduling, the runtime, or units of work. It has neither tasks nor constraints on tasks.
+
+Task-based languages now make the need visible -- the people who develop a language also develop the tools to go with it. The direction is clearly toward task-based tools, but they are not there yet.
+
+MPI is also a machine-based abstraction; it gives communication information but has no concept of constraints. It sits somewhere in between.
+
+For communication, two things are wanted: 1) the idle time on cores that is a consequence of a particular communication pattern, and in some cases 2) the energy due to the volume of communication. Both are consequences of the scheduling choices made.
+
+
+
+StarSs is clearly thinking about tasks, and even somewhat about scheduling, but the scheduling support is limited -- a task can only be placed into a queue, so scheduling can be manipulated only coarsely, and the view does not show all the constraints. It is missing the runtime overhead, and missing the idle time consequent from non-overlapped communication.
+
+The StarSs tool tries to simplify the view for the user. It does not give performance information directly; instead it identifies tasks and indicates whether each one's size is too small, just right, or too large relative to a recommended size -- between too small, which incurs too much overhead, and too big, which leaves too few tasks to load balance. Tasks are colored one way if too short, another if just right, and a third if too long.
+
+?
+
+=======================
 
 Expl of what is meant by "structure" info -- example where meas of runtime system showed that overhead of task creation took longer than task execution. Hypothesis was trivial: cause of lost performance is runtime overhead of creation is larger than work in a scheduled unit. The plan to fix is to change the number of work-units created, by changing the parameter in the divider code. Implementing this and re-executing showed that this source of performance loss was fixed by the change.
 
 
@@ -92,7 +113,6 @@
 Hence, each step of performance debugging involves several aspects, including mental model of computation, application code, runtime implementation, scheduling choices, and hardware. In order to be effective, a tool used during performance tuning must be part of a complete model of computation that ties all aspects of the debugging/tuning steps together. Current tools fall short, both because they lack an encompassing model of computation, and because the tools are isolated from each other. Without integration, the user gets an incomplete picture of the computation and must resort to guesses either of where the problem lies or of what to do to fix it.
 
 
-
 We introduce in this paper a model of computation that ties all aspects of performance together, along with instrumentation and visualization that is guided by the model and that links all relevant performance
 tuning information together. The model and visualization tools are illustrated with a story line, which shows how they are used to performance tune the standard matrix-multiply application on two multi-core systems.
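The iterated tuning loop described in the diff (measure, hypothesize a cause, plan and apply a fix, re-execute) can be sketched as a short loop. Everything below is an illustrative toy, not an API from the paper: the function names and the shrinking-overhead "application" are assumptions made to show the shape of the iteration.

```python
# Minimal sketch of the iterated tuning loop: measure, find a discrepancy,
# hypothesize a cause from structure info, plan and apply a fix, repeat.
# All names here are illustrative placeholders, not from any tool in the paper.

def tune(measure, diagnose, apply_fix, target, max_rounds=10):
    """Repeat measure -> diagnose -> fix until runtime meets the target."""
    history = []
    for _ in range(max_rounds):
        runtime = measure()            # take measurements
        history.append(runtime)
        if runtime <= target:          # no discrepancy left: satisfied
            break
        cause = diagnose(runtime)      # hypothesis formed from structure info
        apply_fix(cause)               # implement the plan, then re-execute
    return history

# Toy "application": each applied fix halves the runtime overhead.
state = {"overhead": 8.0}
history = tune(
    measure=lambda: 10.0 + state["overhead"],
    diagnose=lambda rt: "runtime overhead dominates",
    apply_fix=lambda cause: state.__setitem__("overhead", state["overhead"] / 2),
    target=12.0,
)
print(history)  # -> [18.0, 14.0, 12.0]
```

The point is only the control structure: each pass couples a measurement to a hypothesis and a fix, mirroring the steps the section lists.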

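The task-size tradeoff that the diff attributes to the StarSs tool (too small: creation overhead rivals the work; too large: too few tasks to load balance across cores) can be made concrete with a toy classifier. The threshold ratio and all names below are assumptions for illustration, not StarSs's actual heuristic.

```python
# Toy version of the task-size check described above: a task is "too small"
# when per-task creation overhead is large relative to its work, and
# "too large" when so few tasks exist that some cores must sit idle.
# The min_ratio threshold is an assumed value, not taken from StarSs.

def classify_task_size(task_work, creation_overhead, total_work, num_cores,
                       min_ratio=10.0):
    """Return 'too small', 'just right', or 'too large' for one task size."""
    if task_work < min_ratio * creation_overhead:
        return "too small"            # overhead of creating the task dominates
    num_tasks = total_work / task_work
    if num_tasks < num_cores:
        return "too large"            # fewer tasks than cores: idle cores
    return "just right"

print(classify_task_size(5.0,   1.0, 1000.0, 4))   # too small
print(classify_task_size(50.0,  1.0, 1000.0, 4))   # just right
print(classify_task_size(500.0, 1.0, 1000.0, 4))   # too large
```

This is also the structure of the "divider code" example in the diff: when creation overhead exceeded the work per unit, the fix was to change the parameter controlling how many work-units are created.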