changeset 16:be5673d9658b

Clarified steps of perf tuning and info it needs as input
author Some Random Person <seanhalle@yahoo.com>
date Wed, 11 Apr 2012 10:20:53 -0700
parents d885f1eb9ad5
children 07c466b1006d
files 0__Papers/Future_Architecture/latex/Future_Architecture.tex 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
diffstat 2 files changed, 21 insertions(+), 13 deletions(-) [+]
line diff
     1.1 --- a/0__Papers/Future_Architecture/latex/Future_Architecture.tex	Wed Apr 11 08:40:05 2012 -0700
     1.2 +++ b/0__Papers/Future_Architecture/latex/Future_Architecture.tex	Wed Apr 11 10:20:53 2012 -0700
     1.3 @@ -279,25 +279,25 @@
     1.4  This is a position paper, to provide food for thought and a starting point for debate.  Even so, the ideas are extrapolated from published work on runtime systems and hardware abstractions that have been implemented and successfully demonstrated.
     1.5  
     1.6  To bring parallel programming into the mainstream, it needs to be productive, source must port easily with high performance, and parallel programming has to be favorable to industry for adoption.
     1.7 -In previous work, we took the position that to attain all three,  software should be organized into a stack, based around \emph{specialization} of source to target hardware. Each layer of the stack has a role in the specialization process, which spans the lifetime of application code as it goes through the stages of, development, transformation to hardware-specific form, installation, and execution.  Hence, specialization is viewed as including the toolchain, hand-tuning, auto-tuners, multi-kernels, profiling, and binary optimization. We briefly restate the elements of such a stack, which encapsulates and organizes these.
     1.8 +In previous work, we took the position that to attain all three,  software should be organized into a stack, based around \emph{specialization} of source to target hardware. Each layer of the stack has a role in the specialization process, which spans the lifetime of application code as it goes through the stages of: development, transformation to hardware-specific form, installation, and execution.  Hence, specialization is viewed as including the toolchain, hand-tuning, auto-tuners, multi-kernels, profiling, and binary optimization. Here, we briefly restate the elements of such a stack, which encapsulates and organizes these.
     1.9  
    1.10  If the premise of such a stack is accepted, then in this paper we take the position that hardware should support tightly-integrated \emph{firm-ware} based runtime systems rather than specific parallelism constructs.
    1.11 -This is a new category of firm-ware that is tightly integrated into the processor pipeline and managed by the OS.  We describe hardware structures that allow traditional thread constructs, domain-specific constructs, transactional memory,  and even consistency models be implemented via such firm-ware, with extremely low overhead, as well as cooperatively engage the language's runtime into pipeline-level  hardware-resource management.
    1.12 +This is a new category of firm-ware that is tightly integrated into the processor pipeline and managed by the OS.  We describe hardware structures that allow traditional thread constructs, domain-specific constructs, transactional memory,  and even consistency models be implemented via such firm-ware, with extremely low overhead, as well as engage the language runtime into pipeline-level  hardware-resource management.
    1.13  \end{abstract}
    1.14  
    1.15  \section{Introduction}
    1.16 - current  parallel programming is blocked from hitting main-stream industry because it has lower productivity than sequential, have to re-write source for each new target to get good performance, and disrupts ways programmers currently think, and disrupts the tools and work-flow, and it's too expensive to do.
     1.17 + Current parallel programming is blocked from entering main-stream industry because it is less productive than sequential programming, requires re-writing source for each new target to get good performance, and disrupts the ways programmers think and their work-flows. All of this makes it too expensive.
    1.18  
    1.19 -Many believe a solution to productivity is domain-specific languages, which implies a large number of languages, each with a small user-base. 
     1.20 +Many believe domain-specific languages are the solution to productivity. To be a solution, a large number of such domain-specific languages have to be created and ported to each hardware target, and both the creation and the porting have to be inexpensive because each language has a small user-base. 
    1.21  
    1.22 -Solving performant-portability is more difficult. It means source is written once, then automatically specialized to all hardware targets, so that it runs high performance on each.
    1.23 -The source has to capture all information needed by all specialization techniques for all hardware, current and future.
     1.24 +Solving performant-portability is more difficult. Such portability means source is written once, then automatically specialized to all hardware targets, so that it runs with high performance on each.
    1.25 +To achieve this, the one source has to capture all information needed by all specialization techniques for all hardware, current and future.
    1.26  
    1.27 -To get adopted by industry,  the approach would have to be flexible enough to fit with each of the wide array of programming styles and work-flows out there, and offer smooth transition from current practices to the new software running high performance on the new hardware.
     1.28 +Adoption by industry is the least research-oriented aspect, but for parallel programming it may be the most important. To be adopted, a solution would have to be flexible enough to support all the domain-specific languages, fit any of the array of programming styles and work-flows in the adopting industry, and offer a smooth transition from current programming practices to the new ones, working seamlessly on both current hardware and future hardware with parallelism support.
    1.29  
    1.30  We call this the triple-goal of Productivity,  Performant-Portability and Adoptability for parallel software. Throughout the paper, we tie specific details of our proposed approach to these three goals.
    1.31   
    1.32 -In previous work, we suggested a software stack based around specialization will address the three goals. Productivity is solved by efficient and practical support of domain-specific languages. Performant-portability is solved by conveniently supporting the full range of specialization techniques. Adoptability is solved by flexibility to adapt and gentle transition and practical/cost-sensitive/effort-reducing.
     1.33 +A previously suggested solution to the triple-goal is a software stack that is based around specialization and is oriented towards small, independent contributions, which collectively improve the specialization process. Productivity is solved by efficient and practical support of domain-specific languages. Performant-portability is solved by conveniently supporting the full range of specialization techniques. Adoptability is solved by the flexibility to adapt to current and future hardware, with a gentle transition that is practical, cost-sensitive, and effort-reducing.
    1.34  
     1.35  In this paper, if the premise of such a software stack is accepted, along with the premise that domain-specific languages solve the productivity problem, then we propose that supporting runtimes in hardware is better than supporting any particular set of parallelism constructs, even ones as basic as the Compare And Swap instruction or thread constructs.
    1.36  
     2.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Wed Apr 11 08:40:05 2012 -0700
     2.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Wed Apr 11 10:20:53 2012 -0700
     2.3 @@ -64,8 +64,15 @@
     2.4  
     2.5  \section{Introduction and Motivation}
     2.6  
     2.7 -Performance tuning has two phases that each involve several aspects, including code, runtime implementation, scheduling choices, and hardware. In order to be effective, a tool used during performance tuning must be part
     2.8 -of a complete model of computation that ties all aspects of both phases of tuning together. Current tools fall short, both because they lack an encompassing model of computation, and because the tools are isolated from each other. Without integration, the user gets an incomplete picture of the computation and must resort to guesses either of where the problem lies or of what to do to fix it.
     2.9 +Performance tuning, like functional debugging, iterates a set of steps: use measurements to discover discrepancies from desired behavior; use structure information together with the measurements to form a hypothesis for the cause of each discrepancy; use structure information together with the hypothesized cause to create a plan to fix it; then implement the fix, re-execute, and gather new measurements; repeat until satisfied.
    2.10 +   
     2.11 +To illustrate what is meant by ``structure'' information, consider an example in which measurement of the runtime system showed that the overhead of creating a task took longer than executing it. The hypothesis was trivial: the cause of lost performance is that the runtime overhead of creation is larger than the work in a scheduled unit.  The plan to fix is to change the number of work-units created, by changing the parameter in the divider code.  Implementing this and re-executing showed that the change eliminated this source of performance loss.
    2.12 +
     2.13 +The example shows that theory is part of hypothesis generation: it required knowledge of the runtime and the understanding that creating a task is work performed as overhead inside the runtime.  The example also shows that generating the plan to fix required understanding the segment of code that divides work into tasks, and the relationship between the parameters of that code and the execution time of the resulting tasks.
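The divider-parameter fix described above can be sketched with a small cost model (all names and numbers here are hypothetical illustrations, not the paper's actual divider code): each created task pays a fixed creation overhead, so when overhead per task exceeds work per task, creating fewer, larger tasks recovers the lost performance.

```python
# Hypothetical, simplified (serial) cost model of the granularity fix:
# total cost = per-task creation overhead + work inside each task.

TASK_CREATION_OVERHEAD = 50.0   # assumed cost to create one task
TOTAL_WORK = 10_000.0           # assumed total work in the computation

def divide_work(total_work, num_units):
    """Divider code: split the work into num_units equal-sized tasks."""
    return [total_work / num_units] * num_units

def execution_cost(tasks):
    """Sum of creation overhead plus work for every task."""
    return sum(TASK_CREATION_OVERHEAD + work for work in tasks)

# Too many tiny tasks: overhead per task (50) exceeds work per task (10).
fine = divide_work(TOTAL_WORK, 1000)
# The fix: fewer, larger tasks, so work (1000) dominates overhead (50).
coarse = divide_work(TOTAL_WORK, 10)

assert execution_cost(fine) > execution_cost(coarse)
```

This ignores parallel speedup entirely; it only captures the structural relationship the example relies on, between the divider parameter and runtime overhead.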
    2.14 +
     2.15 +Hence, each step of performance debugging involves several aspects, including the mental model of the computation, the application code, the runtime implementation, scheduling choices, and the hardware. To be effective, a tool used during performance tuning must be part of a complete model of computation that ties all aspects of the debugging and tuning steps together. Current tools fall short, both because they lack an encompassing model of computation and because the tools are isolated from each other. Without integration, the user gets an incomplete picture of the computation and must resort to guesses, either of where the problem lies or of what to do to fix it.
    2.16 +
    2.17 +
    2.18  
    2.19  We introduce in this paper a model of computation that ties all aspects of performance together, along with instrumentation and visualization that is guided by the model and that links all relevant performance
    2.20  tuning information together. The model and visualization tools are illustrated with a story line, which shows how they are used to performance tune the standard matrix-multiply application on two multi-core systems. 
    2.21 @@ -112,7 +119,8 @@
    2.22  
     2.23  We will see all of these in practice during the story lines of using the visualizations.
    2.24  
    2.25 -\section{Illustrative Story of Performance Tuning Matrix Multiply on Two Different Machines}
    2.26 +
    2.27 +\section{Illustrative Story of Performance Tuning}
    2.28  
     2.29  Overview of the steps in the story, and what each step will show
    2.30  
    2.31 @@ -124,9 +132,9 @@
    2.32  \subsection{Performance Tuning on 4 socket by 10 core by 2 context Machine}
    2.33  
     2.34  Same as for the 4 core machine; this time, point out which choices differ between the 40 core machine and the 4 core machine.
    2.35 +\end{document}
    2.36  
    2.37 -
    2.38 -======================================================
    2.39 +==============================
    2.40  
    2.41  \section{Random Early Thoughts}
    2.42