changeset 14:d72bb1ea1427
Added "integration" as theme to holistic perf-tuning paper, and big-picture
to future-arch paper
| author | Some Random Person <seanhalle@yahoo.com> |
|---|---|
| date | Wed, 11 Apr 2012 07:51:08 -0700 |
| parents | 83b3b9e15fb2 |
| children | d885f1eb9ad5 |
| files | 0__Papers/Future_Architecture/latex/Future_Architecture.tex 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex |
| diffstat | 2 files changed, 36 insertions(+), 61 deletions(-) |
line diff
--- a/0__Papers/Future_Architecture/latex/Future_Architecture.tex Tue Apr 10 07:23:24 2012 -0700
+++ b/0__Papers/Future_Architecture/latex/Future_Architecture.tex Wed Apr 11 07:51:08 2012 -0700
@@ -259,7 +259,7 @@
 \bibliographystyle{plain}
 %
 
-\title{Position: Support Runtimes in Hardware,\\ Rather than Specific Parallelism Constructs}
+\title{Position: Flexibility of Runtime Support Beats Specific Parallelism Construct Support}
 
 \author
 {
@@ -276,18 +276,37 @@
 %
 
 \begin{abstract}
-This is a position paper, whose purpose is to provide food for thought and a starting point for debate. Although, the ideas are extrapolations from published work on runtime systems and hardware abstractions that have been implemented and successfully demonstrated.
+This is a position paper, meant to provide food for thought and a starting point for debate. Even so, the ideas are extrapolated from published work on runtime systems and hardware abstractions that have been implemented and successfully demonstrated.
 
-The main premise is that no parallelism constructs should be directly implemented in hardware, but rather separated into a new category of \emph{firmware} that is tightly integrated into the processor pipeline and managed by the OS. We describe hardware structures that allow traditional thread constructs, domain-specific constructs, transactional memory, and even consistency models be implemented as firm-ware, with extremely low overhead, as well as cooperatively engage the language's runtime into pipeline-level hardware-resource management.
+To bring parallel programming into the mainstream, it must be productive, source must port easily to new hardware while retaining high performance, and the approach must be favorable for industry adoption.
+In previous work, we took the position that to attain all three, software should be organized into a stack, based around \emph{specialization} of source to target hardware. Each layer of the stack has a role in the specialization process, which spans the lifetime of application code as it goes through the stages of development, transformation to hardware-specific form, installation, and execution. Hence, specialization is viewed as including the toolchain, hand-tuning, auto-tuners, multi-kernels, profiling, and binary optimization. We briefly restate the elements of such a stack, which encapsulates and organizes these.
 
-We further take the position that software should be organized into a stack, based around \emph{specialization} of source to target hardware. Each layer of the stack has a role in the specialization process, which spans the lifetime of application code as it goes through the stages of, development, transformation to hardware-specific form, installation, and execution. Hence, specialization is viewed as including the toolchain, hand-tuning, auto-tuners, multi-kernels, profiling, and binary optimization. We describe the elements of a stack that encapsulates and organizes these.
+If the premise of such a stack is accepted, then in this paper we take the position that hardware should support tightly integrated, \emph{firm-ware}-based runtime systems rather than specific parallelism constructs.
+This firm-ware is a new category that is tightly integrated into the processor pipeline and managed by the OS. We describe hardware structures that allow traditional thread constructs, domain-specific constructs, transactional memory, and even consistency models to be implemented via such firm-ware, with extremely low overhead, while cooperatively engaging the language's runtime in pipeline-level hardware-resource management.
 \end{abstract}
 
+\section{Introduction}
+Current parallel programming is blocked from reaching mainstream industry because it has lower productivity than sequential programming, source must be rewritten for each new target to get good performance, it disrupts the ways programmers currently think as well as their tools and work-flow, and it is too expensive to adopt.
 
+Many believe a solution to productivity is domain-specific languages, which implies a large number of languages, each with a small user base.
+
+Solving performant-portability is more difficult. It means source is written once, then automatically specialized to all hardware targets, so that it runs with high performance on each.
+The source has to capture all information needed by all specialization techniques for all hardware, current and future.
+
+To be adopted by industry, the approach has to be flexible enough to fit the wide array of programming styles and work-flows in use, and offer a smooth transition from current practices to the new software running with high performance on the new hardware.
+
+We call this the triple goal of Productivity, Performant-Portability, and Adoptability for parallel software. Throughout the paper, we tie specific details of our proposed approach to these three goals.
+
+In previous work, we suggested that a software stack based around specialization will address the three goals. Productivity is addressed by efficient and practical support of domain-specific languages. Performant-portability is addressed by conveniently supporting the full range of specialization techniques. Adoptability is addressed by the flexibility to adapt, a gentle transition path, and sensitivity to cost and effort.
+
+In this paper, given the premise of such a software stack, and the premise that domain-specific languages solve the productivity problem, we propose that supporting runtimes in hardware is better than supporting any particular set of parallelism constructs, even ones as basic as the Compare And Swap instruction or thread constructs.
+
+In Section X we give details of the hardware we propose for supporting the runtimes. In Section X we expand on the software stack, how it fits with the runtime hardware, and how the two together support the three goals. In Section X we apply the proposal to the topics of interest of this workshop to see whether they are consistent and the concerns are addressed. We conclude in Section X with a summary.
+
+(In the stack section, be sure to mention that to achieve portability, we have to reach the point that no software uses shared variables without protecting them via a language construct. Also, software does all synchronization via language constructs, rather than rolling its own via flags on shared variables or a shared-memory sync implementation -- the end result is no communication of any kind outside of language-construct ``protection''.)
 
 \section{What parallel abstractions should the hardware provide?}
 
-
 Our position is that the hardware should not directly supply any parallel abstractions. Instead, it should supply a mechanism that elevates the language runtime to the status of a Hardware Abstraction Layer, which is separate from the executable and separate from the OS. Thus, parallel abstractions are implemented as soft-extensions to the hardware. With suitable support, many firmware-implemented parallel abstractions would require only a handful of instructions with a similarly low number of cycles of overhead.
 
 This arrangement solves a number of problems currently facing language designers and runtime implementers, as shall be seen throughout the rest of the paper. First, it makes all application-resident information available to control the innermost level of hardware, right down to swapping contexts in and out of registers. Second it increases practicality of domain-specific languages, which is one main path to high programmer productivity. Third it improves portability directly and supports a software stack arrangement that may be a viable long-term solution to portability.
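
The hunk above argues that parallelism constructs should be realized by a firm-ware-resident language runtime rather than by dedicated hardware. As a rough illustration only, the following minimal C sketch (hypothetical names throughout; the firm-ware placement and the real register-level context switch are not modeled) shows the division of labor being proposed: the mutex is runtime state, and on contention the runtime, not the hardware, decides which context runs next.

```c
/* Minimal sketch, hypothetical names: a mutex realized entirely as
 * language-runtime state rather than as a hardware primitive such as CAS.
 * "Contexts" are plain records handed back and forth; no real register-level
 * switch or firm-ware placement is modeled. */
#include <stdio.h>

typedef struct VirtProc {              /* one schedulable context ("virtual processor") */
    int id;
    struct VirtProc *next;
} VirtProc;

typedef struct {                       /* the language runtime acting as the HAL */
    VirtProc *ready_head, *ready_tail;
} Runtime;

static void runtime_enqueue(Runtime *rt, VirtProc *vp) {
    vp->next = NULL;
    if (rt->ready_tail) rt->ready_tail->next = vp; else rt->ready_head = vp;
    rt->ready_tail = vp;
}

static VirtProc *runtime_dequeue(Runtime *rt) {
    VirtProc *vp = rt->ready_head;
    if (vp) { rt->ready_head = vp->next; if (!rt->ready_head) rt->ready_tail = NULL; }
    return vp;
}

typedef struct {                       /* mutex state owned by the runtime */
    int held;
    Runtime *rt;
    VirtProc *wait_head, *wait_tail;
} RtMutex;

/* Construct handler: in the paper's proposal this would run as firm-ware close
 * to the pipeline.  It returns the context that should run next. */
static VirtProc *mutex_acquire(RtMutex *m, VirtProc *self) {
    if (!m->held) { m->held = 1; return self; }   /* uncontended: keep running   */
    self->next = NULL;                            /* park the caller             */
    if (m->wait_tail) m->wait_tail->next = self; else m->wait_head = self;
    m->wait_tail = self;
    return runtime_dequeue(m->rt);                /* runtime picks the successor */
}

static void mutex_release(RtMutex *m) {
    VirtProc *vp = m->wait_head;
    if (vp) {                                     /* hand the lock to the first waiter */
        m->wait_head = vp->next;
        if (!m->wait_head) m->wait_tail = NULL;
        runtime_enqueue(m->rt, vp);
    } else {
        m->held = 0;
    }
}

int main(void) {
    Runtime rt = { NULL, NULL };
    RtMutex m = { 0, &rt, NULL, NULL };
    VirtProc a = { 1, NULL }, b = { 2, NULL };

    VirtProc *run = mutex_acquire(&m, &a);        /* a acquires without contention     */
    printf("running vp %d\n", run->id);
    run = mutex_acquire(&m, &b);                  /* b blocks; nothing else is ready   */
    printf("%s\n", run ? "another vp runs" : "no vp ready");
    mutex_release(&m);                            /* lock handed to b, b becomes ready */
    run = runtime_dequeue(&rt);
    printf("running vp %d\n", run->id);
    return 0;
}
```

The only thing the hardware would need to export for this sketch is the suspend and dispatch mechanism the two queues stand in for; the construct itself stays in the runtime.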
--- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex Tue Apr 10 07:23:24 2012 -0700
+++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex Wed Apr 11 07:51:08 2012 -0700
@@ -34,7 +34,7 @@
 
 
 
-\title{Multi-core Performance Tuning Using Semantic Information Collected by Instrumenting the Language Runtime}
+\title{Performance Tuning Requires Integration of Multiple Aspects of Application, Runtime, Scheduling, and Hardware, OR Integrated Performance Tuning Using Semantic Information Collected by Instrumenting the Language Runtime}
 
 \author{
 Nina Englehardt\\
@@ -58,12 +58,21 @@
 
 
 \begin{abstract}
-Put the text of your abstract here.
+abstract here.
 \end{abstract}
 
 
+\section{Introduction and Motivation}
 
-\section{Section}
+Performance tuning has two phases, and each involves several aspects: code, runtime implementation, scheduling choices, and hardware. To be effective, a tool used during performance tuning must be part
+of a complete model of computation that ties all aspects of both phases of tuning together. Current tools fall short, both because they lack an encompassing model of computation and because the tools are isolated from each other. Without integration, the user gets an incomplete picture of the computation and must resort to guessing either where the problem lies or what to do to fix it.
+
+In this paper we introduce a model of computation that ties all aspects of performance together, along with instrumentation and visualization that is guided by the model and links all relevant performance
+tuning information together. The model and visualization tools are illustrated with a story line that shows how they are used to performance-tune the standard matrix-multiply application on two multi-core systems.
+
+Although we use standard visualization techniques [cite], our approach differs from previous work in both theoretical and practical aspects. The theory we use is the Holistic Model of Parallel Computation, which ties together parallelism-construct semantics, the scheduling choices made during a run, and specific measurements made on the cores. When put into practice, new kinds of measurements are taken, which complete the picture presented to the user, and each measurement is tied to a specific segment of code. The resulting combination not only identifies performance loss but also ties it back to specific sources and suggests precise fixes, all of which is illustrated in our story line.
+
+\section{Random Early Thoughts}
 
 The units are semantic information, the constraints on them are semantic information, the type of constraint is semantic information. The code executed inside a unit is semantic information.
 
@@ -197,62 +206,9 @@
 text
 
 \begin{biography}{Author 1}[0mm]{file.eps}
-text
-% there must be enough text in the first paragraph to flow around the
-% photo!
 
-text
 \end{biography}
 
-\begin{biography}{Author 2}[0mm]{}
-text
-% there must be enough text in the first paragraph to flow around the
-% photo!
-% Leave filename empty if photo is to be pasted in.
-
-text
-\end{biography}
-
-\begin{biography}{Author 3}[0mm]{nophoto}
-text
-% Use filename nophoto if you don't want to put a photo there at all.
-
-text
-\end{biography}
-
-
-% The columns on the last page must be justified manually using
-% \columnbreak.
-
-
-
-\summary
-
-text
-
-
-
-\begin{discussion}
- {PAPER NUMBER}%
- {PAPER TITLE}%
- {AUTHOR NAMES}%
- {DISCUSSER NAME}%
- {AFFILIATION INCL ADDRESS}%
- {SHORT AFFILIATION}
-
-text
-
-\end{discussion}
-
-
-
-\begin{closure}{AUTHOR NAME}
-
-text
-
-\end{closure}
-
-
 
 \end{document}
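
The introduction added in the hunk above hinges on instrumentation that ties each measurement to a construct, a scheduling decision, and a segment of code. The following minimal C sketch (hypothetical event and field names, with a plain counter standing in for a cycle counter) shows the kind of trace record such instrumentation could emit; it illustrates the idea only and is not the paper's actual implementation.

```c
/* Minimal sketch, hypothetical names: the kind of per-unit trace record a
 * language runtime could emit so that measured time can be tied back to the
 * construct, the scheduling placement, and the source segment involved.
 * Timestamps are a simple counter standing in for a per-core cycle counter. */
#include <stdio.h>

typedef enum { EV_UNIT_START, EV_UNIT_END, EV_WAIT_ON_CONSTRAINT } EventKind;

typedef struct {
    EventKind kind;
    int unit_id;             /* semantic unit of work created by a construct   */
    int core;                /* core the scheduler placed the unit on          */
    const char *construct;   /* construct involved, e.g. "spawn" or "mutex"    */
    const char *src_segment; /* code segment the unit executes                 */
    long long timestamp;     /* stand-in for a cycle counter                   */
} TraceEvent;

static long long fake_clock = 0;   /* placeholder time source for the sketch  */

static void emit(EventKind kind, int unit_id, int core,
                 const char *construct, const char *src_segment) {
    TraceEvent e = { kind, unit_id, core, construct, src_segment, ++fake_clock };
    /* A real tool would append to a per-core buffer; printing keeps the sketch small. */
    printf("t=%lld core=%d unit=%d kind=%d construct=%s src=%s\n",
           e.timestamp, e.core, e.unit_id, (int)e.kind, e.construct, e.src_segment);
}

int main(void) {
    /* Imitate what the runtime would record while one unit of a matrix-multiply
     * kernel runs on core 0 and briefly blocks on a mutex-protected accumulator. */
    emit(EV_UNIT_START,         7, 0, "spawn", "matmul.c:inner_block");
    emit(EV_WAIT_ON_CONSTRAINT, 7, 0, "mutex", "matmul.c:accumulate");
    emit(EV_UNIT_END,           7, 0, "spawn", "matmul.c:inner_block");
    return 0;
}
```

Each record carries enough context that a visualization can group measured time by construct, by core, or by source segment, which is the integration the added introduction calls for.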
