changeset 96:45975739549a

Perf Tune paper
author Sean Halle <seanhalle@yahoo.com>
date Fri, 30 Nov 2012 13:26:57 -0800
parents df22471e0fdb
children 3f338effbfd9
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 0__Papers/VMS/VMS__Foundation_Paper/VMS__Full_conference_version/latex/VMS__Full_conf_paper_2.tex CVs/12_Ag__Industry_CV/CV_w_pubs.pdf CVs/12_Ag__Industry_CV/CV_w_pubs.tex
diffstat 5 files changed, 244 insertions(+), 181 deletions(-) [+]
line diff
     1.1 Binary file 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.pdf has changed
     2.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Mon Nov 12 16:52:20 2012 -0800
     2.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Fri Nov 30 13:26:57 2012 -0800
     2.3 @@ -49,7 +49,7 @@
     2.4  
     2.5  
     2.6  %MOIRAI: MOdel for Integrated Runtime Analysis through Instrumentation
     2.7 -\title{Performance Tuning Scheduling Behavior Using Semantic Information Collected by Instrumenting the Language Runtime}
     2.8 +\title{An Approach to Augment Performance Tuning via Visualization of Scheduling Decisions}
     2.9  
    2.10  %\authorinfo{Nina Engelhardt}
    2.11  %           {TU Berlin}
    2.12 @@ -71,98 +71,77 @@
    2.13  
    2.14  
    2.15  \begin{abstract}
    2.16 -Performance tuning is an important aspect of parallel programming that involves understanding both communication behavior and scheduling behavior. Many good tools exist for identifying hot-spots in code, and idleness due to waiting on synchronization constructs.  These tools help in tuning data layout, and to find  constructs that spend a long time blocked, but leave the user guessing as to why they block for so long, with no other work to fill in.   The answer      often involves finding complex chain-reactions of scheduling decisions.  We propose applying a novel model of parallel computation to guide the gathering and display of runtime scheduling decisions.  which makes such chain-reactions easy to spot and easy to fix. Our work produces a set of performance visualizations which are similar to existing ones, but include details to tie idleness to causal interactions within the runtime and hardware, and from there to the parallelism constructs responsible.
    2.17 -The computation model can be employed to guide instrumenting the runtime of any parallel language or programming model,  and the needed data then collected, without modifying the application.  To simplify illustration, we instrumented the runtime of our SSR pi-calculus inspired programming model, and we walk through a tuning session on a large multi-core machine to illustrate the improvements in  generating hypotheses for the causes of idle cores, and how to reduce the idleness. 
     2.18 +Performance tuning is an important aspect of parallel programming that involves understanding both sequential behavior inside work-units and scheduling behavior that maps those units onto hardware. Many good tools exist for identifying hot-spots in code, where the code inside scheduled units needs to be adjusted. These tools help to tune data layout, and to find constructs that spend a long time blocked. However, they provide less help in understanding why constructs block for so long, and why no other work fills in. The answer often involves complex chain-reactions of scheduling decisions, and finding these in turn requires seeing the patterns of constraints the application has placed on scheduling and how the runtime uses them. We propose supplementing existing tools with an additional tool that identifies these chain reactions and ties the behaviors back to specific spots in the code. To make such a tool, we apply a newly discovered model of the structure of parallel computation as a guide to gathering both the constraints on scheduling and the scheduling decisions made by the runtime. The resulting visualizations are superficially similar to those of existing tools, but include details that tie idleness to chains of causal interactions within the runtime and hardware, and then to the responsible parallelism constructs within the code.
     2.19 +We show how to instrument the runtime of a parallel language or programming model, with no need to modify the application. To simplify illustration, we instrumented the runtime of our own pi-calculus inspired programming model, called SSR, and we walk through a tuning session on a large multi-core machine, which demonstrates the improvement in generating hypotheses for the causes of idle cores and in deciding how to reduce the idleness. 
    2.20  \end{abstract}
    2.21  
    2.22  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    2.23  \section{Introduction}
    2.24  \label{sec:intro}
    2.25  
    2.26 -Performance visualization and tuning tools for parallel programs are critical to achieving good performance, and yet the experience of using current solutions leaves something to be desired. Often, performance tuning requires interpreting abstract views of usage or statistics, trying to \emph{guess} the cause of performance loss. Despite many different views of performance measurements, such as frequency by line of code, message sends and receives, and core usage timelines, the user does not know why a function runs in a particular spot on a particular core, nor whether that is desired or erroneous behavior.
    2.27 +Performance visualization and tuning tools for parallel programs are critical to achieving good performance.
    2.28 +Large factors of improvement can be realized by using existing tools to identify hot-spots and adjust the data layout and other internals of the most time-consuming functions.  
    2.29  
    2.30 -Examining these myriad views can feel like being advised by the fabled six blind men examining an elephant by touching the part nearest to each\cite{PerfToolPoem}: one, reaching the side, compares it to a wall; the second, feeling the tusk, likens it to a spear; the next, touching the trunk, is reminded of a snake, and so on. All of these views are indeed correct and supply important information, but they don't connect to each other, to provide a view of how the parts interact to form a whole.
    2.31  
    2.32 -Likewise, current tools may cover all the parts of the application code, but they often do not  adequately connect their observations to the runtime behavior, scheduling decisions, and consequent hardware behavior. The decision about which task or thread is assigned to which core at what point in time is at the heart of parallel performance, so these choices need to be displayed and connected to the application features influencing them.
    2.33  
    2.34 -Our approach provides a more complete picture by following a new model of parallel computation. The model suggests that the relevant application features are the units of work performed and the  constraints on scheduling them, and that it is crucial to record and display the scheduling decisions taken by the runtime. The model simplifies hypothesis generation by connecting the measurements to each other and to the application, analogous to providing a view of the whole elephant.
    2.35 +However, once such optimization has made progress, the main area available for performance improvement undergoes a shift towards the scheduling  choices made by the runtime. Existing  tools tend to provide less help here, in understanding the complicated interactions among runtime behavior, scheduling decisions, and consequent hardware behavior. The interactions often give rise to chain reactions that leave cores idle. These involve the constraints stated by the application, the way the work is divided, and the choices made during the run for which unit of work to assign to which hardware resource at which point in time. To tune such behaviors, the chains need to be identified and understood.   
    2.36  
    2.37 -Our instrumentation, based on the model, detects the units within the application code when they are scheduled during a run, and visually links them to the units upon which their execution is conditional. The visual arrangement makes identifying idle cores immediate, enables quickly generating the correct hypotheses for the causes of the performance losses, and quickly pinpointing where in the application code changes can be made to improve performance. In effect, the visualization serves as a map showing idle cores linked back to the section of application code responsible.
    2.38 +%The level of complexity is reflected by the fact that finding the optimal set of scheduling choices is known to be an  NP hard problem [].   
    2.39  
    2.40 -The model of computation and our visual approach to using it is illustrated with a case study of performance tuning a standard parallel application on a large multi-core system.
    2.41 +Although current tools do a good job of identifying which constructs block, and which cores sit idle, they don't help much in deciphering such complex chain reactions. Deciphering them requires a visualization of the possibilities open to the scheduler and of the choices it made.
    2.42  
    2.43 -This paper is organized as follows: We start with an overview of previous approaches in Section \ref{sec:related}. We show usage of our visualizations through a case study in Section \ref{sec:casestudy}, and then expand on the model behind it in Section \ref{sec:theory}. Section \ref{sec:Implementation} will relate the model to implementation details. Finally, we conclude in Section \ref{sec:conclusion}.
    2.44 +    
    2.45 +
    2.46 +
    2.47 +To help with this particular aspect of tuning, we propose an additional tool that visualizes the constraints, the runtime choices, and the interactions among them. Our approach is based on a newly discovered model of the structure of parallel computation. It can capture the parallel aspects of a program written in a parallel language through instrumentation of its runtime. It augments existing tools by illuminating the chains of scheduling decisions that lead to lost performance and tying the causes back to particular code features. 
    2.48 +
    2.49 +The theoretical model adds value by establishing a mental framework that simplifies hypothesis generation: it indicates particular quantities to measure, which connect the scheduling measurements to each other and to specific application features.
    2.50 +
    2.51 +The information used during tuning is collected by the runtime as an application executes, which makes it practical and backwards compatible.  An existing runtime is instrumented, at key points that correspond to features of the computation model. The instrumentation detects the application-defined units of work and constraints on scheduling them, as well as recording the choices made by the runtime's scheduler.    
    2.52 +
    2.53 +The information is used to visually link each unit to the units upon which its execution is conditional. This includes causes due to application constraints, runtime constraints, and hardware constraints. The visual links establish the chain-reactions of scheduling decisions that lead to lost processing opportunity. Semantics attached to the visual features  enable quickly generating the correct hypotheses for the causes of  lost computation opportunities, and quickly narrowing down what can be done, in the application code, to improve performance. In effect, the visualization serves as a map showing idle cores linked back to the section of application code responsible.
    2.54 +
    2.55 +
    2.56 +To simplify seeing the value of the approach,  we walk through a session of tuning scheduling choices, in which we point out the semantics of features in the visualization, what each visual feature implies, and how these implications lead to hypotheses and fixes.   Although the views look similar to current tools, they differ in the semantics of the visual features.
    2.57 +
    2.58 +  
    2.59 +
    2.60 +The contribution of this paper is the way the parallel model has been applied, rather than any aspect of the model itself. The distinction is important because the paper's space is used to convey the value gained by the process of applying the model. Only enough of the model is stated to understand where the value comes from and how to instrument a runtime to gain it.
    2.61 +
    2.62 +This paper is organized as follows:  we introduce the semantics of features of our visualization  by walking through a tuning session in Section \ref{sec:casestudy}, and then expand on the theoretical model behind it in Section \ref{sec:theory}. Section \ref{sec:Implementation} gives  details of instrumenting a runtime to collect data according to the model. We relate the approach to other tuning tools in Section \ref{sec:related} and draw conclusions in Section \ref{sec:conclusion}.
    2.63 +
    2.64 +
    2.65  
    2.66  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    2.67 -\section{Background and Related Work}
    2.68 -\label{sec:related}
    2.69 -
    2.70 -When choosing a performance tuning tool, three primary factors to consider are:
    2.71 -\vspace{-3pt}
    2.72 -\begin{itemize} 
    2.73 -\setlength{\itemsep}{0em}
    2.74 -\item the ease of recognizing performance loss
    2.75 -\item the ease of forming a hypothesis to explain the loss
    2.76 -\item the ease of linking the hypothesis to changes to make in the application code
    2.77 -\end{itemize}
    2.78 -\vspace{-3pt}
    2.79 -The mental model that underlies the tool plays a key role in these three usefulness factors. This can be seen  by examining the  activity of performance tuning.  It is an iterative activity. To start, the programmer takes measurements during execution and compares them to the desired outcome. A hypothesis is generated for what caused any differences, by using a mental model, of causal relationships within  execution. The mental model is then also used to link the hypothesis to code, to suggest a change to make to improve performance. After modification, the code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central during performance tuning.
    2.80 -
    2.81 -As a result of its role when using  a tuning tool, the mental model underlying the tool is   a major focus when comparing  tools.  The mental model of our approach is based on a new model of parallel computation and is the main way  our approach differs from other approaches. The new model makes it  easier to form a hypothesis of the cause of performance losses, and easier to link that back to changes to make in application code.
    2.82 -
    2.83 - Historically, the dominant mental model for parallel computation has been the software-thread abstraction. The model is very close to the hardware, without much higher-level semantics. This makes it  a problem when analyzing parallel performance. Yet as the earliest and most widespread parallel programming model, and one that serves as the basis on top of which many other models are implemented, it must be supported.
    2.84 -
    2.85 -A popular version is  pthreads, which introduces randomness through OS time-slicing, and whose synchronization constructs  imply only indirect connections. For instance, very little conclusion can be drawn about the relationship between the computations in two separate threads when they access the same lock consecutively. Consequently, tools such as Paradyn \cite{PerfToolParadyn} or VTune \cite{PerfToolVTune} that rely on this model  have a hard time connecting measurements to the application code details. They collect a wealth of statistics, but the application is to the tool a foreign process, where things ``just happen'' without causal links between events and code. Such tools generate conclusions such as ``application bottleneck is synchronization'', and  ``too much time is spent in spinlocks'', without help as to what in the code causes this. For example, if there is more than one use of spinlocks in the application, it is not clear what point in the code is the problem.
    2.86 -
    2.87 -One fix to these issues is to allow users to introduce measuring points into their own code, at the cost of increased programmer effort.  Because instrumentation code is written in the source language, it has access to application concepts. This advantage can be kept with automated instrumentation, by providing an instrumenting compiler, like the Tau \cite{PerfToolTau} project does.
    2.88 -
    2.89 -However, as long as the underlying execution model is still threads, there is no meaningful structure common to all applications to use to generate expressive measurement quantities. Function boundaries and the call graph are not sufficient when parallel execution is performed. The sequence and frequency of function calls tell little about parallel performance impacts because they have no bearing on synchronization events.  Unfortunately, pthreads does not capture even hints as to \emph{why} a given function call ends up blocking or not blocking, and what the effects on other threads are.
    2.90 -
    2.91 -Higher level parallel abstractions help alleviate that particular problem. For instance, in an application with MPI message passing \cite{MPI}, the information ``thread 2 spends little time waiting for messages from thread 0 but a lot of time waiting for messages from thread 1'' can be recorded, where in pthreads only ``thread 2 spends a lot of time waiting for a signal'' would be visible. It is much easier to reach the conclusion that the bottleneck is the slow rate at which thread 1 produces data.
    2.92 -
    2.93 -Leveraging the MPI library is a well-trodden path, with many tools available \cite{PerfToolParaver} \cite{PerfToolVampir} \cite{PerfToolParadyn} \cite{PerfToolTau}. But a frequent complaint of users of these tools is that in complex programs, the large number of messages sent makes it difficult to see problems. As there is little that distinguishes messages besides size, it is difficult to implement good filtering. Once again, we find that the limitations come from a parallel model that captures insufficient information.
    2.94 -
    2.95 -Fortunately, parallel languages are evolving in the same direction, and it is widely believed that semantically rich parallel constructs, when they match the structure of the application well, are key to improved productivity. With languages such as CnC \cite{CnCInHotPar} and StarSs \cite{StarSs}, the information available to the language runtime offers a much clearer picture of the constraints  placed on the execution of tasks and where they come from. Consequently, the application structure can be well reflected in visualizations, as the Temanejo debugger  \cite{PerfToolStarSs} for StarSs demonstrates.
    2.96 -
    2.97 -In the following section, we will demonstrate how our approach overcomes the various challenges of these other approaches through a combination of the following features:
    2.98 -\vspace{-3pt}
    2.99 -\begin{itemize}
   2.100 -\setlength{\itemsep}{0em}
   2.101 -\item  it visualizes all the factors influencing scheduling decisions, given sufficient language and runtime pre-requisites
   2.102 -\item it has a filtering mechanism that shows only the most relevant groups of constraints, as desired
   2.103 -\item it has flexibility to capture and represent any type of constraints
   2.104 -\item it works for any parallel programming library or language
   2.105 -\item it instruments only the runtime, once,  for all applications
   2.106 -\end{itemize}
   2.107 -\vspace{-5pt}
   2.108 -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   2.109 -\section{Case Study: Analyzing an SSR Application}
   2.110 +\section{Illustration of Tuning Scheduling Decisions}
   2.111  \label{sec:casestudy}
   2.112  
   2.113 -To concretely illustrate the benefits of connecting measurements to an effective computation model, in this section we walk through a typical performance tuning session using our approach. This shows the features  in action, and indicates how having those features provide an advantage to the user over methods that lack them.
    2.114 +To demonstrate the value of the approach, we create a slightly contrived example that brings out particular features of the tool, but uses a real application running on real hardware. 
   2.115  
   2.116 -We first describe the program and language used, and then the features of our visualization. After this preparation, we show a sequence of the visualizations. In each,  we point out how the  performance loss is identified, and which visual features suggest the hypothesis of the cause of the loss.  We show how the visual features direct the user to the specific sections of code that need to be changed, and how the model suggests what changes to try. 
    2.117 +We start by describing the hardware being run on and the structure of the program being tuned, followed by the parallel language used and instrumented. We then describe the features of the visualization.
   2.118  
   2.119 -\subsection{Application and Target Hardware}
    2.120 + A sequence of visualizations follows. In each, we point out how the performance loss is identified, and which visual features suggest the hypothesis of the cause of the loss. We show how the visual features direct the user to the specific sections of code that need to be changed, and how the model suggests what changes to try. 
   2.121  
   2.122 -In our session, we wish to tune a standard program that the reader  knows well. The best example is likely matrix multiply, with which the reader should be familiar, allowing concentration on the tool without distraction about the application. 
   2.123 +\subsection{Setup: Hardware and Application}
   2.124  
   2.125 -We run it on a machine with 4 sockets by 10 cores each, for a total of 40 physical cores. They are Intel WestmereEx cores running at 3.0GHz, with TurboBoost turned off for reproducability. 
   2.126  
   2.127 -The application code uses language features to create virtual processors (VP). The first VP created divides the work into a number of pieces and creates a new VP for each piece. 
   2.128  
   2.129 -How many pieces is set by a tuning parameter in the code, and the number of cores. The application uses language features to distribute the VPs across the cores,  in a round-robin fashion unless otherwise specified.
    2.130 +We run our example on a machine with 4 sockets by 10 cores each, for a total of 40 physical cores. They are Intel WestmereEx cores running at 3.0GHz, with TurboBoost turned off for reproducibility. 
   2.131  
   2.132 -It then creates a results VP that receives a partial-result from each piece and accumulates the results. The  divider VP  then waits for the results VP to indicate completion, after which the language runtime shuts down.
    2.133 +We wish to tune scheduling on a standard program that the reader knows well. We chose matrix multiply, which allows concentration on the tool without distraction about the application. 
   2.134 +
    2.135 +The application is structured as follows: it creates one entity to divide the work into a number of pieces and creates another entity for each piece of work. The number of pieces is set by the combination of a tuning parameter in the code and the number of cores. The work is distributed across the cores in a round-robin fashion unless otherwise specified.
   2.136 +
    2.137 +The application also creates an entity that manages the partial-results. Each piece of work sends its contribution, which is accumulated into the overall result. The entity that divides waits for the entity that accumulates to signal completion, after which the language runtime shuts down.
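The divide/distribute/accumulate shape just described can be sketched in a few lines. This is an illustration only, not the SSR application itself; the function names and the stand-in arithmetic are hypothetical:

```python
# Illustrative sketch (hypothetical names, not the SSR code itself) of
# the structure described above: a divider splits the work into pieces,
# the pieces are placed round-robin, and an accumulator sums partials.

def round_robin_placement(num_pieces, num_cores):
    """Default placement: piece i goes to core i mod num_cores."""
    return {piece: piece % num_cores for piece in range(num_pieces)}

def run_example(num_cores=4, pieces_per_core=2):
    # number of pieces = tuning parameter combined with the core count
    num_pieces = num_cores * pieces_per_core
    placement = round_robin_placement(num_pieces, num_cores)
    # each worker-piece produces a partial result (stand-in arithmetic);
    # the results entity accumulates them into the overall result
    partials = [piece for piece in range(num_pieces)]
    total = sum(partials)
    return placement, total
```

The round-robin default matters later in the tuning session: it determines which pieces share a socket, and hence which chains of scheduling decisions arise.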
   2.138  
   2.139  \subsection{Programming Model}
   2.140 -The language used was Synchronous Send-Receive (SSR), a self-implemented language based on rendez-vous style send and receive operations made between virtual processors (VPs), which are equivalent to software threads. The example application uses commands for creating and destroying VPs, and two kinds of send-receive paired operations. 
    2.141 +We chose a simple language that was convenient to instrument: Synchronous Send-Receive (SSR), inspired by pi-calculus. It implements rendez-vous style send and receive operations between virtual processors (VPs), where a VP is similar to a software thread. The example application uses the commands for creating and destroying VPs, two kinds of send-receive paired operations, a parallel singleton, and scheduling control constructs. 
   2.142  
   2.143 -The first, \emph{send\_from\_to} specifies both sender and receiver VPs. The results VP uses this to tell the divider VP that the work is complete. The second, \emph{send\_of\_type\_to}, specifies only a specific receiver, leaving the sender anonymous. It has increased flexibility while maintaining some control over scope. The worker VPs use this to send their partial result to the results VP. 
   2.144 +The first kind of send-receive pair used, and instrumented, is precise about sender and receiver. Called \emph{send\_from\_to}, it specifies both sender and receiver VPs, and is used by the results VP  to tell the divider VP that the work is complete. The second pair, \emph{send\_of\_type\_to}, specifies only a specific receiver, leaving the sender anonymous. It has increased flexibility while maintaining some control over scope. The worker VPs use this to send their partial result to the results VP. 
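The difference between the two pairs is what the runtime matches on. The sketch below is hypothetical (the real SSR operations are blocking rendez-vous between VPs; here sends are simply queued), but it shows the matching keys: \emph{send\_from\_to} is keyed on both endpoints, while \emph{send\_of\_type\_to} leaves the sender anonymous:

```python
# Hypothetical sketch of what each send variant matches on; the real
# SSR operations are blocking rendez-vous, simplified here to queues.

class Rendezvous:
    def __init__(self):
        self.pending = {}

    def send_from_to(self, sender, receiver, msg):
        # precise about both endpoints: keyed on (sender, receiver)
        self.pending[("pair", sender, receiver)] = msg

    def recv_from_to(self, sender, receiver):
        return self.pending.pop(("pair", sender, receiver), None)

    def send_of_type_to(self, msg_type, receiver, msg):
        # sender stays anonymous: keyed on (type, receiver) only
        self.pending.setdefault(("typed", msg_type, receiver), []).append(msg)

    def recv_of_type(self, msg_type, receiver):
        queue = self.pending.get(("typed", msg_type, receiver), [])
        return queue.pop(0) if queue else None
```

The anonymous-sender variant is what lets any worker VP deliver its partial result to the results VP without the results VP naming each worker in advance.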
   2.145  
   2.146 -The language also includes a \emph{singleton} construct that designates a piece of code as to be executed only once. It is used to shift the work of copying matrix fragments out of the divider, which is sequential, over to the worker-pieces, which spreads the work across many cores. Several worker VPs share fragments, but the copy has only to be performed once. This construct allows whatever VP is ready first to perform the copy.
    2.147 +The application uses the \emph{singleton} construct to reduce the amount of work done by the (sequential) divider VP. The construct designates a piece of code to be executed only once, even though it is invoked by multiple VPs. It is employed to shift the work of copying matrix fragments out of the divider and over to the worker-pieces. The first worker-piece to use a given input-matrix fragment performs the copy, which spreads the copying across the cores.
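The execute-once behavior can be conveyed with a minimal guard. This is a sketch under assumptions: the keyed interface below is hypothetical, whereas the real construct is identified by its location in the code:

```python
# Once-only guard approximating the singleton semantics (hypothetical
# keyed interface; the real construct is identified by code location).

def make_singleton():
    done = {}
    def singleton(key, action):
        if key not in done:
            done[key] = action()   # first VP to arrive performs the copy
        return done[key]           # later VPs reuse the result
    return singleton
```

Whichever VP is scheduled first performs the copy; all later VPs sharing the fragment fall through to the cached result, which is why the copying spreads across cores instead of serializing in the divider.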
   2.148  
   2.149 -Miscellaneous performance constructs are also available, such as one to force which core a virtual processor is assigned to. We use this in our example program to control scheduling.
   2.150 +We control the scheduling behaviors, in order to concisely illustrate the use and features of the tool. This is done with language constructs that force which core a virtual processor is assigned to. 
   2.151  
   2.152  A note on terminology: We often use the term  ``work-unit'',  which we define precisely, instead of ``task'', which has acquired multiple  meanings in the literature. Work-unit  is defined as the trace-segment performed on a core, between two successive scheduling events, plus the set of datums consumed by that trace segment. The word task often maps well onto this  definition, and we use both words, but mean the precise work-unit definition when we say task.
   2.153  
   2.154 @@ -170,13 +149,18 @@
   2.155  
   2.156  \subsection{The Visualizations}
   2.157  \label{subsec:visualization_def}
   2.158 - The first visualization is what we refer to as a  Unit \& Constraint Collection, or UCC. It depicts the constraints on the scheduling decisions that come from the application. Constraints can be dependencies stated explicitly in the code, or can be implied by language constructs. These limit the choices the runtime is allowed to make. We use the more general term ``constraint'' instead of the more common ``dependency'' because dependency implies an order, while we also want to include non-ordering constraints such as e.g. a mutual exclusion.
    2.159 + The approach has two kinds of visualization, each corresponding to an aspect of the model. One focuses on just the application, conveying its scheduling-related structure. Its main value is understanding what's possible. The other focuses on how this structure interacts with the runtime and hardware. Its value is displaying the chain reactions and linking each step in a chain back to application code. 
   2.160  
   2.161 -The second visualization we call the scheduling consequence graph (SCG), or just consequence graph. It depicts the scheduling operations performed by the runtime, and the consequent usage of the cores. 
    2.162 + We refer to the first visualization as a Unit \& Constraint Collection, or UCC, which is explained in \S \ref{sec:UCCExpl}. Its elements are the application's units of work and the constraints on scheduling them. Constraints can be explicitly stated in the code, or implied by language constructs. They limit the choices available to the runtime. We use the general term ``constraint'' instead of the specific term ``dependency'' because dependency only covers one pattern: this unit before that one. Constraints are more general, covering for example a mutual exclusion on scheduling a group of units: any order, but only one at a time.
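The distinction can be made concrete with a small eligibility check. This sketch is illustrative (the representation of constraints is an assumption, not SSR's internal one): an ordering edge fixes who must finish first, while a mutual-exclusion group permits any order but forbids two of its units running at once:

```python
# Sketch of constraint generality: an ordering edge (a "dependency")
# fixes who finishes first, while a mutual-exclusion group allows any
# order but forbids two of its units running at the same time.

def eligible(unit, finished, running, order_edges, mutex_groups):
    # ordering constraints: every constraining unit must have finished
    if any(pred not in finished
           for (pred, succ) in order_edges if succ == unit):
        return False
    # mutual exclusion: no other unit of a shared group may be running
    for group in mutex_groups:
        if unit in group and any(u in running for u in group if u != unit):
            return False
    return True
```

Note that a mutex group constrains without imposing any order, which is exactly the pattern the narrower term ``dependency'' fails to cover.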
   2.163  
   2.164 -The UCC shows only application-derived information, as opposed to the consequence graph, which combines the \textit{use} of the UCC-depicted constraints along with runtime-imposed dependencies and hardware-imposed constraints.  Hence, the UCC states the degrees of freedom enabled by the application, while the consequence graph states how those were made use of, by a particular runtime on particular hardware.
   2.165 +We refer to the second visualization as a scheduling consequence graph (SCG), or just consequence graph, which is explained in \S \ref{sec:SCGExpl}. It depicts where the runtime assigned each of the units, how long the units executed, the sources of overhead, and the changes in constraint-state that triggered each runtime behavior. 
   2.166  
   2.167 -\subsubsection{UCC visualization}
   2.168 +In short, the UCC states the degrees of freedom enabled by the application, while the consequence graph states how those were made use of, by a particular runtime on particular hardware.
   2.169 +
   2.170 +\subsubsection{UCC visualization} \label{sec:UCCExpl}
   2.171 +
   2.172 +The value of the UCC is as a helper when brainstorming ways to get more usable parallelism out of an application.   It is used less often than the SCG.  
   2.173 +
   2.174  \begin{figure}[ht]
   2.175    \centering
   2.176    \includegraphics[width = 3.2in, height = 1.8in]{../figures/UCC_stylized_for_expl.pdf}
   2.177 @@ -184,13 +168,14 @@
   2.178    \label{fig:UCC_expl}
   2.179  \end{figure}
   2.180  
   2.181 -Fig. \ref{fig:UCC_expl} shows two versions of a UCC, both stylized for purposes of explanation.  These are used less often than the SCG,  mainly as a helper when brainstorming ways to get more usable parallelism out of an application. 
   2.182 -
   2.183 -Both versions of the UCC depict constraints on units. Starting  at the top, they  progress downwards,  placing units that constrain  above those constrained by them, and connecting them by a line representing the constraint. A unit is placed into the row just below the last of  the units constraining it, in the first available column. However, the rows and columns are just visual conveniences; the only information in the UCC is the lines between units, which represent constraints among the units. In practice, each unit is also labeled with a unique unit ID, from which the code of the unit and the parameters passed to it can be looked up. 
   2.184 +Fig. \ref{fig:UCC_expl} shows two alternative versions of a UCC, both stylized for purposes of explanation.  One treats all units as the same size; the other adds a notion of the amount of work in each unit. Both are read from the top downwards. Each unit is placed above the units it constrains, and is connected to them by a line representing the kind of constraint. A unit is placed into the row just below the last of the units constraining it, in the first available column. However, the rows and columns are just visual conveniences; the only information in the UCC is the units and the constraints among them. In practice, each unit is also labeled with a unique unit ID, from which the code of the unit and the parameters passed to it can be looked up. 
   2.185  
   2.186  What distinguishes the two types of UCC is the size of the units. The left one shows only the logical graph-like structure of the application, while the right weights the height of a unit by the number of instructions executed in it. When information about unit size is available, it can visually indicate the \textit{useable} parallelism available.
   2.187  
   2.188 -\subsubsection{SCG visualization}
   2.189 +\subsubsection{SCG visualization}  \label{sec:SCGExpl}
   2.190 +The Scheduling Consequence Graph is the main visualization used to detect chains of causal interactions between elements of the system.  For example, a unit may end, which signals the runtime to update the state of the unit and of the hardware resource it occupied. The runtime then chooses a different unit to own that hardware and sends that unit's meta-information to the hardware, which in turn triggers movement of the data the new unit consumes. Finally, the work of the new unit takes place there.
   2.191 +
   2.192 +Any one of these interactions could be individually abnormal, and an unexpected source of performance loss. For example, congestion in the network may cause data movement to be abnormally slow for that particular new unit. The SCG lets the user quickly see where idleness takes place, then turn on visualization of each kind of causal interaction until the culprit for the unexpected slowdown is found.  
   2.193  \begin{figure}[ht]
   2.194    \centering
   2.195    \includegraphics[width = 2in, height = 1.8in]{../figures/SCG_stylized_for_expl.pdf}
   2.196 @@ -198,15 +183,14 @@
   2.197    \label{fig:SCG_expl}
   2.198  \end{figure}
   2.199  
   2.200 -Fig. \ref{fig:SCG_expl} shows a consequence graph, stylized for purposes of explanation. This is the main visualization used during performance tuning, because it shows the consequent uses of the cores in the machine, and links those to the constraints and code in the application. 
   2.201 -
   2.202 +Fig. \ref{fig:SCG_expl} shows a consequence graph, stylized for purposes of explanation. 
   2.203  It is composed of a number of columns, one for each core. A column represents time on the core,  increasing as one goes down, measured in clock cycles. It is broken into blocks, each representing the time accounted to one work-unit. Each block is further divided into regions, each a different color, which indicates the kind of activity the core was engaged in during that  region's time-span.
   2.204  
   2.205  The application code executed within a block is linked to the block. In our tool, the block is labelled with a unique unitID. This ID  is then linked to the code executed within that unit. In this way, the code of any block can be looked up, along with the parallelism constructs that mark the start and end of the block.
   2.206  
   2.207  The kinds of activities within a block are defined by the computation model that underlies the visualization. The first kind of activity is the actual work, plus  waiting for  cache misses. It is represented by a blue-to-red region where the color indicates  intensity of cache misses, with pure red representing at or above the maximum misses per instruction, and pure blue the minimum (the max and min are set in the tool that generates the visualization).
   2.208  
   2.209 -  The second kind of activity is runtime overhead, represented by a gray region. This is the overhead spent on that particular work-unit. When desired by the user, it is further broken into pieces representing activities inside the runtime. The options include time spent on: constraints, when determining readiness of the work-unit; deciding which ready unit to assign to which hardware; and time spent switching from virtual processor, to the runtime, and back. In this paper, we show all runtime overhead lumped together, however in other circumstances a breakdown can be key to understanding interaction between runtime and application. 
   2.210 +  The second kind of activity is runtime overhead, represented by a gray region. This is the overhead spent on that particular work-unit. When desired by the user, it is further broken into pieces representing activities inside the runtime. The options include time spent on: constraints, when determining readiness of the work-unit; deciding which ready unit to assign to which hardware; and time spent switching from virtual processor, to the runtime, and back. In this paper, we show all runtime overhead lumped together; however, in other circumstances a breakdown can be key to seeing where unexpected slowdown is taking place. 
   2.211  
   2.212  The other type of visual feature seen in Fig. \ref{fig:SCG_expl} is lines. Each represents a construct that influenced scheduling, where the color indicates what kind of construct.   A line represents two things: a constraint, whose satisfaction made the lower unit ready, and a decision by the runtime to start the lower unit on that core. 
   2.213  
   2.214 @@ -430,6 +414,9 @@
   2.215  In practice, the question of levels does not come up very often for multi-core machines, as there is generally only one level of interest being investigated. However, the perspective is occasionally important here, e.g. when identifying the different components of the model in a concrete runtime so as to insert instrumentation code in the right places.
   2.216  
   2.217  
   2.218 +
   2.219 +
   2.220 +
   2.221  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   2.222  \section{Implementation}
   2.223  \label{sec:Implementation}
   2.224 @@ -555,6 +542,58 @@
   2.225  
   2.226  All this information is taken purely from the runtime, leading to a rich, configurable visualization without needing to add anything to the application.
   2.227  
   2.228 +
   2.229 +
   2.230 +
   2.231 +
   2.232 +
   2.233 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   2.234 +\section{Background and Related Work}
   2.235 +\label{sec:related}
   2.236 +
   2.237 +When choosing a performance tuning tool, three primary factors to consider are:
   2.238 +\vspace{-3pt}
   2.239 +\begin{itemize} 
   2.240 +\setlength{\itemsep}{0em}
   2.241 +\item the ease of recognizing performance loss
   2.242 +\item the ease of forming a hypothesis to explain the loss
   2.243 +\item the ease of linking the hypothesis to changes to make in the application code
   2.244 +\end{itemize}
   2.245 +\vspace{-3pt}
   2.246 +The mental model that underlies the tool plays a key role in all three of these factors, as can be seen by examining the activity of performance tuning, which is iterative. To start, the programmer takes measurements during execution and compares them to the desired outcome. A hypothesis for what caused any differences is generated using a mental model of the causal relationships within the execution. The mental model is then also used to link the hypothesis to code, suggesting a change to make to improve performance. After modification, the code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central throughout performance tuning.
   2.247 +
   2.248 +Because of this role, the mental model underlying a tool is a major focus when comparing tools.  The mental model of our approach is based on a new model of parallel computation, and this is the main way our approach differs from others. The new model makes it easier to form a hypothesis of the cause of performance losses, and easier to link that back to changes to make in application code.
   2.249 +
   2.250 + Historically, the dominant mental model for parallel computation has been the software-thread abstraction. The model is very close to the hardware, without much higher-level semantics, which makes it problematic for analyzing parallel performance. Yet as the earliest and most widespread parallel programming model, and one that serves as the basis on top of which many other models are implemented, it must be supported.
   2.251 +
   2.252 +A popular version is pthreads, which introduces randomness through OS time-slicing, and whose synchronization constructs imply only indirect connections. For instance, very little can be concluded about the relationship between the computations in two separate threads when they access the same lock consecutively. Consequently, tools such as Paradyn \cite{PerfToolParadyn} or VTune \cite{PerfToolVTune} that rely on this model have a hard time connecting measurements to details of the application code. They collect a wealth of statistics, but to the tool the application is a foreign process, where things ``just happen'' without causal links between events and code. Such tools generate conclusions such as ``the application bottleneck is synchronization'' or ``too much time is spent in spinlocks'', without help as to what in the code causes this. For example, if spinlocks are used in more than one place in the application, it is not clear which point in the code is the problem.
   2.253 +
   2.254 +One fix to these issues is to allow users to introduce measuring points into their own code, at the cost of increased programmer effort.  Because instrumentation code is written in the source language, it has access to application concepts. This advantage can be kept with automated instrumentation, by providing an instrumenting compiler, like the Tau \cite{PerfToolTau} project does.
   2.255 +
   2.256 +However, as long as the underlying execution model is still threads, there is no meaningful structure, common to all applications, from which to generate expressive measurements. Function boundaries and the call graph are not sufficient for parallel execution. The sequence and frequency of function calls tell little about parallel performance impacts because they have no bearing on synchronization events.  Unfortunately, pthreads does not capture even hints as to \emph{why} a given function call ends up blocking or not blocking, and what the effects on other threads are.
   2.257 +
   2.258 +Higher level parallel abstractions help alleviate that particular problem. For instance, in an application with MPI message passing \cite{MPI}, the information ``thread 2 spends little time waiting for messages from thread 0 but a lot of time waiting for messages from thread 1'' can be recorded, where in pthreads only ``thread 2 spends a lot of time waiting for a signal'' would be visible. This makes it much easier to conclude that the bottleneck is the slow rate at which thread 1 produces data.
   2.259 +
   2.260 +Leveraging the MPI library is a well-trodden path, with many tools available \cite{PerfToolParaver} \cite{PerfToolVampir} \cite{PerfToolParadyn} \cite{PerfToolTau}. But a frequent complaint of users of these tools is that in complex programs, the large number of messages sent makes it difficult to see problems. As there is little that distinguishes messages besides size, it is difficult to implement good filtering. Once again, we find that the limitations come from a parallel model that captures insufficient information.
   2.261 +
   2.262 +Fortunately, parallel languages are evolving in the same direction, and it is widely believed that semantically rich parallel constructs, when they match the structure of the application well, are key to improved productivity. With languages such as CnC \cite{CnCInHotPar} and StarSs \cite{StarSs}, the information available to the language runtime offers a much clearer picture of the constraints  placed on the execution of tasks and where they come from. Consequently, the application structure can be well reflected in visualizations, as the Temanejo debugger  \cite{PerfToolStarSs} for StarSs demonstrates.
   2.263 +
   2.264 +In the following section, we will demonstrate how our approach overcomes the various challenges of these other approaches through a combination of the following features:
   2.265 +\vspace{-3pt}
   2.266 +\begin{itemize}
   2.267 +\setlength{\itemsep}{0em}
   2.268 +\item  it visualizes all the factors influencing scheduling decisions, given sufficient language and runtime prerequisites
   2.269 +\item it has a filtering mechanism that shows only the most relevant groups of constraints, as desired
   2.270 +\item it is flexible enough to capture and represent any type of constraint
   2.271 +\item it works for any parallel programming library or language
   2.272 +\item it instruments only the runtime, once,  for all applications
   2.273 +\end{itemize}
   2.274 +\vspace{-5pt}
   2.275 +
   2.276 +
   2.277 +
   2.278 +
   2.279 +
   2.280  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   2.281  \section{Conclusion}
   2.282  \label{sec:conclusion}
     3.1 --- a/0__Papers/VMS/VMS__Foundation_Paper/VMS__Full_conference_version/latex/VMS__Full_conf_paper_2.tex	Mon Nov 12 16:52:20 2012 -0800
     3.2 +++ b/0__Papers/VMS/VMS__Foundation_Paper/VMS__Full_conference_version/latex/VMS__Full_conf_paper_2.tex	Fri Nov 30 13:26:57 2012 -0800
     3.3 @@ -193,12 +193,37 @@
     3.4  
     3.5  We follow this in \S with an in-depth look at implementing language logic, and show how the proto-runtime interface allows it to use sequential thinking. We then give similar detail in \S on the implementation of the assigner, which chooses what core executes each chunk of work. We discuss how that has the potential to improve application performance by reducing communication between cores and reducing idle time of cores.  In \S we support our belief that the patterns we followed when modularizing are indeed fundamental and will remain valid for future languages and hardware. In \S we discuss some of the centralized services provided by the current proto-runtime implementation, as well as planned future ones. Then in \S we give an example of reusing language logic from one language implementation to another. 
     3.6  
     3.7 +
     3.8 +%%%%%%%%%%%%%%%%%%%%%%%%
     3.9 +%%
    3.10 +%%%%%%%%%%%%%%%%%%%%%%%%
    3.11  \section{Measurements}
    3.12  With the background on eDSLs and description of the proto-runtime approach behind us, we then provide overhead measurements in \S and implementation time measurements in \S
    3.13  
    3.14  \subsection{Overhead Measurements}
    3.15 +For the following, we use a single-socket, 4-core, 2.4 GHz laptop, and a server with 4 sockets of 10 cores each.
    3.16  
    3.17 +For runtime performance:
    3.18 +
    3.19 +-- Vthread vs pthread: laptop and server on exe vs task (and fibonacci?)
    3.20 +
    3.21 +-- VCilk vs Cilk: laptop and server on fibonacci (from Albert)
    3.22 +
    3.23 +-- VOMP vs OpenMP: laptop and server on exe vs task and fibonacci
    3.24 +
    3.25 +-- VSs vs OMPSs: laptop and server on fibonacci and jpeg
    3.26 +
    3.27 +\begin{table}
    3.27 +\begin{tabular}{|c|c|c|c|c|c|c|}\hline
    3.28 +a & a & a & a & a & a & a \\\hline
    3.29 +a & a & a & a & a & a & a \\\hline
    3.30 +a & a & a & a & a & a & a \\\hline
    3.31 +a & a & a & a & a & a & a \\\hline
    3.32 +\end{tabular}
    3.33 +\caption{}
    3.34 +\label{tab}
    3.34 +\end{table}
    3.35 +
    3.36 +As seen, we did not include application performance measurements, because we have not yet taken advantage of the opportunity to use language information to predict locality.  That research is in progress and will be reported in future papers.
    3.37 +
    3.38  \subsubsection{Vthread Versus Highly Tuned Posix Threads}
     3.39 Measurements indicate that the proto-runtime approach has far lower overhead than even the current highly tuned Linux thread implementation. This section also discusses why equivalent user-level M-to-N thread packages have not been pursued, leaving no viable user-level libraries to compare against.  
    3.40  \subsubsection{VCilk Versus Cilk 5.4}
    3.41 @@ -208,32 +233,59 @@
    3.42  \subsubsection{VOMP Versus OpenMP}
    3.43  VOMP
    3.44  
    3.45 +
    3.46 +%%%%%%%%%%%%%%%%%%%%%%%%
    3.47 +%%
    3.48 +%%%%%%%%%%%%%%%%%%%%%%%%
    3.49  \subsection{Development Time Measurements}
    3.50 -Here we summarize the time to develop each of the eDSLs created so far. As a control, we estimate how long the equivalent functionality required, using the traditional approach, based on anecdotal evidence.
    3.51 +Here we summarize the time to develop each of the eDSLs and copy-cat languages created so far. As a control, we estimate how long the equivalent functionality would require using the traditional approach, based on anecdotal evidence.
    3.52 +
    3.53 +As summarized in Table \ref{tabPersonHoursLang}, we measured the time we spent to design, code, and get an initial version working for each of the languages we created.  The results are shown in the order we created them, with SSR first. As we gained experience, design and coding became more efficient. Not shown are the 7 hours required to take the send-receive code from SSR and adapt it to work with tasks in VSs, and the 11 hours spent importing the DKU constructs into VSs.  These are hours spent at the keyboard or with pen and paper, and don't include think time during other activities in the day.
    3.54 + 
    3.55 +
    3.56 +\begin{table}
    3.56 +\begin{centering}
    3.57 +\begin{tabular}{|l|r|r|r|r|r|r|}
    3.58 +  \cline{2-7}
    3.59 +  \multicolumn{1}{r|}{} & SSR & Vthread & VCilk & HWSim & VOMP & VSs\\
    3.60 +  \cline{2-7}
    3.61 +  \noalign{\vskip2pt}
    3.62 +  \hline
    3.63 +  Design & 19 & 6 & 3 & 52 & 18 & 6\\
    3.64 +  Code & 13 & 3 & 3 & 32 & 9 & 12\\
    3.65 +  Test & 7 & 2 & 2 & 12 & 8 & 5\\
    3.66 +  L.O.C. & 470 & 290 & 310 & 3000 & 690 & 780\\
    3.67 +  \hline
    3.68 +\end{tabular}
    3.69 +\caption{Hours to design, code, and test each embedded language. L.O.C. is lines of (original) C code, excluding libraries and comments.}
    3.70 +\label{tabPersonHoursLang}
    3.71 +\end{centering}
    3.72 +\end{table}
    3.74  
    3.75  \subsubsection{Comparison of Design Approaches}
     3.76  We give the bigger picture of the difference in design methods between the traditional approaches and the proto-runtime implementations, discussing OpenMP versus VOMP, Cilk 5.4 versus VCilk, pthreads versus Vthread, and OMPSs versus VSs.  These discussions present the two design philosophies and paint a picture of the development process under each approach.  The goal is to illustrate how the proto-runtime approach retains many language features, through its centralized services, while significantly reducing implementation time through reuse of those services, elimination of concurrency concerns in design and debugging, the simplifications afforded by the clean modularization of the proto-runtime approach, and the regularization of implementation from one language to another.
    3.77  
    3.78 +
    3.79 +%%%%%%%%%%%%%%%%%%%%%%%%
    3.80 +%%
    3.81 +%%%%%%%%%%%%%%%%%%%%%%%%
    3.82  \section{Related Work}
    3.83  
    3.84 -With the full understanding of the proto-runtime approach in hand, we discuss  how it compares to related work.
    3.85 -Now  we compare the proto-runtime approach to other approaches for implementing domain specific languages.  The criteria are: level of effort to implement the DSL, effort to port the DSL, runtime performance, and support for application performance. The main alternative methods are: posix threads, user-level threads, TBB, modifying libGomp, and using hardware primitives to make a custom runtime.
    3.86 +With the full understanding of the proto-runtime approach in hand, we discuss  how it compares to other approaches for implementing the runtimes of domain specific languages.  The criteria are: level of effort to implement the runtime, effort to port the runtime, runtime performance, and support for application performance. The main alternative implementation approaches are: posix threads, user-level threads, TBB, modifying libGomp, and using hardware primitives to make a custom runtime.
    3.87  
    3.88 -We talk about each of these approachs, then summarize the conclusions in Table \ref{tab:CriteriaVsApproach}.
    3.89 +We first talk about each of these approaches, then summarize the conclusions in Table \ref{tab:CriteriaVsApproach}.
    3.90  
     3.91  The first three methods involve building the DSL runtime on top of OS threads, user threads, or TBB, all of which are languages in their own right. So the DSL runtime runs on top of the runtime for that lower-level language.  This places control of work placement inside the lower-level runtime, denying it to the DSL runtime, which hurts application-code performance due to the inability to exploit data locality. In addition, OS threads have operating system overhead and OS-imposed fairness requirements, which keep runtime performance poor.
    3.92  
    3.93 -All three also force the DSL implementation to manage concurrency explicitly, using language primitives like locks.  TBB may have a slight advantage due to its task-scheduling commands, but only for task-based languages. Hence, implementation effort is poor for these approaches.  
    3.94 +All three also force the DSL implementation to manage concurrency explicitly, using language primitives such as locks.  TBB may have a slight advantage due to its task-scheduling commands, but only for task-based languages. Hence, implementation effort is poor for these approaches.  
    3.95  
    3.96 -For the same reason, for these three, the runtime code needs to be rewritten and tuned for each hardware platform, or else some form of hardware-abstraction placed into the runtime.  But putting in a hardware abstraction is essentially an alternative way of implementing half of the proto-runtime approach, but without the centralization, reuse, and modularization benefits.
    3.97 +For the same reason, for these three, the runtime code needs to be rewritten and tuned for each hardware platform for each language, or else some form of hardware abstraction must be placed into the runtime.  But such an abstraction is essentially an alternative way of implementing half of the proto-runtime approach, without the centralization, reuse, and modularization benefits.
    3.98  
    3.99 -libGomp is used by many language researchers (based on informal discussions) because it has a very simple structure, making it relatively easy to modify, especially for simple languages. However, it provides no services such as debugging or performance tuning, and it has no modularization or reuse across languages benefits.  As the price of the simplicity, performance suffers, as seen in the experiments [].  Also, re-writes of the DSL runtime are required for each platform in order to tune it to hardware characteristics.
   3.100 +Many language researchers use libGomp (based on informal discussions) because of its very simple structure, which makes it relatively easy to modify, especially for simple languages. However, it provides no services such as debugging or performance tuning, and it offers no modularization or cross-language reuse benefits.  As the price of the simplicity, performance suffers, as seen in the experiments [].  Also, re-writes of the DSL runtime are required for each platform in order to tune it to hardware characteristics. However, because the runtime is directly modified, the language gains control over placement of work, enabling good application performance.
   3.101  
   3.102 -However, because the runtime is directly modified, the language has control over placement of work. That lets the language use semantic information about communication between work-units to gain predictability about where input data may reside. That in turn enables scheduling algorithms that use the data locality predictions to reduce communication, for good application performance.
   3.103 -
   3.104 -The last alternative is to write a custom runtime from scratch, using hardware primitives like the Compare And Swap (CAS) instruction, or similar atomic read-modify-write instructions.  This approach requires the highest degree of implementation effort, and the worst portability across hardware.  However, if sufficient effort is expended on tuning, it can achieve the best runtime performance and equal the best performance of application code. So far, the gap has proven small between highly tuned language-specific custom runtime performance and that of our proto-runtime, but we only have the CILK implementation as a comparison point. 
   3.105 +The last alternative we consider is writing a custom runtime from scratch, using hardware primitives such as the Compare And Swap (CAS) instruction or similar atomic read-modify-write instructions.  This approach requires the highest degree of implementation effort and has the worst portability across hardware.  However, if sufficient effort is expended on tuning, it can achieve the best runtime performance and equal the best performance of application code. So far, the gap between highly tuned language-specific custom runtimes and our proto-runtime has proven small, but we have only the Cilk implementation as a comparison point. 
   3.106   
   3.107 -Putting this all together, Table \ref{tab:CriteriaVsApproach} shows that only the proto-runtime approach makes initial language implementation fast, as well as reduces porting effort, while keeping runtime performance high and enabling high application performance. 
   3.108 +Putting this all together, Table \ref{tab:CriteriaVsApproach} shows that the proto-runtime approach is the only one that scores high on all the measures. It makes initial language implementation fast, reduces porting effort, and keeps runtime performance high while enabling high application performance. 
   3.109  
   3.110  \begin{center}
   3.111  \begin{tabular}{|c|c|c|c|c|}\hline
   3.112 @@ -251,15 +303,45 @@
   3.113  \label{tab:CriteriaVsApproach}
   3.114  
   3.115  
   3.116 -\section{Conclusions}
   3.117 -The  main takeaways from the paper are the potential for embedded style Domain Specific Languages (eDSLs) to address the issues holding back parallel programming from the main-stream, and  the role the proto-runtime approach plays in enabling this, by simplifying the runtime aspect of implementing a large number of eDSLs across the many hardware targets. The proto-runtime approach does this by modularizing the runtimes, providing reuse of centralized services, and reuse of the hardware-specific performance tuning, which is performed once per hardware, on the proto-runtime, then enjoyed by all the eDSLs.  Hence, the proto-runtime approach provides a significant piece of the puzzle of providing eDSLs, to bring parallel programming into the mainstream.
   3.118  
   3.119 -?
   3.120  
   3.121 +%%%%%%%%%%%%%%%%%%%%%%%%
   3.122 +%%
   3.123 +%%%%%%%%%%%%%%%%%%%%%%%%
   3.124 +\section{Conclusions and Future Work}
   3.125 +The main takeaways from the paper are, first, the potential for embedded-style Domain Specific Languages (eDSLs) to address the issues that are holding back parallel programming, and second, the role that the proto-runtime approach can play in making eDSLs practical, by simplifying the runtime aspect of implementing a large number of eDSLs across the many hardware targets. 
   3.126 +%The proto-runtime approach does this by modularizing the runtimes, providing reuse of centralized services, and reuse of the hardware-specific performance tuning, which is performed once per hardware, on the proto-runtime, then enjoyed by all the eDSLs.  Hence, the proto-runtime approach provides a significant piece of the puzzle of providing eDSLs, to bring parallel programming into the mainstream.
   3.127 +
   3.128 +
   3.129 +%[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]] 
   3.130 +
   3.131 +
   3.132 +Specifically, we have shown how the approach modularizes runtime code, in a way that appears applicable to any language or execution model. It isolates the hardware-specific portion  from language behavior as well as from the language-driven placement of work onto resources, providing interfaces between them.
   3.133 +
   3.134 +  The modularization reduces the effort of implementing a new language, especially an embedded-style one, where runtime creation is a significant portion of the total effort.  It allows the low-level hardware portion to be reused by every language, and it simplifies the behavior implementation by handling shared state inside the proto-runtime and exporting a sequential interface for the behavior module to use. This simplification reduces effort, as does reuse of the hardware-specific portion, reuse of behavior code from one language to another, reuse of assignment code, and implementors' familiarity with the modular structure. The overall effort reduction was supported by measurements of implementation effort. 
   3.135 +
   3.136 +The proto-runtime approach makes it practical to maintain high overall runtime performance, with low effort for the language implementor. It is practical because high effort is put into performance-tuning the hardware-specific proto-runtime, which is then reused by each language. In this way the performance derived from the high tuning effort is inherited without extra effort by the language creators, thus amortizing the cost.
   3.137 +
   3.138 +Centralized services, such as debugging facilities, automated verification, concurrency handling, and hardware performance information gathering, were implemented inside the proto-runtime portion. We showed how they are reused by the languages. 
   3.139 +
   3.140 +Although we did not measure it, we indicated how application performance can be increased by giving the language direct control over placement of work, allowing it to take advantage of data affinity or application-generated communication patterns. This ability comes from the assignment module, which gives the language implementor control over which core each piece of work is assigned to, and over the order of executing the work units.
   3.141 +
   3.142 +Work on the proto-runtime approach is in its infancy, and much remains to be done, including:
   3.143 +\begin{itemize} 
   3.144 +\item  Creating related interfaces for use with distributed memory hardware, and interfaces for hierarchical runtimes, to improve performance on many-level hardware such as high-performance computers, and to tie together runtimes for different types of architecture, to cover heterogeneous architectures and machines.
   3.145 +\item Extending the proto-runtime interface to present hardware information that a work-assigner will need, but in a generic way that remains constant across many hardware configurations yet exposes all relevant information.
   3.146 +\item Exploring work assignment implementations that take advantage of language and application knowledge to improve placement of work to gain higher application performance.
   3.147 +\item Applying the proto-runtime approach to support a portability software stack, and supply OS services to applications via the proto-runtime, to further increase application-code portability.
   3.148 +\end{itemize}
   3.149  
   3.150  
   3.151  =============================================
   3.152 -
   3.153 +==
   3.154 +==
   3.155 +==
   3.156 +==
   3.157 +==
   3.158 +=============================================
   3.159  
   3.160  \section{The Problem}
   3.161  
   3.162 @@ -828,54 +910,10 @@
   3.163  \section{Measurements}
   3.164  
   3.165  \subsection{Implementation time}
   3.166 -Summarized in Table \ref{tabPersonHoursLang}, we measured the time it took us to design, code, and get an initial version working for each of the languages we created.  They are shown in the same order we created them, with SSR the first. As we gained experience,  design and coding became more efficient. Not shown is the 7 hours required to take the send-receive code from SSR and adapt it to work with tasks in VSs.  In addition, 11 hours was spent importing the DKU constructs into VSs.  These are hours spent on the computer, and don't include think time during other activities in the day.
   3.167 - 
   3.168 -\ 
   3.169 -
   3.170 -\begin{centering}
   3.171 -\begin{tabular}{|l|r|r|r|r|r|r|}
   3.172 -  \cline{2-7}
   3.173 -  \multicolumn{1}{r|}{} & SSR & Vthread & VCilk & HWSim & VOMP & VSs\\
   3.174 -  \cline{2-7}
   3.175 -  \noalign{\vskip2pt}
   3.176 -  \hline
   3.177 -  Design & 19 & 6 & 3 & 52 & 18& 6\\
   3.178 -  Code & 13 & 3 & 3& 32 & 9& 12\\
   3.179 -  Test & 7 & 2 & 2& 12 & 8& 5\\
   3.180 -  L.O.C. & 470 & 290 & 310& 3000 & 690 & 780\\
   3.181 -  \hline
   3.182 -\end{tabular}
   3.183 -\caption
   3.184 -{Hours to design, code, and
   3.185 -  test each embedded language. L.O.C. is lines of (original) C code,
   3.186 -  excluding libraries and comments.
   3.187 -}
   3.188 -\end{centering}
   3.189 -\label{tabPersonHoursLang}
   3.190  
   3.191  
   3.192  \subsection{Runtime and Application Performance}
   3.193  
   3.194 -For runtime performance:
   3.195 -
   3.196 --- Vthread vs pthread: laptop and server on exe vs task (and fibonacci?)
   3.197 -
   3.198 --- VCilk vs Cil: laptop and server on fibonacci (from Albert)
   3.199 -
   3.200 --- VOMP vs OpenMP: laptop and server on exe vs task and fibonacci
   3.201 -
   3.202 --- VSs vs OMPSs: laptop and server on fibonacci and jpeg
   3.203 -
   3.204 -\begin{tabular}{|c|c|c|c|c|c|c|}\hline
   3.205 -a & a & a & a & a & a & a \\\hline
   3.206 -a & a & a & a & a & a & a \\\hline
   3.207 -a & a & a & a & a & a & a \\\hline
   3.208 -a & a & a & a & a & a & a \\\hline
   3.209 -\end{tabular}
   3.210 -\caption{}
   3.211 -\label{tab}
   3.212 -
   3.213 -As seen, we didn't include application performance because we have not yet taken advantage of the opportunity to use language information to predict locality.  That research is in progress and will be reported in future papers.
   3.214  
   3.215  %%%%%%%%%%%%%%%%%%%%%
   3.216  \section{Related Work}
   3.217 @@ -884,25 +922,7 @@
   3.218  %%%%%%%%%%%%%%%%%%%%%
   3.219  \section{Conclusion and Future Work}
   3.220  \label{sec:conclusion}
   3.221 -[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]] 
   3.222  
   3.223 -We have shown how the proto-runtime approach modularizes runtime code, in a way that appears applicable to any language or execution model. It isolates the hardware-specific portion  from language behavior as well as from the language-driven placement of work onto resources, providing interfaces between them.
   3.224 -
   3.225 -  The modularization reduces the effort of implementing a new language, especially an embedded-style one where runtime creation is a significant portion of total  effort.  The low level hardware portion is reused by each language. The behavior implementation is simplified by handling shared state inside the proto-runtime and exporting a sequential interface for  the behavior module to use. The simplification   reduces  effort, as does reuse of the hardware-specific portion, reuse of behavior code from one language to another, reuse of assignment code, and familiarity with the modular structure by implementors. Overall effort reduction was supported by measurements of implementation effort. 
   3.226 -
   3.227 -The proto-runtime approach makes it practical to maintain high overall runtime performance, with low effort for the language implementor. It is practical because high effort is put into performance-tuning the hardware-specific proto-runtime, which is then reused by each language. In this way the performance derived from the high tuning effort is inherited without extra effort by the language creators, thus amortizing the cost.
   3.228 -
   3.229 -Centralized services were implemented inside the proto-runtime portion, such as debugging facilities, automated verification, concurrency handling, hardware performance information gathering, and so on. We showed how they are reused by the languages. 
   3.230 -
   3.231 -Although we didn't measure it, we indicated how application performance can be increased due to giving the language direct control over placement of work, to take advantage of data affinity or application-generated communication patterns. This ability is due to the assignment module, which provides the language implementor with control over which core work is assigned to, and the order of executing each work unit.
   3.232 -
   3.233 -Work on the proto-runtime approach is in its infancy, and much remains to be done, including:
   3.234 -\begin{itemize} 
   3.235 -\item  Creating related interfaces for use with distributed memory hardware, and interfaces for hierarchical runtimes, to improve performance on many-level hardware such as high-performance computers, and to tie together runtimes for different types of architecture, to cover heterogeneous architectures and machines.
   3.236 -\item Extending the proto-runtime interface to present hardware information that a work-assigner will need, but in a generic way that remains constant across many hardware configurations yet exposes all relevant information.
   3.237 -\item Exploring work assignment implementations that take advantage of language and application knowledge to improve placement of work to gain higher application performance.
   3.238 -\item Applying the proto-runtime approach to support a portability software stack, and supply OS services to applications via the proto-runtime, to further increase application-code portability.
   3.239 -\end{itemize}
   3.240  
   3.241  
   3.242  \end{document} 
     4.1 Binary file CVs/12_Ag__Industry_CV/CV_w_pubs.pdf has changed
     5.1 --- a/CVs/12_Ag__Industry_CV/CV_w_pubs.tex	Mon Nov 12 16:52:20 2012 -0800
     5.2 +++ b/CVs/12_Ag__Industry_CV/CV_w_pubs.tex	Fri Nov 30 13:26:57 2012 -0800
     5.3 @@ -57,7 +57,7 @@
     5.4  
     5.5  \end{minipage}
     5.6  
     5.7 -\section{Overview/Summary} Sean is a researcher and sometimes entrepreneur.  As an undergraduate and masters student, his research at UC Berkeley demonstrated the first evidence of secure communication using chaos, spread spectrum communication using a chaotic carrier signal, and gain from perturbation of chaotic circuits, with a total of 680+ citations to these papers.  He left with his Masters in 1994, to pursue starting a company with technology developed during a graduate course, eventually founding a fabless chip company around a MIMD-SIMD low-power massively parallel processor for graphics (ProSide), in 1997.  After the company's demise, he switched to software and worked his way up the chain, ending at Chief Software Architect at Nevik in 2000.  The dot-com crash precipitated a return to academia to start his PhD in 2003, where he worked on parallelism, developing a theory of parallel computation, and a performance-portable software stack. As part of this, in 2008 he was invited by Albert Cohen to work in Paris at INRIA, the national computer science research institute for France, concentrating on new parallel languages for high productivity and high performance portability.  This led to BLIS and VMS, two technologies that became the basis of his dissertation, for which he received his PhD in 2011.  He is currently advancing the software stack to support performance-portable parallel languages.
      5.8 +\section{Overview/Summary} Sean is a researcher and former entrepreneur.  As an undergraduate and master's student at UC Berkeley, he demonstrated the first evidence of secure communication using chaos, spread-spectrum communication using a chaotic carrier signal, and gain from perturbation of chaotic circuits, with a total of 680+ citations to these papers.  He left with his Master's in 1994 to pursue starting a company with technology developed during a graduate course, eventually founding a fabless chip company around a MIMD-SIMD low-power massively parallel processor for graphics (ProSide), in 1997.  After the company's demise, he switched to software and worked his way up the chain, ending as Chief Software Architect at Nevik in 2000.  The dot-com crash precipitated a return to academia to start his PhD in 2003, where he worked on parallelism, developing a theory of parallel computation and a performance-portable software stack. As part of this, in 2008 he was invited by Albert Cohen to work in Paris at INRIA, the national computer science research institute of France, concentrating on new parallel languages for high productivity and high performance portability.  This led to BLIS and VMS, two technologies that became the basis of his dissertation, for which he received his PhD in 2011.  He is currently advancing the software stack to support performance-portable parallel languages.
     5.9  
    5.10  
    5.11  
    5.12 @@ -82,7 +82,7 @@
    5.13  
    5.14  \item[Jun '11 to Jun '12:]  Post-doctorate researcher at Technical University Berlin.
    5.15  
    5.16 - Oversaw 6 students who worked  on: verifying a Basic Model of Parallel Computation (proposed in Dissertation), design of a low-power GPU that is free from programming and application restrictions, improvementation of Virtualized Master-Slave (VMS) (from Dissertation), and implementation of the HWSim language on top of VMS. 
     5.17 + Oversaw 6 students who worked on: design of a low-power GPU that is free from programming and application restrictions, verification of a Basic Model of Parallel Computation (proposed in Dissertation), improvement of the proto-runtime approach (from Dissertation), and implementation of the HWSim language using the proto-runtime. 
    5.18  
    5.19  \item[Apr '08 to Apr '11:]  INRIA, Paris and Ecole Normale Supereur: Performed research on Portable High Performance Parallelism.  
    5.20  
    5.21 @@ -90,11 +90,11 @@
    5.22  
    5.23  \item[Sept '03 to Apr '08:] UC Santa Cruz: Performed research on a theory of parallel computation, and developed a programming system for performance-portable software called CodeTime, in addition to an analytic performance model for out-of-order pipelines.
    5.24  
    5.25 -\item[Apr '02 to Sept '03:]  Freelance: In anticipation of starting a fabless chip company, architected an ultra low power multi-threaded high-throughput processor.  Developed a detailed analytic model of the processor.  Wrote behavioral code and a test suite to demonstrate its performance.  Pursued series-A funding.  
    5.26 +\item[Apr '02 to Sept '03:]  Startup: In anticipation of starting a fabless chip company, architected an ultra low power multi-threaded high-throughput processor.  Developed a detailed analytic model of the processor.  Wrote behavioral code and a test suite to demonstrate its performance.  Pursued series-A funding.  
    5.27  
    5.28  \item[Dec '00 to Apr '02:] Nevik Networks: As Chief Software Architect, designed and was responsible for the team implementing a telecom abstraction layer that provides web-based end-customer provisioning of big-iron class 5 switches as well as soft-switches.  Led development, drove requirements gathering and scoping of the product.
    5.29  
    5.30 -\item[Dec '99 to Dec '00:] SRI International:  A member of SRI's enterprise software consulting spin-off.  Designed a high speed trading system for Deutsche Bank's fixed income securities market.  Co-architected, for American Century, an integrated financial system.  Contributed papers analyzing and detailing electronic exchanges.  Technical lead on implementation of a financial planning services exchange.
     5.31 +\item[Dec '99 to Dec '00:] SRI International:  A member of SRI's enterprise software consulting spin-off.  Designed a high speed trading system for Deutsche Bank's fixed income securities market.  Co-architected, for American Century, an integrated financial system.  Contributed papers analyzing and detailing electronic exchanges.  Technical lead on implementation of a financial planning services exchange.
    5.32  
    5.33  \item[Apr '99 to Dec '99:] Cisco Systems, contract:  Member of web based ordering tool team (\$15 Billion in orders annually via this tool).  Added functionality via modifying front-end servlet code, back-end business logic, and Oracle database tables.
    5.34  
    5.35 @@ -110,38 +110,38 @@
    5.36  \pagebreak
    5.37  \section{Personal References}
    5.38  
    5.39 -\noindent These persons are familiar with my professional qualifications and my character:
    5.40 +%\noindent These persons are familiar with my professional qualifications and my character:
    5.41  
    5.42 -\begin{table}[ht]
    5.43 -\begin{tabular}{@{}lll@{}}
    5.44 -\textbf{Prof. Albert Cohen} \\
    5.45 -Professor and Researcher & Phone: & +33-1-44-32-21-67\\
    5.46 -INRIA & Email & albert.cohen@inria.fr\\
    5.47 -Ecole Normale Superieure \\
    5.48 -\\
    5.49 -\end{tabular}
    5.50 +Will be provided post-interview. 
    5.51  
    5.52 -\vspace{.15in}
    5.53 -\begin{tabular}{@{}lll@{}}
    5.54 -\textbf{Dr Yadu Zambre} \\
    5.55 -Director  & Phone: & +1-408-472-8854\\
    5.56 - Lockheed Martin & Email & yzambre@ieee.org\\
    5.57 - \\
    5.58  
    5.59 -\end{tabular}
    5.60 +%\begin{table}[ht]
    5.61 +%\begin{tabular}{@{}lll@{}}
    5.62 +%\textbf{Prof. Albert Cohen} \\
    5.63 +%Professor and Researcher & Phone: & +33-1-44-32-21-67\\
    5.64 +%INRIA & Email & albert.cohen@inria.fr\\
    5.65 +%Ecole Normale Superieure \\
    5.66 +%\\
    5.67 +%\end{tabular}
    5.68 +%
    5.69 +%\vspace{.15in}
    5.70 +%\begin{tabular}{@{}lll@{}}
    5.71 +%\textbf{Dr Yadu Zambre} \\
    5.72 +%Director  & Phone: & +1-408-472-8854\\
    5.73 +% Lockheed Martin & Email & yzambre@ieee.org\\
    5.74 +% \\
    5.75 +%
    5.76 +%\end{tabular}
    5.77 +%
    5.78 +%\vspace{.15in}
    5.79 +%\begin{tabular}{@{}lll@{}}
    5.80 +%\textbf{Dan Zivkovic} \\
    5.81 +%Developer & Phone: & +1-650-796-8623\\
    5.82 +%Google & Email & zivkovic@google.com \\
    5.83 +%\end{tabular}
    5.84 +%\end{table}
    5.85  
    5.86 -
    5.87 -\vspace{.15in}
    5.88 -\begin{tabular}{@{}lll@{}}
    5.89 -\textbf{Dan Zivkovic} \\
    5.90 -Developer & Phone: & +1-650-796-8623\\
    5.91 -Google & Email & zivkovic@google.com \\
    5.92 - 
    5.93 -
    5.94 -
    5.95 -\end{tabular}
    5.96 -
    5.97 -\end{table}
      5.99 +\section{Publications}
    5.99  
   5.100  \cite{BLISHome,
   5.101  HotPar10_w_BLIS,