changeset 95:df22471e0fdb

Proto-runtime Full conf version -- early in changing to sales style -- hard tech style still there
author Sean Halle <seanhalle@yahoo.com>
date Mon, 12 Nov 2012 16:52:20 -0800
parents 483eb2bbc724
children 45975739549a
files 0__Papers/VMS/VMS__Foundation_Paper/LCPC_springer_published.pdf 0__Papers/VMS/VMS__Foundation_Paper/VMS__Full_conference_version/latex/VMS__Full_conf_paper_2.tex
diffstat 2 files changed, 134 insertions(+), 65 deletions(-) [+]
line diff
     1.1 Binary file 0__Papers/VMS/VMS__Foundation_Paper/LCPC_springer_published.pdf has changed
     2.1 --- a/0__Papers/VMS/VMS__Foundation_Paper/VMS__Full_conference_version/latex/VMS__Full_conf_paper_2.tex	Mon Oct 29 04:55:37 2012 -0700
     2.2 +++ b/0__Papers/VMS/VMS__Foundation_Paper/VMS__Full_conference_version/latex/VMS__Full_conf_paper_2.tex	Mon Nov 12 16:52:20 2012 -0800
     2.3 @@ -45,7 +45,7 @@
     2.4  \copyrightdata{[to be supplied]} 
     2.5  
     2.6  \titlebanner{banner above paper title}        % These are ignored unless
     2.7 -\preprintfooter{short descripti0on of paper}   % 'preprint' option specified.
     2.8 +\preprintfooter{short description of paper}   % 'preprint' option specified.
     2.9  
    2.10  
    2.11  \title{A Proto-Runtime Approach to Domain Specific Languages}
    2.12 @@ -71,7 +71,7 @@
    2.13  Domain Specific Languages that are embedded into a base language have promise to provide productivity, performance portability, and wide adoption for parallel programming. However, such languages have too few users to support the large effort required to create them, resulting in low uptake of the method.
    2.14  To solve this, we introduce a proto-runtime approach, which reduces the effort to create and port domain specific languages. It modularizes the creation of runtime systems and the parallelism constructs they implement, separating the language-construct logic and scheduling logic away from the low-level runtime details of concurrency, memory consistency, and runtime-performance related code.
    2.15  As a result, new parallel constructs are written using sequential reasoning and are easily reused across languages, while scheduling of work onto hardware is under language and application control, without interference from an underlying thread-package scheduler. This enables higher quality scheduling decisions, for higher application performance.
    2.16 -We present measurements of the time taken to develop new languages, as well as time to re-implement existing ones,  which average  a few days each.  In addition, we measure performance of proto-runtime based implementations going head-to-head with the standard distributions of Cilk, OpenMP, StarSs (OMPSs), and posix threads, showing that the proto-runtime matches or outperforms on large servers in all cases.
    2.17 +We present measurements of the time taken to develop new languages, as well as the time to re-implement existing ones, which average a few days each.  In addition, we measure the performance of proto-runtime based implementations going head-to-head with the standard distributions of Cilk, StarSs (OMPSs), and POSIX threads, showing that the proto-runtime matches or outperforms them on large servers in all cases.
    2.18  \end{abstract}
    2.19  
    2.20  
    2.21 @@ -86,11 +86,11 @@
    2.22  
    2.23  Programming in the past has been overwhelmingly sequential, with the applications being run on sequential hardware.  But the laws of physics have forced the hardware to become parallel, even down to embedded devices such as smart phones. The trend appears unstoppable, forcing essentially all future programming to  become parallel programming.  However,  sequential programming remains the dominant approach due to  the difficulty of the traditional parallel programming methods. 
    2.24  
    2.25 -The difficulties with parallel programming fall into three main categories: 1) a difficult mental model, 2) having to rewrite the code for each hardware target to get acceptable performance and 3) disruption to existing practices, including steep learning curve, changes to the tools used, and changes in design practices. Many believe that these can all be overcome with the use of Domain-Specific Languages. But such languages have been costly to create and port across hardware targets, which makes them impractical given the small number of users of each language, and so have not caught on.
    2.26 +The difficulties with parallel programming fall into three main categories: 1) a difficult mental model, 2) having to rewrite the code for each hardware target to get acceptable performance and 3) disruption to existing practices, including steep learning curve, changes to the tools used, and changes in design practices. Many believe that these can all be overcome with the use of Domain-Specific Languages. But such languages have been costly to create, and to port across hardware targets, which makes them impractical given the small number of users of each language, and so have not caught on.
    2.27  
    2.28 -We propose that a method that makes Domain Specific Languages (DSLs) low cost to produce as well as to port across hardware targets will allow them to fulfill their promise, and we introduce what we call a proto-runtime to help with this.  
    2.29 +We propose that a method that makes Domain Specific Languages (DSLs) low cost to produce as well as to port across hardware targets will allow them to fulfill their promise, and we introduce what we call a proto-runtime to help towards this goal.  
    2.30  
    2.31 -A proto-runtime is a normal, full, runtime, but with two key pieces replaced by an interface. One piece is the logic of language constructs, the other is the logic for choosing which core to assign work onto. What's left is the proto-runtime, which comprises low-level details of internal runtime communication between cores and protection of shared runtime state during concurrent accesses performed  by the plugged-in pieces.
    2.32 +A proto-runtime is a normal, full runtime, but with two key pieces replaced by interfaces. One piece is the logic of language constructs; the other is the logic for choosing which core to assign work onto. What's left is the proto-runtime, which comprises the low-level details of internal runtime communication between cores and protection of shared runtime state during concurrent accesses performed by the plugged-in language-behavior pieces.
    2.33  
    2.34  The decomposition into a proto-runtime plus plugged-in language behaviors modularizes the construction of runtimes.  The proto-runtime is one module; it embodies the runtime internals, which are hardware-oriented and independent of language. The plugged-in portions form the other two modules, which are language specific. The interface between them occurs at a natural boundary, one that separates the hardware-oriented portion of a runtime from the language-oriented portion. 
    2.35  
    2.36 @@ -114,7 +114,7 @@
    2.37  
    2.38  \item Part of the time reduction is due to all languages inheriting the effort of performance tuning the runtime internals, so the language doesn't have to tune the runtime to the hardware  (\S\ ).  
    2.39  
    2.40 -\item  Part of the time reduction is due  to the use of sequential thinking when implementing the language logic. This is possible because the proto-runtime provides protection of shared internal runtime state, and exports an interface that presents a sequential model  (\S\ ). 
    2.41 +\item  Part of the time reduction is due  to the use of sequential thinking when implementing the language logic. Sequential thinking is possible because the proto-runtime provides protection of shared internal runtime state, and exports an interface that presents a sequential model  (\S\ ). 
    2.42  
    2.43  \item Part of the  time reduction is due to the modularization making it practical to reuse language logic from one language to another  (\S\ ).
    2.44  
    2.45 @@ -128,7 +128,7 @@
    2.46  
    2.47  \begin{itemize}
    2.48  
    2.49 -\item Similar control over hardware is not possible when the language is   built on top of a layer that has its own hardware assignment, such as a package like Posix threads or TBB  (\S\ ).
    2.50 +\item Similar control over hardware is not possible when the language is   built on top of a package like Posix threads or TBB, which has its own hardware assignment   (\S\ ).
    2.51  
    2.52  \end{itemize}
    2.53  
    2.54 @@ -136,31 +136,133 @@
    2.55  
    2.56  \end{itemize}
    2.57  
    2.58 -The paper is organized as follows: In 
    2.59 -\S \ref{sec:DSLHypothesis} we expand on our hypothesis that an embedded style DSL (eDSL) provides high programmer productivity, with a low learning curve. Further, in \S we show that when an application is written in a well designed eDSL, porting it to new hardware becomes simpler because often only the language needs to be ported.  That is because the elements of the problem being solved that require large amounts of computation are often pulled into the language. Lastly, in \S we hypothesize that switching from sequential programming to using an eDSL is low disruption because the base language remains the same, along with most of the development tools and practices.
    2.60 +The paper is organized as follows: We first (\S \ref{sec:DSLHypothesis}) expand on our hypothesis that an embedded-style DSL (eDSL) provides high programmer productivity, with a low learning curve. Further, in \S we show that when an application is written in a well designed eDSL, porting it to new hardware becomes simpler, because often only the language needs to be ported.  That is because the elements of the problem being solved that require large amounts of computation are often pulled into the language. Lastly, in \S we hypothesize that switching from sequential programming to using an eDSL is low disruption, because the base language remains the same, along with most of the development tools and practices.  Hence, we cover how the three issues currently making parallel programming unattractive are addressed by embedded-style DSLs. 
    2.61 +
    2.62 +We next show what blocks the adoption of eDSLs, and where the main effort in implementing an eDSL lies. Specifically, in \S \ref{sec:DSLHypothesis} we show that the small number of users of an eDSL means that the eDSL must be very low effort to create, and also low effort to port to new hardware.  At the same time, the eDSL must remain very high performance across hardware targets. 
    2.63 +
    2.64 +In \S we analyze where the effort of creating an eDSL is expended. It turns out that in the traditional approach, it is expended in creating the translator for the custom DSL syntax, in creating the runtime, and in performance tuning the major domain-specific constructs. We propose that the MetaBorg[] translation approach  covers creating translators for custom syntax, and that tuning constructs is inescapable, leaving the question of runtime implementation time. 
    2.65 +
    2.66 +In \S we explore the effects of runtime implementation time by taking a step back and examining what the industry-wide picture would be if the eDSL approach were adopted. A large number of eDSLs will come into existence, each with its own set of runtimes, one runtime for each hardware target.  That causes a multiplicative effect: the number of runtimes will equal the number of eDSLs times the number of hardware targets.  Unless the effort of implementing runtimes reduces, this multiplicative effect could dominate, which would retard the uptake of eDSLs. Thus, an approach that mitigates this multiplicative effect is valuable, and that is the role the proto-runtime plays.    
    2.67 +
    2.68 +Following that background on DSLs, we move on to the details of the proto-runtime approach (\S ). We provide details of how a runtime is modularized (\S ), showing what responsibilities are encapsulated in which modules, and what the interfaces between them look like. We show how this lets the proto-runtime be reused by all languages on a given hardware target, and how the low-level tuning of the proto-runtime for specific hardware automatically benefits all the languages that run on that hardware.   
    2.69 +
    2.70 +We follow this (\S  ) with an in-depth look at implementing language logic, and show how the proto-runtime interface allows it to use sequential thinking. We then give similar detail (\S ) on the implementation of the assigner, which chooses what core executes each chunk of work. We discuss how that has the potential to improve application performance by reducing communication between cores and reducing idle time of cores.   We support our belief (\S ) that the patterns we followed when modularizing are indeed fundamental and will remain valid for future languages and hardware.  We then discuss (\S ) some of the centralized services provided by the current proto-runtime implementation, as well as planned future ones. Then  we give an example of reusing language logic from one language implementation to another (\S ). 
    2.71 +
    2.72 +With the background on eDSLs and description of the proto-runtime approach behind us, we then provide overhead measurements (\S ) and implementation time measurements (\S ).  The overhead results (\S ) indicate that the proto-runtime approach has far lower overhead than even the current highly tuned Linux thread implementation, and we discuss why equivalent user-level M-to-N thread packages haven't been pursued, leaving no viable user-level libraries to compare against. We also give numbers (\S ) indicating that the proto-runtime approach is competitive with Cilk and OMPSs on large multi-core servers.
    2.73 +
    2.74 +We go on to give a summary of development time (\S ) for the various embedded languages created so far.  Unfortunately, no control is available to compare against, but we provide estimates, based on anecdotal evidence, of the time taken to develop the versions compared against for overhead (\S ).  We continue with a bigger-picture discussion of the difference in design methods between the traditional approaches and the proto-runtime implementations (\S ). We discuss OpenMP versus the equivalent proto-runtime version, called VOMP (\S ), then Cilk 5.4 versus the proto-runtime VCilk (\S ), pthreads versus Vthread (\S ), and OMPSs versus VSs (\S ).  These discussions lay out the two design philosophies and paint a picture of the development process in the two competing approaches.  The goal is to illustrate how the proto-runtime approach maintains many of the features through its centralized services, while significantly reducing implementation time through reuse of those services, elimination of concurrency concerns from design and debugging, the simplifications in design and implementation that come from the clean modularization, and the regularization of implementation from one language to the next.
    2.75 +
    2.76 +Then, with the full understanding of the proto-runtime approach in hand, we discuss  how it compares to related work (\S ).
    2.77 +
    2.78 +Finally, we highlight the main conclusions drawn from the work (\S ).
    2.79 +
    2.80 +
    2.81 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    2.82 +\section{The DSL Hypothesis}
    2.83 +\label{sec:DSLHypothesis}
    2.84 +
    2.85 +In this section we expand on the hypothesis that an embedded-style DSL (eDSL) provides high programmer productivity, with a low learning curve. Further, we show (\S ) that when an application is written in a well designed eDSL, porting it to new hardware becomes simpler, because often only the language needs to be ported.  That is because the elements of the problem being solved that require large amounts of computation are often pulled into the language. Lastly (\S ), we hypothesize that switching from sequential programming to using an eDSL is low disruption, because the base language remains the same, along with most of the development tools and practices.
    2.86  
    2.87  In \S \ref{sec:DSLHypothesis} we show that the small number of users of an eDSL means that the eDSL must be very low effort to create, and also low effort to port to new hardware.  At the same time, the eDSL must remain very high performance across hardware targets. 
    2.88  
    2.89  In \S we analyze where the effort of creating an eDSL is expended. It turns out that in the traditional approach, it is mainly expended in creating the runtime, and in performance tuning the major domain-specific constructs. We use this to support the case that speeding up runtime creation makes eDSLs more viable. 
    2.90  
    2.91 -In \S we take a step back and examine what the industry-wide picture would be if the eDSL approach were adopted. A large number of eDSLs will come into existence, each with its own set of runtimes, one runtime for each hardware target.  That causes a multiplicative effect: the number of runtimes will equal the number of eDSLs times the number of hardware targets.  Unless the effort of implementing runtimes reduces, this multiplicative effect could dominate, which would retard the uptake of eDSLs.    
    2.92 +In \S we take a step back and examine what the industry-wide picture would be if the eDSL approach were adopted. A large number of eDSLs will come into existence, each with its own set of runtimes, one runtime for each hardware target.  That causes a multiplicative effect: the number of runtimes will equal the number of eDSLs times the number of hardware targets.  Unless the effort of implementing runtimes reduces, this multiplicative effect could dominate, which would retard the uptake of eDSLs.
    2.93  
    2.94 -Following that background on DSLs, in \S we move on to the details of the proto-runtime approach. In \S we provide details of how a runtime is modularized, showing what responsibilities are encapsulated in which modules, and what the interfaces between them look like. We show how this makes the proto-runtime be reused by all languages on given hardware, and how the low-level tuning of the proto-runtime for specific hardware automatically benefits all the languages that run on that hardware.   
    2.95 +\section{Details of the Proto-Runtime Approach}
    2.96 +
    2.97 +Having motivated how proto-runtimes can reduce the cost of bringing eDSLs to the mainstream, we move on to the details of the proto-runtime approach.
    2.98 +
    2.99 +\subsection{Anatomy of a Proto-Runtime Based Runtime}
   2.100 +
   2.101 +
   2.102 +
    2.103 +Figure X shows how a proto-runtime modularizes runtime construction. The bottom portion is the implementation of the proto-runtime itself, which exports two interfaces: one used to add the behavior of language constructs, the other to add an assigner that chooses where work is performed.  The behavior module creates work and determines when work is free to execute, while the assigner module decides on which processor the work executes and in what order. 
   2.104 +Hence, the behavior module tracks  constraints on work coming from language semantics, including data dependencies.
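
To make the division concrete, the following is a minimal sketch, in C, of what the two plug-in interfaces might look like. All type and function names here are hypothetical, chosen to illustrate the structure rather than to reproduce the actual proto-runtime API.
\begin{verbatim}
/* Hypothetical sketch of the two plug-in interfaces. */

typedef struct SlaveVP    SlaveVP;    /* a schedulable unit of work  */
typedef struct ReqCarrier ReqCarrier; /* data sent with a construct  */
typedef struct LangEnv    LangEnv;    /* shared language state, which
                                         the proto-runtime protects  */

/* Behavior module: one handler per language construct.  The
   proto-runtime gives the handler exclusive access to langEnv while
   it runs, so handlers are written with sequential reasoning. */
typedef void (*ConstructHandlerFn)(SlaveVP *requester,
                                   ReqCarrier *req,
                                   LangEnv *langEnv);

/* Assigner module: picks which piece of free work the given core
   executes next; returning NULL leaves the core idle. */
typedef SlaveVP *(*AssignerFn)(int coreNum, LangEnv *langEnv);

/* A language plugs its two modules into the proto-runtime. */
void protoRuntime_register_language(ConstructHandlerFn *handlers,
                                    int numConstructs,
                                    AssignerFn assigner);
\end{verbatim}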
   2.105 +
    2.106 +Before expanding on each module, first a note about how this scheme maps onto hardware. The scheme maps onto all hardware configurations, although the interface between modules may change slightly between classes of hardware. For example, on multi-core systems, a copy of the proto-runtime plus language modules runs separately on each core, and the copies communicate via shared variables in a shared language environment.  The proto-runtime protects access to the shared language environment so that language modules can be written in sequential style.  However, the proto-runtime plus language environment can also be centralized, with only a single copy running on one core, as is being done in a project that adds hardware acceleration
   2.107 +to the runtime. On distributed memory, the interface exposes messaging that updates 
   2.108 +
   2.109 +?
   2.110 +
   2.111 +The proto-runtime implements hardware-specific functions, as well as "centralized" services that it makes available to all languages.  Hardware specific functions include communicating between processors and protecting the internal state used by the language modules.
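
As a rough illustration of this protection, the dispatch of a language request might look like the following sketch; the mutual-exclusion and lookup primitives are assumptions made for illustration, since the actual mechanism is hardware specific.
\begin{verbatim}
/* Hypothetical sketch: the proto-runtime serializes access to the
   shared language environment around each handler invocation, so
   the handler body itself contains no concurrency code. */
void protoRuntime_handle_request(SlaveVP *vp, ReqCarrier *req,
                                 LangEnv *env)
{
    /* Hardware-specific mutual exclusion, implemented once per
       hardware target and inherited by every language. */
    acquire_lang_env(env);

    ConstructHandlerFn handler = lookup_handler(env, req);
    handler(vp, req, env);   /* sequential-style language logic */

    release_lang_env(env);
    /* Work the handler marked free is now visible to the assigner
       on every core. */
}
\end{verbatim}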
   2.112 +
    2.113 +Internal state of the language modules is used to track data dependencies, work availability, and construct-specific constraints. Changes in this state need to be shared among the copies of the runtime running on the other cores. 
   2.114 +
   2.115 +
   2.116  
   2.117  We follow this in \S with an in-depth look at implementing language logic, and show how the proto-runtime interface allows it to use sequential thinking. We then give similar detail in \S on the implementation of the assigner, which chooses what core executes each chunk of work. We discuss how that has the potential to improve application performance by reducing communication between cores and reducing idle time of cores.  In \S we support our belief that the patterns we followed when modularizing are indeed fundamental and will remain valid for future languages and hardware. In \S we discuss some of the centralized services provided by the current proto-runtime implementation, as well as planned future ones. Then in \S we give an example of reusing language logic from one language implementation to another. 
   2.118  
   2.119 -With the background on eDSLs and description of the proto-runtime approach behind us, we then provide overhead measurements in \S and implementation time measurements in \S.  Overhead shown in \S indicates that the proto-runtime approach has far lower overhead than even the current highly tuned Linux thread implementation, and discusses why equivalent user-level M to N thread packages haven't been pursued, leaving no viable user-level libraries to compare against.  In \S we give numbers that indicate that the proto-runtime approach is also competitive with Cilk, OpenMP, and OMPSs, on large multi-core servers.
   2.120 +\section{Measurements}
   2.121 +With the background on eDSLs and description of the proto-runtime approach behind us, we now provide overhead measurements in \S and implementation time measurements in \S .
   2.122  
   2.123 -\S gives a summary of development time of the various embedded languages created so far.  Unfortunately, no control is available to compare against, but we provide estimates based on anecdotal evidence of the time taken to develop the versions compared against for overhead, in \S.  We continue in \S with a bigger picture discussion of the difference in design methods between traditional approaches and the proto-runtime implementations. In \S we discuss OpenMP versus the equivalent proto-runtime version, VOMP.  In \S we discuss Cilk 5.4 vs the proto-runtime VCilk. In \S we discuss pthread vs Vthread, and in \S OMPSs vs VSs.  These discussions attempt to give the two design philosophies and paint a picture of the development process in the two competing approaches.  The goal is to illustrate how the proto-runtime approach maintains many of the features, through its centralized services, while significantly reducing implementation time, through reuse of the services, elimination of concurrency concerns in design and debugging, and in the simplifications in design and implementation caused by the clean modularization of the proto-runtime approach, and the regularization of implementation from one language to another.
   2.124 +\subsection{Overhead Measurements}
   2.125 +For the following measurements, we use a single-socket 4-core 2.4 GHz laptop, and a server with 4 sockets of 10 cores each.
   2.126  
   2.127 -Then, with the full understanding of the proto-runtime approach in hand, we discuss in \S  how it compares to related work.
   2.128 +\subsubsection{Vthread Versus Highly Tuned Posix Threads}
   2.129 +Measurements indicate that the proto-runtime approach has far lower overhead than even the current highly tuned Linux thread implementation. We also discuss why equivalent user-level M-to-N thread packages haven't been pursued, leaving no viable user-level libraries to compare against.  
   2.130 +\subsubsection{VCilk Versus Cilk 5.4}
   2.131 +In \S we give numbers that indicate that the proto-runtime approach is also competitive with Cilk.
   2.132 +\subsubsection{VSs Versus StarSs (OMPSs)}
   2.133 +OMPSs
   2.134 +\subsubsection{VOMP Versus OpenMP}
   2.135 +VOMP
   2.136  
   2.137 -Finally, we highlight the main conclusions drawn from the work in \S .
   2.138 +\subsection{Development Time Measurements}
   2.139 +Here we summarize the time to develop each of the eDSLs created so far. As a control, we estimate, based on anecdotal evidence, how long the equivalent functionality would take to develop using the traditional approach.
   2.140  
   2.141 +\subsubsection{Comparison of Design Approaches}
   2.142 +We give the bigger picture of the difference in design methods between the traditional approaches and the proto-runtime implementations, discussing OpenMP versus VOMP, Cilk 5.4 versus VCilk, pthreads versus Vthread, and OMPSs versus VSs.  These discussions lay out the two design philosophies and paint a picture of the development process in the two competing approaches.  The goal is to illustrate how the proto-runtime approach maintains many of the language features through its centralized services, while significantly reducing implementation time through reuse of those services, elimination of concurrency concerns from design and debugging, the simplifications that come from the clean modularization, and the regularization of implementation from one language to the next.
   2.143  
   2.144 -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   2.145 +\section{Related Work}
   2.146 +
   2.147 +With the full understanding of the proto-runtime approach in hand, we now compare it to other approaches for implementing domain specific languages.
   2.148 +The criteria are: the level of effort to implement the DSL, the effort to port the DSL, runtime performance, and support for application performance. The main alternative methods are: posix threads, user-level threads, TBB, modifying libGomp, and using hardware primitives to make a custom runtime.
   2.149 +
   2.150 +We discuss each of these approaches, then summarize the conclusions in Table \ref{tab:CriteriaVsApproach}.
   2.151 +
   2.152 +The first three methods involve building the DSL runtime on top of OS threads, user threads, or TBB, all of which are languages in their own right. So the DSL runtime runs on top of the runtime for that lower-level language.  This places control of work placement inside the lower-level runtime, out of reach of the DSL runtime, which hurts application-code performance due to the inability to exploit data locality. In addition, OS threads have operating system overhead and OS-imposed fairness requirements, which keep runtime performance poor.
   2.153 +
   2.154 +All three also force the DSL implementation to manage concurrency explicitly, using language primitives like locks.  TBB may have a slight advantage due to its task-scheduling commands, but only for task-based languages. Hence, implementation effort is poor for these approaches.  
   2.155 +
   2.156 +For the same reason, for these three, the runtime code needs to be rewritten and tuned for each hardware platform, or else some form of hardware abstraction must be placed into the runtime.  But putting in a hardware abstraction is essentially an alternative way of implementing half of the proto-runtime approach, without the centralization, reuse, and modularization benefits.
   2.157 +
   2.158 +libGomp is used by many language researchers (based on informal discussions) because it has a very simple structure, making it relatively easy to modify, especially for simple languages. However, it provides no services such as debugging or performance tuning, and it offers no modularization or cross-language reuse benefits.  As the price of the simplicity, performance suffers, as seen in the experiments [].  Also, rewrites of the DSL runtime are required for each platform in order to tune it to the hardware characteristics.
   2.159 +
   2.160 +However, because the runtime is directly modified, the language has control over placement of work. That lets the language use semantic information about communication between work-units to gain predictability about where input data may reside. That in turn enables scheduling algorithms that use the data locality predictions to reduce communication, for good application performance.
   2.161 +
   2.162 +The last alternative is to write a custom runtime from scratch, using hardware primitives like the Compare And Swap (CAS) instruction, or similar atomic read-modify-write instructions.  This approach requires the highest implementation effort and has the worst portability across hardware.  However, if sufficient effort is expended on tuning, it can achieve the best runtime performance and equal the best application-code performance. So far, the gap between highly tuned language-specific custom runtime performance and that of our proto-runtime has proven small, but we only have the Cilk implementation as a comparison point. 
   2.163 + 
   2.164 +Putting this all together, Table \ref{tab:CriteriaVsApproach} shows that only the proto-runtime approach makes initial language implementation fast, as well as reduces porting effort, while keeping runtime performance high and enabling high application performance. 
   2.165 +
   2.166 +\begin{table}[h]\begin{center}
   2.167 +\begin{tabular}{|c|c|c|c|c|}\hline
   2.168 +\textbf{Runtime creation} & \textbf{impl.}& \textbf{porting} & \textbf{runtime} & \textbf{application} \\
   2.169 +\textbf{approach} & \textbf{ease} & \textbf{ease} & \textbf{perf.} & \textbf{perf.}\\\hline
   2.170 +\textbf{OS Threads} & ++ & ++ & + & + \\\hline
   2.171 +\textbf{User Threads} & ++& ++ & ++ & + \\\hline
   2.172 +\textbf{TBB} & ++ & ++ & ++ & + \\\hline
   2.173 +\textbf{libGomp} & +++ & ++ & +++ & ++++ \\\hline
   2.174 +\textbf{HW primitives} & + & + & +++++ & +++++ \\\hline
   2.175 +\textbf{Proto-runtime} & +++++ & +++++ & ++++ & +++++\\\hline
   2.176 +\end{tabular}
   2.177 +\end{center}
   2.178 +\caption{How well each approach scores on each measure important to the implementor of a DSL runtime. On the left are the approaches that can be used to write the runtime; at the top are the measures an implementor may care about. For every measure, one plus is the lowest score, indicating the approach is least desirable, and five plusses indicate the highest desirability.  The scores are based on the reasons discussed in the text. }
   2.179 +\label{tab:CriteriaVsApproach}\end{table}
   2.180 +
   2.181 +
   2.182 +\section{Conclusions}
   2.183 +The main takeaways from the paper are the potential for embedded-style Domain Specific Languages (eDSLs) to address the issues holding parallel programming back from the mainstream, and the role the proto-runtime approach plays in enabling this, by simplifying the runtime side of implementing a large number of eDSLs across the many hardware targets. The proto-runtime approach does so by modularizing the runtimes, providing reuse of centralized services, and providing reuse of the hardware-specific performance tuning, which is performed once per hardware target, on the proto-runtime, and then enjoyed by all the eDSLs.  Hence, the proto-runtime approach supplies a significant piece of the puzzle of making eDSLs practical, to bring parallel programming into the mainstream.
   2.184 +
   2.185 +?
   2.186 +
   2.187 +
   2.188 +
   2.189 +=============================================
   2.190 +
   2.191 +
   2.192  \section{The Problem}
   2.193 -\label{sec:problem}
   2.194 +
   2.195  [[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   2.196  
   2.197  [[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
   2.198 @@ -227,7 +329,7 @@
   2.199  
   2.200  It is this fit between language constructs and the mental model of the application that makes DSLs highly productive and easy to learn, at the same time, it is also what makes applications written in them more portable.  Application patterns that have strong impact on parallel performance are captured as language constructs.  The rest of the source code has less impact on parallel performance, so just porting the language is enough to get high performance on each hardware target.
   2.201  
   2.202 -In practice, designing such a language is an art, and for some hardware targets, the language can become intrusive.  For example, for porting to GPGPUs, their performance is driven by decomposition into many small, simple, kernels, which access memory in contiguous chunks.  Fitting into this pattern forces rearrangement of the base sequential code, and even constrains choice of algorithm.  Hence, a DSL that is portable to standard architectures as well as GPUs would place the GPU restrictions onto code for all machines.  However, much excellent work [polyhedral, others] is being done on automated tools to transform standard code to GPU form, which would lift the  restrictions.  Also, constructs such as the DKU pattern [] map well onto GPUs as well as standard hardware.
   2.203 +In practice, designing such a language is an art, and for some hardware targets, the language can become intrusive.  For example, when porting to GPGPUs, performance is driven by decomposition into many small, simple kernels, which access memory in contiguous chunks.  Fitting into this pattern forces rearrangement of the base sequential code, and even constrains the choice of algorithm.  Hence, a DSL that is portable to standard architectures as well as GPUs would place the GPU restrictions onto the code for all machines.  However, much excellent work [polyhedral, others] is being done on automated tools to transform standard code to GPU form, which would lift these restrictions.  Also, constructs such as the DKU pattern [] map well onto GPUs as well as standard hardware.
   2.204  
   2.205  \subsection{The embedded style of DSL}
   2.206  \label{subsec:EmbeddedDSLs}
   2.207 @@ -361,10 +463,7 @@
   2.208  These are the two behaviors a construct performs inside the runtime: managing conditions on which work is free, and managing assignment of free work onto cores.
   2.209  
   2.210  For task based languages, a task runs to completion then always switches to the runtime at the end.  Hence, no suspend and resume exists. Once inside, the runtime's job is to track conditions on which tasks are ready to run, or which to create.  For example, in dataflow, a task is created only once all conditions for starting it are met.  Hence, the only language constructs are "instantiate a task-creator", "connect a task creator to others", and "end a task".  During a run, all of the runtime behavior takes place inside the "end a task" construct, where the runtime sends outputs from the ending task to the inputs of connected task-creators.  The "send" action modifies internal runtime state, which represents the order of inputs to a creator on all of its input ports. When all inputs are ready, it creates a new task, then when hardware is ready, assigns the task to a core.
   2.211 -120 is parent
   2.212 -9e0 is 0
   2.213 -750 is 1
   2.214 -?
   2.215 +
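
As an illustration of the ``end a task'' behavior just described, a dataflow behavior module might look roughly like the following sketch; the data structures and helper names are invented for the example.
\begin{verbatim}
/* Hypothetical sketch of a dataflow "end a task" handler.  It runs
   with exclusive access to the language environment, so the logic
   is written sequentially, with no locking. */
void dataflow_end_task(Task *ending, LangEnv *env)
{
    for (int i = 0; i < ending->numOutputs; i++) {
        /* Send each output to the connected task-creator's port. */
        Port *port = ending->output[i].connectedTo;
        enqueue(port->pendingInputs, ending->output[i].value);

        /* When every input port of that creator has a value
           waiting, a new task becomes ready. */
        TaskCreator *creator = port->owner;
        if (all_ports_nonempty(creator)) {
            Task *t = create_task(creator,
                                  take_one_input_per_port(creator));
            mark_work_free(env, t);  /* assigner places it later */
        }
    }
}
\end{verbatim}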
   2.216  
   2.217  One survey[] discusses DSLs for a variety of domains, and this list of DSLs was copied from their paper:
   2.218  \begin{itemize} 
   2.219 @@ -437,7 +536,7 @@
   2.220   
   2.221   As shown in Fig. \ref{fig:eDSL_two_pieces}, to create an embedded style DSL (eDSL), one does two things: create the runtime, and create a wrapper-library that invokes the runtime and also implements the more complex language constructs.
   2.222  
   2.223 -As seen in Fig X, a library call that invokes a language construct is normally a thin wrapper that only communicates to the runtime. It places information to be sent to the runtime into a carrier, then invokes the runtime by suspending the base language execution and switching the processor over to the runtime code.
   2.224 +As seen in Fig X, a library call that invokes a language construct is normally a thin wrapper that only communicates to the runtime. It places information to be sent to the runtime into a carrier, then invokes the runtime via a primitive. The primitive suspends the base language execution and switches the processor over to the runtime code.
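
The following is a minimal sketch of such a wrapper, for an invented ``spawn'' construct; the carrier layout and the name of the suspend-and-switch primitive are assumptions made for illustration.
\begin{verbatim}
#include <stdlib.h>

/* Hypothetical carrier for an invented "spawn" construct. */
enum { MYLANG_SPAWN = 1 };
typedef struct {
    int    constructID;
    void (*fn)(void *);
    void  *args;
} SpawnCarrier;

void myLang_spawn(SlaveVP *caller, void (*fn)(void *), void *args)
{
    /* 1. Place the information the runtime needs into a carrier. */
    SpawnCarrier *carrier = malloc(sizeof(SpawnCarrier));
    carrier->constructID = MYLANG_SPAWN;
    carrier->fn          = fn;
    carrier->args        = args;

    /* 2. The primitive suspends the base-language execution and
          switches this processor over to the runtime code. */
    protoRuntime_send_request(caller, (ReqCarrier *)carrier);

    /* Execution resumes here when the runtime re-animates caller. */
}
\end{verbatim}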
   2.225  
   2.226  \subsection{The Proto-Runtime Modularization}
   2.227  
   2.228 @@ -458,7 +557,7 @@
   2.229  \subsection{Exporting a performance-oriented machine view }
   2.230  The proto-runtime interface exports a view of the machine that shows performance-critical aspects.  Machines that share the same architectural approach have the same performance-critical aspects, and differ only in the values. 
   2.231  
   2.232 -For example, cache-coherent shared-memory architectures can be modelled as a collection of memory pools connected by networks.  The essential variations among processor-chips are the sizes of the pools, the connections between them, such as which cores share the same L2 cache, and the latency and bandwidth between them.
   2.233 +For example, the interface models cache-coherent shared-memory architectures  as a collection of memory pools connected by networks.  The essential variations among processor-chips are the sizes of the pools, the connections between them, such as which cores share the same L2 cache, and the latency and bandwidth between them.
   2.234  
   2.235  Hence, a single plugin can be written that gathers this information from the proto-runtime and uses it when deciding which work to assign to which core.  Such a plugin will then be efficient across all machines that share the same basic architecture.
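
A sketch of such a plugin appears below; the machine-view query and the work-queue helpers are hypothetical stand-ins for whatever the proto-runtime interface actually exports.
\begin{verbatim}
/* Hypothetical sketch of a locality-aware assigner.  It prefers
   work whose data is likely still warm in a cache pool shared
   with the requesting core. */
SlaveVP *locality_assigner(int coreNum, LangEnv *env)
{
    for (SlaveVP *w = first_free_work(env); w != NULL;
         w = next_free_work(env, w)) {
        int home = last_core_run_on(w);
        /* Machine-view query: do the two cores share an L2 pool? */
        if (home < 0 || cores_share_pool(coreNum, home, POOL_L2))
            return take_work(env, w);
    }
    /* Nothing local is free; take any free work rather than idle
       (may return NULL, meaning the core stays idle). */
    return take_first_free_work(env);
}
\end{verbatim}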
   2.236  
   2.237 @@ -740,7 +839,7 @@
   2.238    \cline{2-7}
   2.239    \noalign{\vskip2pt}
   2.240    \hline
   2.241 -  Design & 19 & 6 & 3 & 52 & 11& 6\\
   2.242 +  Design & 19 & 6 & 3 & 52 & 18& 6\\
   2.243    Code & 13 & 3 & 3& 32 & 9& 12\\
   2.244    Test & 7 & 2 & 2& 12 & 8& 5\\
   2.245    L.O.C. & 470 & 290 & 310& 3000 & 690 & 780\\
   2.246 @@ -780,59 +879,29 @@
   2.247  
   2.248  %%%%%%%%%%%%%%%%%%%%%
   2.249  \section{Related Work}
   2.250 -Now  we compare the proto-runtime approach to other approaches for implementing domain specific languages.  The criteria are: level of effort to implement the DSL, effort to port the DSL, runtime performance, and support for application performance. The main alternative methods are: posix threads, user-level threads, TBB, modifying libGomp, and using hardware primitives to make a custom runtime.
   2.251 -
   2.252 -We talk about each of these approachs, then summarize the conclusions in Table \ref{tab:CriteriaVsApproach}.
   2.253 -
   2.254 -The first three methods involve building the DSL runtime on top of OS threads, user threads, or TBB, all of which are languages in their own right. So the DSL runtime runs on top of the runtime for that lower-level language.  This places control of work placement inside the lower-level runtime, blocking the DSL runtime, which hurts application-code performance, due to inability to use data locality. In addition, OS threads have operating system overhead and OS-imposed fairness requirements, which keeps runtime performance poor.
   2.255 -
   2.256 -All three also force the DSL implementation to manage concurrency explicitly, using language primitives like locks.  TBB may have a slight advantage due to its task-scheduling commands, but only for task-based languages. Hence, implementation effort is poor for these approaches.  
   2.257 -
   2.258 -For the same reason, for these three, the runtime code needs to be rewritten and tuned for each hardware platform, or else some form of hardware-abstraction placed into the runtime.  But putting in a hardware abstraction is essentially an alternative way of implementing half of the proto-runtime approach, but without the centralization, reuse, and modularization benefits.
   2.259 -
   2.260 -libGomp is used by many language researchers (based on informal discussions) because it has a very simple structure, making it relatively easy to modify, especially for simple languages. However, it provides no services such as debugging or performance tuning, and it has no modularization or reuse across languages benefits.  As the price of the simplicity, performance suffers, as seen in the experiments [].  Also, re-writes of the DSL runtime are required for each platform in order to tune it to hardware characteristics.
   2.261 -
   2.262 -However, because the runtime is directly modified, the language has control over placement of work. That lets the language use semantic information about communication between work-units to gain predictability about where input data may reside. That in turn enables scheduling algorithms that use the data locality predictions to reduce communication, for good application performance.
   2.263 -
   2.264 -The last alternative is to write a custom runtime from scratch, using hardware primitives like the Compare And Swap (CAS) instruction, or similar atomic read-modify-write instructions.  This approach requires the highest degree of implementation effort, and the worst portability across hardware.  However, if sufficient effort is expended on tuning, it can achieve the best runtime performance and equal the best performance of application code. So far, the gap has proven small between highly tuned language-specific custom runtime performance and that of our proto-runtime, but we only have the CILK implementation as a comparison point. 
   2.265 - 
   2.266 -Putting this all together, Table \ref{tab:CriteriaVsApproach} shows that only the proto-runtime approach makes initial language implementation fast, as well as reduces porting effort, while keeping runtime performance high and enabling high application performance. 
   2.267 -
   2.268 -\begin{center}
   2.269 -\begin{tabular}{|c|c|c|c|c|}\hline
   2.270 -  & \textbf{impl.}& \textbf{porting} & \textbf{runtime} & \textbf{application} \\
   2.271 -\textbf{} & \textbf{effort} & \textbf{effort} & \textbf{perf.} & \textbf{perf.}\\\hline
   2.272 -\textbf{OS Threads} & poor & poor & poor & poor \\\hline
   2.273 -\textbf{User Threads} & poor& poor & medium & poor \\\hline
   2.274 -\textbf{TBB} & poor & poor & medium & poor \\\hline
   2.275 -\textbf{libGomp} & medium & poor & medium & good \\\hline
   2.276 -\textbf{HW primitives} & poor & poor & good & good \\\hline
   2.277 -\textbf{Proto-runtime} & good & good & good & good \\\hline
   2.278 -\end{tabular}
   2.279 -\end{center}
   2.280 -\caption{Level of achievement for each approach to creating DSL runtimes}
   2.281 -\label{tab:CriteriaVsApproach}
   2.282 -
   2.283  
   2.284  
   2.285  %%%%%%%%%%%%%%%%%%%%%
   2.286  \section{Conclusion and Future Work}
   2.287  \label{sec:conclusion}
   2.288 -[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]]
   2.289 -[[Claims: modularize runtime, mod is fund patterns, mod sep lang logic from RT internals, mod makes internal reusable & lang inherit internal perf tune & inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   2.290 +[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]] 
   2.291  
   2.292 -We have shown how the proto-runtime approach modularizes runtime code, allowing the low-level hardware-intensive portion to be inherited by all languages built on top of it.  This modularization reduces the effort of implementing a new language, especially an embedded-style one, by simplifying the language-specific portion, reducing it, and  reusing it, as demonstrated in practice. While reducing the effort, the proto-runtime approach also maintains high runtime performance, by separating out the tuning-intensive part, so the tuning effort is amortized over all the languages, which inherit the improved runtime performance without any effort by the language creators.
   2.293 +We have shown how the proto-runtime approach modularizes runtime code, in a way that appears applicable to any language or execution model. It isolates the hardware-specific portion from the language behavior, as well as from the language-driven placement of work onto resources, providing interfaces between them.
   2.294  
   2.295 -We have demonstrated the benefits of centralized services inherited by the languages, such as debugging facilities, automated verification, concurrency handling, hardware performance information gathering, and so on.  By genericizing these
   2.296 +  The modularization reduces the effort of implementing a new language, especially an embedded-style one, where runtime creation is a significant portion of the total effort.  The low-level hardware portion is reused by each language. The behavior implementation is simplified because shared state is handled inside the proto-runtime, which exports a sequential interface for the behavior module to use. This simplification reduces effort, as do reuse of the hardware-specific portion, reuse of behavior code from one language to another, reuse of assignment code, and implementors' familiarity with the modular structure. The overall effort reduction is supported by our measurements of implementation effort. 
   2.297  
   2.298 -We have also demonstrated how application features and language features gain access to the decision-making for which core work is assigned to.  This provides the language with the opportunity to directly control placement of work, for higher performance of application code.
   2.299 +The proto-runtime approach makes it practical to maintain high overall runtime performance, with low effort for the language implementor. It is practical because high effort is put into performance-tuning the hardware-specific proto-runtime, which is then reused by each language. In this way the performance derived from the high tuning effort is inherited without extra effort by the language creators, thus amortizing the cost.
   2.300 +
   2.301 +Centralized services, such as debugging facilities, automated verification, concurrency handling, and hardware performance information gathering, were implemented inside the proto-runtime portion. We showed how they are reused by the languages. 
   2.302 +
   2.303 +Although we didn't measure it, we indicated how application performance can be increased by giving the language direct control over the placement of work, to take advantage of data affinity or application-generated communication patterns. This ability comes from the assignment module, which provides the language implementor with control over which core each unit of work is assigned to, and over the order of executing the work units.
   2.304  
   2.305  Work on the proto-runtime approach is in its infancy, and much remains to be done, including:
   2.306  \begin{itemize} 
   2.307 -\item  Creating related interfaces for use with distributed memory hardware, and interfaces for hierarchical runtimes, to improve performance on many-level hardware such as high-performance computing, and to tie together runtimes for different types of architecture, to cover heterogeneous architectures and machines.
   2.308 +\item  Creating related interfaces for use with distributed memory hardware, and interfaces for hierarchical runtimes, to improve performance on many-level hardware such as high-performance computers, and to tie together runtimes for different types of architecture, to cover heterogeneous architectures and machines.
   2.309  \item Extending the proto-runtime interface to present hardware information that a work-assigner will need, but in a generic way that remains constant across many hardware configurations yet exposes all relevant information.
   2.310  \item Exploring work assignment implementations that take advantage of language and application knowledge to improve placement of work to gain higher application performance.
   2.311 -\item Applying the proto-runtime approach to support a portability software stack, and supplying OS services to applications via the proto-runtime, to further increase application-code portability.
   2.312 +\item Applying the proto-runtime approach to support a portability software stack, and supply OS services to applications via the proto-runtime, to further increase application-code portability.
   2.313  \end{itemize}
   2.314  
   2.315