VMS/0__Writings/kshalle

changeset 92:cdd1852fe804

VMS__Full_conf_paper_2.tex -- checkpoint -- about to copy intro and insert paper body directly into it.. getting rid of bunch of background stuff that doesn't cleanly fit in..
author Sean Halle <seanhalle@yahoo.com>
date Mon, 08 Oct 2012 23:05:18 -0700
parents d8024c56ef61
children d005f9012126
files 0__Papers/VMS/VMS__Foundation_Paper/VMS__Full_conference_version/latex/VMS__Full_conf_paper_2.tex
diffstat 1 files changed, 343 insertions(+), 65 deletions(-) [+]
line diff
     1.1 --- a/0__Papers/VMS/VMS__Foundation_Paper/VMS__Full_conference_version/latex/VMS__Full_conf_paper_2.tex	Mon Oct 08 23:03:26 2012 -0700
     1.2 +++ b/0__Papers/VMS/VMS__Foundation_Paper/VMS__Full_conference_version/latex/VMS__Full_conf_paper_2.tex	Mon Oct 08 23:05:18 2012 -0700
     1.3 @@ -45,7 +45,7 @@
     1.4  \copyrightdata{[to be supplied]} 
     1.5  
     1.6  \titlebanner{banner above paper title}        % These are ignored unless
     1.7 -\preprintfooter{short description of paper}   % 'preprint' option specified.
     1.8 +\preprintfooter{short description of paper}   % 'preprint' option specified.
     1.9  
    1.10  
    1.11  \title{A Proto-Runtime Approach to Domain Specific Languages}
    1.12 @@ -68,13 +68,10 @@
    1.13  
    1.14  
    1.15  \begin{abstract}
    1.16 - Domain Specific Languages that are embedded into a base language have promise to solve the productivity, disruption, mental model, and porting problems of parallel software. However such languages have too few users to support the large effort required to create them, resulting in low adoption.
    1.17 -
    1.18 -To solve this, we introduce a proto-runtime approach, which reduces the effort to create and port domain specific languages in multiple ways. It modularizes the creation of runtime systems and the parallelism constructs they implement, separating the language-construct logic, and scheduling logic away from the low-level runtime details of concurrency, memory consistency, and runtime-performance related code. 
    1.19 -
    1.20 -As a result, new parallel constructs are written using sequential reasoning, and are easily reused across languages, and scheduling of work onto hardware is under language and application control, without interference from an underlying thread package scheduler, which enables high application performance.
    1.21 -
    1.22 -We present measurements of the time to develop new languages, as well as time to re-implement existing ones,  which reduces to a matter of hours.  In addition, we measure performance of proto-runtime based implementations going head-to-head with the standard distributions of Cilk, OpenMP, StarSs (OMPSs), and posix threads, showing that the proto-runtime outperforms on large servers in all cases.
     1.23 + Domain Specific Languages that are embedded into a base language show promise to provide productivity, performance portability, and wide adoption for parallel programming. However, such languages have too few users to support the large effort required to create them, resulting in low uptake of the method.
     1.24 +To solve this, we introduce a proto-runtime approach, which reduces the effort to create and port domain specific languages. It modularizes the creation of runtime systems and the parallelism constructs they implement, separating the language-construct logic and scheduling logic from the low-level runtime details of concurrency, memory consistency, and runtime-performance related code.
     1.25 +As a result, new parallel constructs are written using sequential reasoning and are easily reused across languages, while scheduling of work onto hardware is under language and application control, without interference from an underlying thread-package scheduler. This enables higher quality scheduling decisions and hence higher application performance.
     1.26 +We present measurements of the time taken to develop new languages, as well as the time to re-implement existing ones, which average a few days each.  In addition, we measure the performance of proto-runtime based implementations going head-to-head with the standard distributions of Cilk, OpenMP, StarSs (OMPSs), and Posix threads, showing that the proto-runtime versions match or outperform them on large servers in all cases.
    1.27  \end{abstract}
    1.28  
    1.29  
    1.30 @@ -85,34 +82,95 @@
    1.31  \section{Introduction}
    1.32  \label{sec:intro}
    1.33  
    1.34 -Programming in the past has been overwhelmingly sequential, with the applications being run on sequential hardware.  But the laws of physics have forced the hardware to become parallel, even down to embedded devices like phones. The trend is unstoppable, eventually forcing essentially all future programming to  become parallel programming.  The only reason it is not already is due to the difficulty of the traditional parallel programming approaches. 
     1.35 +[Note to reviewers: this paper's style and structure follow the official PPoPP guide to writing style, which is linked from the PPoPP website. We are taking on faith that the approach has been communicated effectively to reviewers and that we won't be penalized for following its unorthodox structure.]
    1.36  
    1.37 -The problems with parallel programming fall into three main categories: 1) a difficult mental model, 2) having to rewrite the code for each hardware target to get acceptable performance 3) disruption to existing practices (including steep learning curve, change in tools, and change in design practices). Many believe that these can all be overcome with the use of Domain-Specific Languages. But such languages have been costly to create and port across hardware targets, which makes them impractical given the small number of users of each language, and so have not caught on.
     1.38 +Programming in the past has been overwhelmingly sequential, with applications run on sequential hardware.  But the laws of physics have forced the hardware to become parallel, even down to embedded devices such as smartphones. The trend appears unstoppable, forcing essentially all future programming to become parallel programming.  However, sequential programming remains the dominant approach, due to the difficulty of traditional parallel programming methods. 
    1.39  
    1.40 -We propose that a method that makes Domain Specific Languages (DSLs) low cost to produce as well as to port across hardware targets will allow them to fulfill their promise, and we introduce what we call a proto-runtime to do this.  A proto-runtime is a normal, full, runtime, but with two key pieces replaced by an interface. One piece is the logic of language constructs, the other is the logic for choosing which core to assign work to. The remaining portion is the proto-runtime, which comprises low-level details of internal runtime communication between cores and protecting shared runtime state during concurrent accesses by the plugged-in pieces.
     1.41 +The difficulties with parallel programming fall into three main categories: 1) a difficult mental model, 2) having to rewrite the code for each hardware target to get acceptable performance, and 3) disruption to existing practices, including a steep learning curve, changes to the tools used, and changes in design practices. Many believe that these can all be overcome with the use of Domain-Specific Languages. But such languages have been costly to create and port across hardware targets, which makes them impractical given the small number of users of each language, and so they have not caught on.
    1.42  
    1.43 -We claim the following features and benefits of the proto-runtime approach, which we shall  support throughout the rest of this paper:
    1.44 +We propose that a method that makes Domain Specific Languages (DSLs) low cost to produce as well as to port across hardware targets will allow them to fulfill their promise, and we introduce what we call a proto-runtime to help with this.  
    1.45 +
     1.46 +A proto-runtime is a normal, full runtime, but with two key pieces replaced by an interface. One piece is the logic of the language constructs; the other is the logic for choosing which core to assign work to. What remains is the proto-runtime, which comprises the low-level details of internal runtime communication between cores and the protection of shared runtime state during concurrent accesses performed by the plugged-in pieces.
    1.47 +
     1.48 +The decomposition into a proto-runtime plus plugged-in language behaviors modularizes the construction of runtimes.  The proto-runtime is one module; it embodies the runtime internals, which are hardware oriented and independent of language. The plugged-in portions form the other two modules, which are language specific. The interface between them lies at a natural boundary that separates the hardware oriented portion of a runtime from the language oriented portion. 
    1.49 +
    1.50 +We claim the following benefits of the proto-runtime approach, each of which is  supported in the indicated section of  the paper:
    1.51  
    1.52  \begin{itemize}
    1.53 -\item The proto-runtime approach modularizes the runtime, which results in reduced time to implement a new language's behavior, especially for embedded style languages.
    1.54 -\item The modularization is effective across languages because it is based on fundamental patterns -- of parallel computation, runtimes, and synchronization constructs.
    1.55 -\item The modularization simplifies by separating language behavior logic from runtime internals (the proto-runtime portion, which communicates between cores and  protects shared runtime data).
    1.56 +\item The proto-runtime approach modularizes the runtime (\S\ ).
    1.57  
    1.58 -\item The modularization causes  time reduction by making the internal portion of runtimes reusable across languages.
    1.59 -\item  The modularization also provides  time reduction by allowing the language logic to be designed and implemented with sequential thinking, due to separating that logic from the internals that protect shared runtime data, which is where the concurrency issues are handled. This separation is not possible when implementing language logic in terms of a package such as Posix threads or TBB (unless one modifies or uses the package according to the proto-runtime pattern).
    1.60 -\item The modularization also causes time reduction by making it practical to reuse the language logic by pulling it from one language into another, due to the well-defined interfaces, and the modular patterns the approach  uses to implement that logic.
    1.61 -\item The modularization causes iheritance, by all languages, of the effort spent performance tuning the internal portion (the proto-runtime), because the dominant factors affecting runtime performance are concentrated in the internal portion. 
    1.62 -\item The modularization causes integration of the language-implemented scheduler into the low level control of hardware (in  contrast to building a language on top of a layer that has its own hardware assignment,  isolated from the language implementation, such as the case when using a package like Posix threads or TBB).
    1.63 +\item The modularization  is consistent with patterns that appear to be fundamental to parallel computation and runtimes (\S\ ). 
    1.64  
    1.65 -\item The modularization causes inheritance by all languages of the centralized services placed into the proto-runtime, such as debugging facilities, automated verification, concurrency handling, hardware performance information gathering, and so on.
    1.66 +\item The modularization  cleanly separates hardware oriented runtime internals from the logic of the language (\S). 
    1.67 +
     1.68 +\item Those who adopt the proto-runtime approach can rely on it applying to future languages and hardware, because the patterns underlying it appear to be fundamental and so should hold equally well for as-yet undiscovered languages and architectures (\S\ ).
    1.69 +
    1.70 +
    1.71 +\item The modularization results in reduced time to implement a new language's behavior, and in reduced time to port a language to new hardware (\S\ ).
    1.72 +
    1.73 +\begin{itemize}
    1.74 +\item Part of the time reduction is due to reuse of the runtime's internal hardware-oriented portion  by all languages (\S \ref{sec:intro}).
    1.75 +
    1.76 +
     1.77 +\item Part of the time reduction is due to all languages inheriting the effort of performance tuning the runtime internals, so the language doesn't have to tune the runtime to the hardware  (\S\ ).  
     1.78 +
     1.79 +\item  Part of the time reduction is due to the use of sequential thinking when implementing the language logic. This is possible because the proto-runtime provides protection of shared internal runtime state, and exports an interface that presents a sequential model  (\S\ ). 
     1.80 +
     1.81 +\item Part of the time reduction is due to the modularization making it practical to reuse language logic from one language to another  (\S\ ).
     1.82 +
     1.83 +\item  Part of the time reduction is due to the proto-runtime providing a centralized location for services that all languages use, so each language doesn't have to provide them separately.  Such services include debugging facilities, automated verification, concurrency handling, hardware performance information gathering, etc.  (\S\ ).
    1.84  
    1.85  \end{itemize}
    1.86  
    1.87 +\item
    1.88 +
     1.89 +The modularization also gives the language low-level control over placement of work onto the hardware. This allows application information and language semantic information to be used in deciding which core a given unit of work executes on, which can reduce communication between cores and increase performance  (\S\ ).
    1.90 +
    1.91 +\begin{itemize}
    1.92 +
    1.93 +\item Similar control over hardware is not possible when the language is   built on top of a layer that has its own hardware assignment, such as a package like Posix threads or TBB  (\S\ ).
    1.94 +
    1.95 +\end{itemize}
    1.96 +
    1.97 +\item Modularization with similar benefits does not appear possible when using a package such as Posix threads or TBB,  unless the package is modified to conform to a proto-runtime interface or used  according to the proto-runtime pattern  (\S\ ).
    1.98 +
    1.99 +\end{itemize}
   1.100 +
   1.101 +The paper is organized as follows: In 
   1.102 +\S \ref{sec:DSLHypothesis} we expand on our hypothesis that an embedded style DSL (eDSL) provides high programmer productivity, with a low learning curve. Further, in \S we show that when an application is written in a well designed eDSL, porting it to new hardware becomes simpler because often only the language needs to be ported.  That is because the elements of the problem being solved that require large amounts of computation are often pulled into the language. Lastly, in \S we hypothesize that switching from sequential programming to using an eDSL is low disruption because the base language remains the same, along with most of the development tools and practices.
   1.103 +
   1.104 +In \S \ref{sec:DSLHypothesis} we show that the small number of users of an eDSL means that the eDSL must be very low effort to create, and also low effort to port to new hardware.  At the same time, the eDSL must remain very high performance across hardware targets. 
   1.105 +
   1.106 +In \S we analyze where the effort of creating an eDSL is expended. It turns out that in the traditional approach, it is mainly expended in creating the runtime, and in performance tuning the major domain-specific constructs. We use this to support the case that speeding up runtime creation makes eDSLs more viable. 
   1.107 +
    1.108 +In \S we take a step back and examine what the industry-wide picture would be if the eDSL approach were adopted. A large number of eDSLs would come into existence, each with its own set of runtimes, one runtime for each hardware target.  That causes a multiplicative effect: the number of runtimes will equal the number of eDSLs times the number of hardware targets.  Unless the effort of implementing runtimes is reduced, this multiplicative effect could dominate, which would retard the uptake of eDSLs.    
   1.109 +
    1.110 +Following that background on DSLs, in \S we move on to the details of the proto-runtime approach. In \S we provide details of how a runtime is modularized, showing which responsibilities are encapsulated in which modules, and what the interfaces between them look like. We show how this lets the proto-runtime be reused by all languages on given hardware, and how the low-level tuning of the proto-runtime for specific hardware automatically benefits all the languages that run on that hardware.   
   1.111 +
   1.112 +We follow this in \S with an in-depth look at implementing language logic, and show how the proto-runtime interface allows it to use sequential thinking. We then give similar detail in \S on the implementation of the assigner, which chooses what core executes each chunk of work. We discuss how that has the potential to improve application performance by reducing communication between cores and reducing idle time of cores.  In \S we support our belief that the patterns we followed when modularizing are indeed fundamental and will remain valid for future languages and hardware. In \S we discuss some of the centralized services provided by the current proto-runtime implementation, as well as planned future ones. Then in \S we give an example of reusing language logic from one language implementation to another. 
   1.113 +
    1.114 +With the background on eDSLs and description of the proto-runtime approach behind us, we then provide overhead measurements in \S and implementation time measurements in \S.  The overhead results in \S indicate that the proto-runtime approach has far lower overhead than even the current highly tuned Linux thread implementation; we also discuss why equivalent user-level M-to-N thread packages haven't been pursued, leaving no viable user-level libraries to compare against.  In \S we give numbers indicating that the proto-runtime approach is also competitive with Cilk, OpenMP, and OMPSs on large multi-core servers.
   1.115 +
    1.116 +\S gives a summary of the development time of the various embedded languages created so far.  Unfortunately, no control is available to compare against, but we provide estimates, based on anecdotal evidence, of the time taken to develop the versions compared against for overhead in \S.  We continue in \S with a bigger picture discussion of the difference in design methods between the traditional approaches and the proto-runtime implementations. In \S we discuss OpenMP versus the equivalent proto-runtime version, VOMP.  In \S we discuss Cilk 5.4 vs the proto-runtime VCilk. In \S we discuss pthreads vs Vthread, and in \S OMPSs vs VSs.  These discussions attempt to convey the two design philosophies and paint a picture of the development process in the two competing approaches.  The goal is to illustrate how the proto-runtime approach maintains many of the features through its centralized services, while significantly reducing implementation time through reuse of those services, elimination of concurrency concerns from design and debugging, the simplifications in design and implementation brought by the clean modularization, and the regularization of implementation from one language to the next.
   1.117 +
   1.118 +Then, with the full understanding of the proto-runtime approach in hand, we discuss in \S  how it compares to related work.
   1.119 +
   1.120 +Finally, we highlight the main conclusions drawn from the work in \S .
   1.121 +
   1.122 +
   1.123  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   1.124  \section{The Problem}
   1.125  \label{sec:problem}
   1.126 -[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]]
   1.127 -[[Claims: modularize runtime, mod is fund patterns, mod sep lang logic from RT internals, mod makes internal reusable & lang inherit internal perf tune & inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.128 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   1.129 +
   1.130 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
   1.131 +
   1.132 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
   1.133 +
   1.134 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
   1.135 +
   1.136 +
   1.137 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.138  
    1.139  While talking about the problems encountered by Domain Specific Languages (DSLs), we focus on implications for the runtime system, due to its central role in the claims.  At the same time we will support the hypothesis that embedded-style DSLs are high-productivity for application programmers, have a low learning curve, and cause low disruption to current programming practices.  While doing this we lay the groundwork for the next section, where we show that the main effort of implementing embedded-style DSLs is creating the runtime, and that when using the proto-runtime approach, embedded-style DSLs are low-effort to create and port, moving the effort of porting for high performance out of the application and into the language.
   1.140  
   1.141 @@ -124,8 +182,16 @@
   1.142  
   1.143  \subsection{Classifying parallel languages by virtual processor based vs task based}
   1.144  \label{subsec:ClassifyingLangs}
   1.145 -[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]]
   1.146 -[[Claims: modularize runtime, mod is fund patterns, mod sep lang logic from RT internals, mod makes internal reusable & lang inherit internal perf tune & inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.147 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   1.148 +
   1.149 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
   1.150 +
   1.151 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
   1.152 +
   1.153 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
   1.154 +
   1.155 +
   1.156 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.157  
   1.158  One major axis for classifying parallel languages is whether they are virtual processor based or task based, which has implications for the structure of the runtime.
   1.159  
   1.160 @@ -144,12 +210,20 @@
   1.161  
   1.162  \subsection{Domain specific parallel languages}
   1.163  \label{subsec:DomSpecLangs}
   1.164 -[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]]
   1.165 -[[Claims: modularize runtime, mod is fund patterns, mod sep lang logic from RT internals, mod makes internal reusable & lang inherit internal perf tune & inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.166 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   1.167 +
   1.168 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
   1.169 +
   1.170 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
   1.171 +
   1.172 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
   1.173 +
   1.174 +
   1.175 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.176  
   1.177  Now we'll talk about the sub-class of Domain Specific Languages (DSLs): what sets them apart from other parallel languages, how they potentially solve the issues with parallel programming, and the implications for their runtime implementations.
   1.178  
   1.179 -DSLs can be any of the three basic language types, but they are distinguished by having constructs that correspond to features of one narrow domain of applications.  For example, we have implemented a DSL that is just for use in building hardware simulators [cite the HWSim wiki].  Its constructs embody the structure of simulators, and make building one fast and even simpler than when using a sequential language, as will be shown in Subsection [].  The programmer doesn't think about concurrency, nor even about control flow, they simply define behavior of individual hardware elements and connect them to each other.
    1.180 +DSLs can be any of the three basic language types (VP-based, task-based, or hybrid), but they are distinguished by having constructs that correspond to features of one narrow domain of applications.  For example, we have implemented a DSL that is just for use in building hardware simulators [cite the HWSim wiki].  Its constructs embody the structure of simulators, and make building one fast and even simpler than when using a sequential language, as will be shown in Subsection [].  The programmer doesn't think about concurrency, nor even about control flow; they simply define the behavior of individual hardware elements and connect them to each other.
   1.181  
   1.182  It is this fit between language constructs and the mental model of the application that makes DSLs highly productive and easy to learn, at the same time, it is also what makes applications written in them more portable.  Application patterns that have strong impact on parallel performance are captured as language constructs.  The rest of the source code has less impact on parallel performance, so just porting the language is enough to get high performance on each hardware target.
   1.183  
   1.184 @@ -157,12 +231,21 @@
   1.185  
   1.186  \subsection{The embedded style of DSL}
   1.187  \label{subsec:EmbeddedDSLs}
   1.188 -[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]]
   1.189 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   1.190  
   1.191 -We segue now into the embedded style of language, and show how the work of implementing them is mainly the work of implementing their runtime plus the complex domain constructs. We focus on  embedded style domain specific languages because it is the least effort-to-create form of DSL, and making DSLs practical requires it to be low effort to create them and  port them to various hardware targets.
   1.192 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
   1.193  
   1.194 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
   1.195  
   1.196 -An embedded-style language is one that uses the syntax of a base language, like C or Java, and adds constructs that are invoked by making a library call, as illustrated in Figure \ref{fig:EmbeddedEx}. Inside the library call, a primitive is used to escape the base language and enter the  embedded language's runtime, which then performs the behavior of the construct.
   1.197 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
   1.198 +
   1.199 +
   1.200 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.201 +
    1.202 +We segue now into the embedded style of language, and show how the work of implementing such a language is mainly the work of implementing its runtime plus its complex domain constructs. We focus on embedded style domain specific languages because they are the least effort-to-create form of DSL, and making DSLs practical requires that they be low effort to create and to port to various hardware targets.
   1.203 +
   1.204 +
    1.205 +An embedded-style language is one that uses the syntax of a base language, like C or Java, and adds constructs that are specific to the domain. An added construct may be expressed in custom syntax that is translated into a library call, or else invoked directly by making a library call, as illustrated in Figure \ref{fig:EmbeddedEx}. Inside the library call, a primitive is used to escape the base language and enter the embedded language's runtime, which then performs the behavior of the construct.
   1.206  
   1.207  
   1.208  \begin{figure}[h!tb]
   1.209 @@ -192,13 +275,22 @@
   1.210  \end{figure}
   1.211  An embedded-style language differs from a library in that it has a runtime system, and a way to switch from the behavior of the base language to the behavior inside the runtime.  In contrast, libraries never leave the base language.  Notice that this means, for example, that a posix threads library is not a library at all, but an embedded language.
   1.212  
   1.213 -As a practical matter, embedded-style constructs normally have a thin wrapper that invokes the runtime, however, for DSLs, some perform significant effort inside the library before switching to the runtime, or else after returning from the runtime.  These look more like traditional libraries, but still involve an escape from the base language and more importantly are designed to work in concert with the parallel aspects of the language. They  concentrate key performance-critical aspects of the application inside the language, such as dividing work up, or, say, implementing a solver for differential equations that accepts structures created by the divider.
    1.214 +As a practical matter, embedded-style constructs normally have a thin wrapper that invokes the runtime. However, some DSLs perform significant work inside the library before switching to the runtime, or else after returning from the runtime.  These look more like traditional libraries, but they still involve an escape from the base language and, more importantly, are designed to work in concert with the parallel aspects of the language. They concentrate key performance-critical aspects of the application inside the language, such as dividing work up, or, for example, implementing a solver for differential equations that accepts structures created by the divider.
   1.215  
   1.216  It is the appearance of constructs being library calls that brings the low-disruption benefit of embedded-style DSLs.  The syntax is that of the base language, so the existing development tools and work flows remain intact when moving to an embedded style DSL.  In addition, the fit between domain concepts and language constructs minimizes mental-model disruption when switching and makes the learning curve to adopt the DSL very low. 
   1.217  
   1.218  \subsection{Application programmer's view of embedded-style DSLs}
   1.219  \label{subsec:AppProgViewOfDSL}
   1.220 -[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]]
   1.221 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   1.222 +
   1.223 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
   1.224 +
   1.225 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
   1.226 +
   1.227 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
   1.228 +
   1.229 +
   1.230 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.231  
   1.232  Well designed DSLs have very few constructs, yet capture the most performance-critical domain patterns, in a way that feels natural to the application programmer.  This often means that data structures and usage patterns are part of the language. 
   1.233  
   1.234 @@ -206,17 +298,25 @@
   1.235  
   1.236  An example of a DSL that we created using the proto-runtime approach is HWSim [], which is designed to be used for writing architectural simulators. 
   1.237  
   1.238 -When using HWSim, a simulator application is composed of just three things: netlist, behavior functions and timing functions. These are all sequential code that call HWSim constructs at boundaries, like the end of behavior, and use HWSim supplied data structures. To use HWSim, one creates a netlist composed of elements and communication paths that connect them.  A communication path connects an outport of the sending element to an inport of the receiving element. One then attaches  an action to the input, which is triggered when an communication arrives. The action has  a behavior function, which changes the state of the element,  and a timing function which calculates how much simulated time the behavior takes.   
    1.239 +When using HWSim, a simulator application is composed of just three things: a netlist, behavior functions, and timing functions. All of these are sequential code that calls HWSim constructs at boundaries, such as the end of a behavior, and uses HWSim-supplied data structures. To use HWSim, one creates a netlist composed of elements and communication paths that connect them.  A communication path connects an outport of the sending element to an inport of the receiving element. An action is then attached to the inport. The action is triggered when a communication arrives. The action has a behavior function, which changes the state of the element, and a timing function, which calculates how much simulated time the behavior takes.   
   1.240  
   1.241 -The language itself consists of only a few standard data structures, such as \texttt{Netlist}, \texttt{Inport}, \texttt{Outport},  and a small number of constructs, such as \texttt{send\_comm} and \texttt{end\_behavior}.  The advancement of simulated time is performed by a triggered action, and so is implied. The parallelism is also implied, by the order of execution of actions being constrained only by consistency.  
    1.242 +The language itself consists of only a few standard data structures, such as \texttt{Netlist}, \texttt{Inport}, \texttt{Outport},  and a small number of constructs, such as \texttt{send\_comm} and \texttt{end\_behavior}.  The advancement of simulated time is performed by a triggered action, and so is implied. The parallelism is also implied, in that the order of execution of actions is constrained only by consistency.  
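The shape of a behavior/timing pair attached to an inport can be sketched as below. The types and function signatures here are assumptions for illustration only; HWSim's actual API may differ.

```c
#include <assert.h>

/* Illustrative stand-in for an HWSim element; a behavior function may only
 * touch data local to the element it is attached to. */
typedef struct Element { int state; } Element;

/* Behavior function: changes the state of the element */
static void adder_behavior(Element *e, int msg) { e->state += msg; }

/* Timing function: returns how much simulated time the behavior takes */
static int adder_timing(Element *e, int msg) { (void)e; (void)msg; return 3; }

/* A triggered action couples a behavior function to a timing function; the
 * advance of simulated time is implied by the value the timing fn returns. */
static int trigger_action(Element *e, int msg) {
    adder_behavior(e, msg);
    return adder_timing(e, msg);
}
```

Everything here is sequential code; parallelism comes only from multiple actions being triggered with no ordering constraint beyond consistency.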
   1.243  
   1.244 -The only parallelism-related restriction is that a behavior function may only use data local to the element it is attached to.   Parallel work is created within the system by outports that connect to multiple destination inports which means one output triggers mutliple actions, and by behavior functions that generate multiple output communications each.
   1.245 +The only parallelism-related restriction is that a behavior function may only use data local to the element it is attached to.   Parallel work is created within the system by outports that connect to multiple destination inports which means one output triggers multiple actions, and by behavior functions that generate multiple output communications each.
   1.246  
   1.247 -Overall, simulator writers have fewer issues to deal with because time-related code has been brought inside the language, where it is reused across simulators, and because parallelism issues reduce to simply being restricted to data local to the attached element.  Both these increase productivity of simulator writers, despite using a parallel language.  The language has so few commands that it takes a matter of days to become proficient (as demonstrated informally by new users of HWSim).  Also, parallelism related constructs in the language are generic to hardware, eliminating the need to modify application code when porting to new hardware (if the language is used according to the recommended coding style).     
   1.248 +Overall, simulator writers have fewer issues to deal with because time-related code has been brought inside the language, where it is reused across simulators, and because parallelism issues reduce to simply being restricted to data local to the attached element.  Both these increase productivity of simulator writers, despite using a parallel language.  The language has so few commands that it takes only a matter of days to become proficient (as demonstrated informally by new users of HWSim).  Also, parallelism related constructs in the language are generic across hardware, eliminating the need to modify application code when porting to new hardware (if the language is used according to the recommended coding style).     
   1.249  
   1.250  \subsection{Implementation of Embedded-style DSLs}
   1.251 -[[Hypothesis: Embedded-style DSLs -- high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets, centered on runtime and complex domain constructs]]
   1.252 -[[Claims: modularize runtime, mod is fund patterns, mod sep lang logic from RT internals, mod makes internal reusable & lang inherit internal perf tune & inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.253 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   1.254 +
   1.255 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
   1.256 +
   1.257 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
   1.258 +
   1.259 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
   1.260 +
   1.261 +
   1.262 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.263  
   1.264  When it comes to implementing an embedded-style of DSL, the bulk of the effort is in the runtime and the more complex domain specific constructs.
   1.265  
   1.266 @@ -235,12 +335,24 @@
   1.267  
   1.268  
   1.269  \subsection{Implementation Details of Embedded-style DSLs}
   1.270 -Figure [] shows\ the implementation of the wrapper library for HWSim's send\_and\_idle construct, which sends a communication on the specified outport, and then causes the sending element to go idle. Of note is the packaging of information for the runtime, by placing it into the HWSimSemReq data structure, and then ending the application work by switching to the runtime. The switch is via the send\_and\_suspend call, which is a primitive implemented in assembly that jumps out of the base C language and into the runtime.
   1.271 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   1.272  
   1.273 -The switch to the runtime can be done in multiple ways.  Our proto-runtime uses assembly to manipulate the stack and registers. For posix threads implemented in Linux, the hardware trap instruction switches from application to the OS, which serves as the runtime that implements the thread behavior. Other forms of hardware [Nexus] implement specific constructs, and the runtime is a hybrid of hardware behavior and code, where a special instruction may switch to the hardware runtime, or a standard runtime may use the hardware to accelerate its internal work.
   1.274 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
   1.275  
   1.276 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
   1.277  
   1.278 -To understand how the core gets used by the construct implementation, consider the two types of runtime, those for VP based languages and those for task based languages.
   1.279 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
   1.280 +
   1.281 +
   1.282 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.283 +
   1.284 +?
   1.285 +
    1.286 +Figure [] shows the implementation of the wrapper library for HWSim's send\_and\_idle construct, which sends a communication on the specified outport, and then causes the sending element to go idle. Of note is the packaging of information for the runtime: it is placed into the HWSimSemReq data structure, after which the application work is ended by switching to the runtime. The switch is via the send\_and\_suspend call, which is a primitive implemented in assembly that jumps out of the base C language and into the runtime.
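A hedged sketch of what such a wrapper might look like follows. The fields of HWSimSemReq, the enum value, and the signature of send_and_suspend are assumptions for illustration; the real send_and_suspend is an assembly primitive that suspends the caller rather than a plain C function.

```c
#include <assert.h>

/* Assumed shape of the request struct the wrapper fills in */
typedef struct { int reqType; void *outport; void *msg; } HWSimSemReq;

enum { HWSIM_SEND_AND_IDLE = 7 };   /* illustrative construct tag */

/* Stand-in for the assembly primitive that jumps out of C into the runtime */
static int send_and_suspend(HWSimSemReq *req) { return req->reqType; }

int HWSim__send_and_idle(void *outport, void *msg) {
    HWSimSemReq req;
    req.reqType = HWSIM_SEND_AND_IDLE;  /* tells the runtime which construct */
    req.outport = outport;              /* package info for the runtime */
    req.msg     = msg;
    return send_and_suspend(&req);      /* end app work, switch to runtime */
}
```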
   1.287 +
    1.288 +The switch to the runtime can be done in multiple ways.  Our proto-runtime uses assembly to manipulate the stack and registers. For the posix threads language, when implemented on Linux, the hardware trap instruction is used to switch from the application to the OS. The OS serves as the runtime that implements the thread behavior. 
   1.289 +
    1.290 +The core is used by the construct implementation differently for VP based languages than for task based languages.
   1.291  
   1.292  For VP based languages, once inside the runtime,  a synchronization construct performs the behavior shown abstractly in Figure []. In essence, a synchronization construct is a variable length delay, which waits for activities outside the calling code to cause specific conditions to become true.  These activities could be actions taken by other pieces of application code, such as releasing a lock, or they could be hardware related, such as waiting for a DMA transfer to complete.  
   1.293  
   1.294 @@ -249,9 +361,9 @@
   1.295  These are the two behaviors a construct performs inside the runtime: managing conditions on which work is free, and managing assignment of free work onto cores.
   1.296  
    1.297  For task based languages, a task runs to completion and then always switches to the runtime at the end.  Hence, no suspend and resume exists. Once inside, the runtime's job is to track the conditions under which tasks become ready to run, or to be created.  For example, in dataflow, a task is created only once all conditions for starting it are met.  Hence, the only language constructs are "instantiate a task-creator", "connect a task creator to others", and "end a task".  During a run, all of the runtime behavior takes place inside the "end a task" construct, where the runtime sends outputs from the ending task to the inputs of connected task-creators.  The "send" action modifies internal runtime state, which represents the order of inputs to a creator on all of its input ports. When all inputs are ready, the runtime creates a new task, and then, when hardware is ready, assigns the task to a core.
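The "end a task" logic just described can be sketched as follows. This is an illustration of the pattern, not a real dataflow runtime; the struct and function names are invented for the sketch.

```c
#include <assert.h>
#include <string.h>

/* Sketch: outputs of an ending task are delivered to a connected
 * task-creator, which creates a new task only once every one of its input
 * ports has received a value. */
#define N_PORTS 2

typedef struct {
    int ready[N_PORTS];     /* which input ports have received a value */
    int inputs[N_PORTS];    /* the values delivered so far */
    int tasks_created;      /* count of tasks this creator has spawned */
} TaskCreator;

static void deliver_input(TaskCreator *tc, int port, int value) {
    tc->inputs[port] = value;
    tc->ready[port]  = 1;
    for (int p = 0; p < N_PORTS; p++)
        if (!tc->ready[p]) return;       /* not all conditions met yet */
    tc->tasks_created++;                 /* all inputs ready: create a task */
    memset(tc->ready, 0, sizeof tc->ready);  /* await the next full set */
}
```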
   1.298 -
   1.299 -
   1.300 -
    1.301 +
    1.302 +
    1.303 +
   1.304  ?
   1.305  
   1.306  One survey[] discusses DSLs for a variety of domains, and this list of DSLs was copied from their paper:
   1.307 @@ -264,32 +376,120 @@
   1.308  \end{itemize}
   1.309  
   1.310  \subsection{Summary of Section}
   1.311 -[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]]
   1.312 -[[Claims: modularize runtime, mod is fund patterns, mod sep lang logic from RT internals, mod makes internal reusable & lang inherit internal perf tune & inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.313 + [[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   1.314 +
   1.315 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
   1.316 +
   1.317 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
   1.318 +
   1.319 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
   1.320 +
   1.321 +
   1.322 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.323  
   1.324  This section illustrated the promise of DSLs for solving the issues with parallel programming. The HWSim example  showed that well designed parallel DSLs can actually improve productivity, and have a low learning curve, as well as reduce the need for touching application code when moving to new target hardware.  The section showed that the effort of implementing an embedded style DSL is mainly that of implementing its runtime and complex domain constructs, and that a well-designed DSL captures most of the performance-critical aspects of an application inside the DSL constructs. Hence, porting effort reduces to just performance-tuning the language (with caveats for some hardware). This effort is, in turn, reused by all the applications that use the DSL.
   1.325  
    1.326  The stumbling point of DSLs is the small number of users; after all, how many people write hardware simulators? Perhaps a few thousand people a year write or modify applications suitable for HWSim. That means the effort to implement HWSim has to be so low as to make it no more effort than writing a library, effectively a small percentage of a simulator project.  
   1.327  
   1.328 -The runtime is a major piece, so reducing the effort of implementing the runtime goes a long way to reducing the effort of implementing a new DSL. 
   1.329 +The runtime is a major piece of the DSL implementation, so reducing the effort of implementing the runtime goes a long way to reducing the effort of implementing a new DSL. 
   1.330  
   1.331  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   1.332 -\section{The Idea}
   1.333 +\section{Description}
   1.334  \label{sec:idea}
   1.335 -[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]]
   1.336 -[[Claims: modularize runtime, mod is fund patterns, mod sep lang logic from RT internals, mod makes internal reusable & lang inherit internal perf tune & inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.337 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   1.338 +
   1.339 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
   1.340 +
   1.341 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
   1.342 +
   1.343 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
   1.344 +
   1.345 +
   1.346 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.347 +
   1.348 +?
   1.349   
   1.350  
   1.351 -Now that we have made the case that embedded style DSLs have potential to solve many parallel programming issues, and that the implementation effort is mainly that of the runtime and the complex constructs, which is currently too high, we go into this paper's contribution, the proto-runtime concept, which reduces the effort of implementing the DSL runtime, and porting its constructs across hardware.  We will show how the proto-runtime approach accomplishes this via modularizing runtime code, separating out the reusable language-independent portion,  and then using effective patterns for the language-specific portions that plug-in to the reusable piece via an interface.
    1.352 +Now that we have made the case that embedded style DSLs have potential to solve many parallel programming issues, and that a major obstacle to their uptake is the implementation effort, we describe the proto-runtime concept and show how it addresses this obstacle. As shown, embedded style DSL implementation and porting effort is mainly that of creating the runtime and implementing the more complex language constructs. We show here that the proto-runtime approach dramatically reduces the effort of creating a DSL runtime, through a number of features.
   1.353  
   1.354 -To motivate the concept, we introduce patterns that a proto-runtime embodies, that appear to be common to all languages and execution models.  We show the basic structure of a synchronization construct, and point it out within the code that implements the mutex construct. 
   1.355  
   1.356 -=======
   1.357 +\begin{figure}[ht]
   1.358 +  \centering
   1.359 +  \includegraphics[width = 2in, height = 1.8in]{../figures/PR_three_pieces.pdf}
   1.360 +  \caption{Shows how the proto-runtime approach modularizes the implementation of a runtime. The three pieces are the proto-runtime implementation, an implementation of the language construct behaviors, and an implementation of the portion of a scheduler that chooses which work is assigned to which processor. }
   1.361 +  \label{fig:PR_three_pieces}
   1.362 +\end{figure}
   1.363  
   1.364 -Big idea: 
   1.365 --- 3 parts of a runtime, pull two out, replace with interface.  
   1.366  
   1.367 --- To make a language, supply the two parts, plug in via the interface.  
    1.368 +The main feature is the proto-runtime's approach to modularizing the runtime code. As shown in Fig \ref{fig:PR_three_pieces}, it breaks the runtime into three pieces: a cross-language piece, which is the proto-runtime implementation; a piece that implements the language's constructs and plugs into the proto-runtime; and a piece that assigns work onto the hardware and also plugs into the proto-runtime.
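The three-piece split can be sketched in C as the proto-runtime core holding two plugged-in functions, one for construct behavior and one for work assignment. All names here are illustrative, not the proto-runtime's actual interface.

```c
#include <assert.h>

/* Sketch: the proto-runtime core owns the loop; a language plugs in two
 * functions -- one implementing construct behavior, one choosing which
 * work goes to which processor. */
typedef struct {
    int (*handle_construct)(int construct_id);  /* language-construct logic */
    int (*assign_work)(int n_cores);            /* work-to-core assigner   */
} LangPlugin;

/* Core behavior: run the construct logic; if it freed up work, ask the
 * assigner where to place it. */
static int runtime_step(LangPlugin *p, int construct_id, int n_cores) {
    int freed = p->handle_construct(construct_id);
    return freed ? p->assign_work(n_cores) : -1;
}

/* Trivial example plugin */
static int my_handler(int id) { return id > 0; }  /* "work became free" */
static int my_assigner(int n) { return n - 1; }   /* pick the last core */
```

The point of the split is that runtime_step, and everything below it, is reused unchanged by every language.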
   1.369 +
   1.370 +The modularization appears to remain valid across parallel languages and execution models, and we present underlying patterns that support this observation.  We analyze the basic structure of a synchronization construct, and point  out how the proto-runtime modularization is consistent with it.
   1.371 +
   1.372 +\subsection{Creating an eDSL}
   1.373 +
   1.374 +
   1.375 +\begin{figure}[ht]
   1.376 +  \centering
   1.377 +  \includegraphics[width = 2in, height = 1.8in]{../figures/eDSL_two_pieces.pdf}
   1.378 +  \caption{An embedded style DSL consists of two parts: a runtime and a wrapper library that invokes the runtime}
   1.379 +  \label{fig:eDSL_two_pieces}
   1.380 +\end{figure}
   1.381 + 
    1.382 +As shown in Fig \ref{fig:eDSL_two_pieces}, creating an embedded style DSL (eDSL) involves two things: creating the runtime, and creating a wrapper-library that invokes the runtime and also implements the more complex language constructs.
   1.383 +
    1.384 +As seen in Fig X, a library call that invokes a language construct is normally a thin wrapper whose only job is to communicate with the runtime. It places information to be sent to the runtime into a carrier, then invokes the runtime by suspending the base language execution and switching the processor over to the runtime code.
   1.385 +
   1.386 +\subsection{The Proto-Runtime Modularization}
   1.387 +
   1.388 +\subsubsection{Dispatch pattern}
   1.389 +-- standardizes runtime code
   1.390 +-- makes familiar going from one lang to another
   1.391 +-- makes reuse realistic, as demonstrated by VSs taking SSR constructs
   1.392 +
   1.393 +-- show the enums, and the switch table
   1.394 +
   1.395 +-- point out how the handler receives critical info -- the semEnv, req struct and calling slave
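The enum-plus-switch shape the notes above call for might look like the following. This is a sketch only: SemEnv, SlaveVP, the request types, and the handler names are illustrative, though the pattern of passing the language environment, the request struct, and the calling slave to each handler is the one described.

```c
#include <assert.h>

/* Opaque stand-ins for the proto-runtime's objects */
typedef struct SemEnv  SemEnv;
typedef struct SlaveVP SlaveVP;

/* Enum of the language's request types, one per construct */
typedef enum { REQ_SEND_COMM, REQ_END_BEHAVIOR } ReqType;
typedef struct { ReqType reqType; } SemReq;

/* Each construct's code stays cleanly separate in its own handler; every
 * handler receives the critical info: request, calling slave, and semEnv. */
static int handle_send_comm(SemReq *r, SlaveVP *vp, SemEnv *env)
  { (void)r; (void)vp; (void)env; return 1; }
static int handle_end_behavior(SemReq *r, SlaveVP *vp, SemEnv *env)
  { (void)r; (void)vp; (void)env; return 2; }

/* The dispatch switch: one entry point per language */
static int dispatch(SemReq *req, SlaveVP *callingVP, SemEnv *semEnv) {
    switch (req->reqType) {
        case REQ_SEND_COMM:    return handle_send_comm(req, callingVP, semEnv);
        case REQ_END_BEHAVIOR: return handle_end_behavior(req, callingVP, semEnv);
    }
    return 0;
}
```

Because every language's dispatch has this same shape, moving from one language implementation to another stays familiar, and handlers can be lifted from one language into the next.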
   1.396 +
   1.397 +\subsubsection{The Request Handler}
   1.398 +-- cover what a request handler does.. connect it to the wrapper lib, and the info loaded into a request struct.
   1.399 +
   1.400 +-- give code of a request handler.. within on-going example of implementing pthreads, or possibly HWSim, or pick a new DSL 
   1.401 +
   1.402 +\subsection{Exporting a performance-oriented machine view }
   1.403 +The proto-runtime interface exports a view of the machine that shows performance-critical aspects.  Machines that share the same architectural approach have the same performance-critical aspects, and differ only in the values. 
   1.404 +
   1.405 +For example, cache-coherent shared-memory architectures can be modelled as a collection of memory pools connected by networks.  The essential variations among processor-chips are the sizes of the pools, the connections between them, such as which cores share the same L2 cache, and the latency and bandwidth between them.
   1.406 +
   1.407 +Hence, a single plugin can be written that gathers this information from the proto-runtime and uses it when deciding which work to assign to which core.  Such a plugin will then be efficient across all machines that share the same basic architecture.
   1.408 +
   1.409 +This saves significant effort by allowing the same plugin to be reused for all the machines in the category.
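Such a plugin might use the exported view as sketched below. The view here is reduced to "which L2 group each core belongs to"; a real machine view would also carry pool sizes, latencies, and bandwidths, and all names are illustrative.

```c
#include <assert.h>

/* Assumed shape of a machine view exported by the proto-runtime */
typedef struct {
    int        n_cores;
    const int *l2_group_of_core;   /* core id -> L2 group id */
} MachineView;

/* Prefer an idle core that shares an L2 with the core that produced the
 * data, so the work finds its inputs already cached. */
static int choose_core(const MachineView *mv, int producer_core,
                       const int *is_idle) {
    int want = mv->l2_group_of_core[producer_core];
    for (int c = 0; c < mv->n_cores; c++)
        if (is_idle[c] && mv->l2_group_of_core[c] == want)
            return c;                    /* cache-warm placement */
    for (int c = 0; c < mv->n_cores; c++)
        if (is_idle[c]) return c;        /* fall back to any idle core */
    return -1;                           /* nothing idle: keep work queued */
}
```

Because only the values in the view change from one cache-coherent shared-memory machine to the next, this one plugin covers the whole category.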
   1.410 + 
   1.411 +\subsection{Services Provided by the Proto-runtime}
   1.412 +
   1.413 +-- Put services into the low-level piece..  plugins have those available, and inherit lang independent such as debugging, perf counters..  provides effort reduction because lang doesn't have to implement these services.
   1.414 +
    1.415 +-- -- examples of inherited lang services inside current proto-runtime: debugging and perf-tuning..  verification, playback have been started (?)
   1.416 +
   1.417 +-- -- examples of plugin services: creation of base VP, the switch primitives, the dispatch pattern (which reduces effort by cleanly separating code for each construct), handling consistency model (?), handling concurrency
   1.418 +
   1.419 +\subsection{eDSLs talking to each other}
   1.420 +-- show how VSs is example of three different DSLs, and H264 code is three different languages interacting (pthreads, OpenMP, StarSs)
   1.421 +
   1.422 +-- make case that proto-runtime is what makes this practical !  Their point of interaction is the common proto-runtime innards, which provides the interaction services.. they all use the same proto-runtime, and all have common proto-runtime objects, which is how the interaction becomes possible.
   1.423 +
   1.424 +\subsection{The Proto-runtime Approach Within the Big Picture}
   1.425 +
   1.426 +-- Give background on industry-wide, how have langs times machines..  
   1.427 +-- say that proto-runtime has synergistic advantages within this context. -- repeat that eDSLs talk to each other.
   1.428 +-- give subsubsection on MetaBorg for rewriting eDSL syntax into base lang syntax.
   1.429 +-- bring up the tools issue with custom syntax -- compiling is covered by metaborg re-writing..  can address debugging with eclipse.. should be possible in straight forward way that covers ALL eDSLs.. their custom syntax being stepped through in one window, and stepping through what they generate in separate window (by integrating generation step into eclipse).. even adding eclipse understanding of proto-runtime.. so tracks the sequence of scheduling units..  and shows the request handling in action in third window..
   1.430 + 
   1.431 +Preview idea that many players will contribute, and will get people that specialize in creating new eDSLs (such as one of authors)..
   1.432 +-- For them, code-reuse is reality, as supported by VSs example, 
   1.433 +-- and the uniformity of the pattern becomes familiar, also speeding up development, as also supported by VSs, HWSim, VOMP, and DKU examples.
   1.434 +-- for those who only create a single eDSL, the pattern becomes a lowering of the learning curve, aiding adoption
   1.435 +
   1.436 +-- Restate and summarize the points below (covered above), showing how they combine to shrink the wide-spot where all the runtimes are. 
   1.437  
   1.438  -- The low-level part implemented on each machine, exports a view of the machine that shows performance-critical aspects
   1.439  
   1.440 @@ -297,9 +497,8 @@
   1.441  
   1.442  -- Put services into the low-level piece..  plugins have those available, and inherit lang independent such as debugging..  provides effort reduction because lang doesn't have to implement these services.
   1.443  
   1.444 --- -- examples of iherited lang services inside current proto-runtime: debugging and perf-tuning..  verification, playback have been started (?)
   1.445  
   1.446 --- -- examples of plugin services: creation of base VP, the switch primitives, the dispatch pattern (which reduces effort by cleanly separating code for each construct), handling consistency model (?), handling concurrency
   1.447 +\section{(outline and notes)}
   1.448  
   1.449  -- What a plugin looks like: 
   1.450  
   1.451 @@ -319,17 +518,50 @@
   1.452  
   1.453  \subsection{The Cross-language Patterns Behind the Proto-runtime}
   1.454  
   1.455 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   1.456 +
   1.457 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
   1.458 +
   1.459 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
   1.460 +
   1.461 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
   1.462 +
   1.463 +
   1.464 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.465 +
    1.466  An application switches to the runtime, which does scheduling work and then switches back to application code.
   1.467  
   1.468  
   1.469  \subsection{Some Definitions}
   1.470  
   1.471 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   1.472 +
   1.473 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
   1.474 +
   1.475 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
   1.476 +
   1.477 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
   1.478 +
   1.479 +
   1.480 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.481 +
    1.482  We adopt the concepts of work-unit, virtual processor (VP), animation, and tie-point as discussed in a previous paper []. A work-unit is the trace of instructions executed between two successive switches to the runtime, along with the data consumed and produced during that trace.  A Virtual Processor is defined as being able to animate either the code of a work-unit or else another VP, and has state that it uses during animation, organized as a stack.  Animation is defined as causing the time of a virtual processor to advance, which is equivalent to causing state changes according to instructions, while suspension halts animation, and consequently causes the end of a work-unit (a more complete definition of animation can be found in the dissertation of Halle[]).  A tie-point connects the end of one work-unit to the beginning of one in a different VP; thus a tie-point represents a causal relationship between two work-units, and establishes an ordering between them, effectively tying the time-line of the VP animating one to the time-line of the VP animating the other.
   1.483  
   1.484  In addition, we introduce a definition of the word task, which is a single work-unit coupled to a virtual-processor that comes into existence to animate the work-unit and dissipates at completion of the work-unit.  By definition of work-unit, a task cannot suspend, but rather runs to completion.  If the language defines an entity that has a timeline that can be suspended by switching to the runtime, then such an entity is not a task. Pure Dataflow[] specifies tasks that fit our definition.
   1.485  
   1.486  \subsection{Handling Memory Consistency Models}
   1.487  
   1.488 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   1.489 +
   1.490 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
   1.491 +
   1.492 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
   1.493 +
   1.494 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
   1.495 +
   1.496 +
   1.497 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.498 +
   1.499  Weak memory models can cause undesired behavior when work-units on different cores communicate through shared variables.  Specifically, the receiving work-unit can see memory operations complete in a different order than the code of the sending work-unit specifies.
   1.500  
   1.501  For example, consider a proto-runtime implemented on shared memory hardware that has a weak consistency model, along with a language that implements a traditional mutex lock.  All memory operations performed in the VP that releases the lock should be seen as complete by the VP that next acquires the lock.  
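As a sketch of what the lock implementation must do on such hardware, assuming C11 atomics (the proto-runtime's actual code may differ), a release-store on unlock paired with an acquire on the next lock gives exactly the guarantee above: every memory operation performed before the release is seen as complete by the VP that next acquires.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative spin-lock (ours, not the proto-runtime's code) whose
 * release/acquire pairing forces the ordering described above, even on
 * weakly ordered hardware. */

static atomic_int lock_word = 0;
static int shared_data = 0;           /* plain variable protected by the lock */

static void lock_acquire(void) {
    int expected = 0;
    /* acquire ordering: operations after this cannot move before it */
    while (!atomic_compare_exchange_weak_explicit(&lock_word, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        expected = 0;
}

static void lock_release(void) {
    /* release ordering: operations before this cannot move after it */
    atomic_store_explicit(&lock_word, 0, memory_order_release);
}
```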
   1.502 @@ -350,9 +582,16 @@
   1.503  =================
   1.504  
   1.505  \subsection{The patterns}
   1.506 -[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]]
   1.507 -[[Claims: modularize runtime, mod is fund patterns, mod sep lang logic from RT internals, mod makes internal reusable & lang inherit internal perf tune & inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.508 -[[[mod is fund patterns]]]
   1.509 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   1.510 +
   1.511 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
   1.512 +
   1.513 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
   1.514 +
   1.515 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
   1.516 +
   1.517 +
   1.518 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.519  
   1.520  
   1.521  Soln: modularize the runtime, to reduce the part a language implementor has to touch, hide the part that has low-level details, reuse low-level tuning effort, and reuse language-specific parts.
   1.522 @@ -368,8 +607,16 @@
   1.523  
   1.524  
   1.525  \subsubsection{Views of synchronization constructs}
   1.526 -[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]]
   1.527 -[[Claims: modularize runtime, mod is fund patterns, mod sep lang logic from RT internals, mod makes internal reusable & lang inherit internal perf tune & inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.528 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   1.529 +
   1.530 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
   1.531 +
   1.532 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
   1.533 +
   1.534 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
   1.535 +
   1.536 +
   1.537 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.538  
   1.539  One view of sync constructs is that they are variable-length calls; the basic hardware does this by stalling the pipeline.
   1.541 @@ -381,8 +628,16 @@
   1.542  Another way to think of a sync construct is that it enforces sharp communication boundaries.  The multiple read and write operations are treated as a single communication with the shared state.  If any other part of the application sees only part of the communication, it sees something inconsistent, and thus wrong.  So sync constructs ensure that communications are complete, and that the parts of the application see only complete communications from other parts.
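A small illustration of a sharp communication boundary, using C11 atomics (all names here are ours): the sender composes the complete communication privately, then publishes it in a single release-store, so no observer can ever see a partial communication.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* The two fields below form one communication with shared state,
 * with the invariant x == y.  Partial state exists only in the
 * sender's private copy, inside the construct's boundary. */

typedef struct { int x, y; } Message;            /* invariant: x == y */

static _Atomic(Message *) shared_msg = NULL;

static void send_communication(int v) {
    Message *m = malloc(sizeof *m);
    m->x = v;                                    /* partial state is private */
    m->y = v;                                    /* ... until publication */
    atomic_store_explicit(&shared_msg, m, memory_order_release);
}

static int receive_is_consistent(void) {
    Message *m = atomic_load_explicit(&shared_msg, memory_order_acquire);
    return m == NULL || m->x == m->y;            /* whole message, or nothing */
}
```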
   1.543  
   1.544  \subsubsection{Universal Runtime Patterns}
   1.545 -[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]]
   1.546 -[[Claims: modularize runtime, mod is fund patterns, mod sep lang logic from RT internals, mod makes internal reusable & lang inherit internal perf tune & inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.547 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   1.548 +
   1.549 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
   1.550 +
   1.551 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
   1.552 +
   1.553 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
   1.554 +
   1.555 +
   1.556 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.557  
   1.558  Unified pattern within parallel languages: create multiple timelines, then control their relative progress, and control the location at which each chunk of progress takes place.
   1.559  
   1.560 @@ -398,8 +653,16 @@
   1.561  Note a few implications: first, many activities internal to the runtime are part of a unit's life-line, and take place when only the meta-unit exists, before or after the work of the actual unit; second, communication that is internal to the runtime, such as state updates, is part of the unit's life-line; third, creation may be implied, as in pthreads, triggered, as in dataflow, or by explicit command, as in StarSs, and once created, a meta-unit may languish before the unit it represents is free to be animated.
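The life-line can be sketched as a state machine; the state names below are ours, invented for illustration.

```c
#include <assert.h>

/* Sketch of a unit's life-line.  The meta-unit exists before and after
 * the work of the actual unit, which is when runtime-internal activity
 * on its behalf takes place. */

typedef enum {
    META_CREATED,      /* creation: implied, triggered, or explicit */
    META_BLOCKED,      /* languishing: unit not yet free to be animated */
    META_READY,        /* constraints satisfied; awaiting a core */
    UNIT_ANIMATING,    /* the actual unit's work is in progress */
    META_COMPLETE      /* work done; runtime-internal updates follow */
} LifeLineState;

/* Legal forward transitions along the life-line. */
static int can_advance(LifeLineState from, LifeLineState to) {
    switch (from) {
    case META_CREATED:   return to == META_BLOCKED || to == META_READY;
    case META_BLOCKED:   return to == META_READY;
    case META_READY:     return to == UNIT_ANIMATING;
    case UNIT_ANIMATING: return to == META_COMPLETE;
    default:             return 0;
    }
}
```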
   1.562  
   1.563  \subsubsection{Putting synchronization constructs together with universal runtime patterns}
   1.564 -[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]]
   1.565 -[[Claims: modularize runtime, mod is fund patterns, mod sep lang logic from RT internals, mod makes internal reusable & lang inherit internal perf tune & inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.566 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   1.567 +
   1.568 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
   1.569 +
   1.570 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
   1.571 +
   1.572 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
   1.573 +
   1.574 +
   1.575 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.576  
   1.577  Putting these together gives us that any parallelism construct that has synchronization behavior causes the end of a work-unit and a switch to the runtime.  The code following the construct is a different work-unit, which will begin after the constraint implied by the construct is satisfied.
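A minimal sketch of this pattern, with invented names (not the proto-runtime API): the construct wrapper ends the work-unit by recording a request and suspending, and language logic inside the runtime decides when the constraint is satisfied, freeing the next work-unit to begin.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: one outstanding construct request. */
typedef struct {
    int   kind;          /* which construct was invoked */
    void *args;          /* construct-specific arguments */
    int   satisfied;     /* set by language logic in the runtime */
} ConstructRequest;

static ConstructRequest pending;

/* Called from application code; marks the end of the current work-unit. */
static void invoke_construct(int kind, void *args) {
    pending.kind = kind;
    pending.args = args;
    pending.satisfied = 0;
    /* real code would switch stacks to the runtime here */
}

/* Runtime side: language logic decides when the constraint holds. */
static int runtime_handle(ConstructRequest *r) {
    r->satisfied = 1;    /* trivially satisfied in this sketch */
    return r->satisfied; /* next work-unit may now be animated */
}
```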
   1.578  
   1.579 @@ -429,6 +692,17 @@
   1.580  Demonstrate Benefits: lang impl doesn't touch low-level details, inherits centralized services (debug support), reuses code from other languages to add features.
   1.581  
   1.582  \subsection{Reuse of Language Logic}
   1.583 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
   1.584 +
   1.585 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
   1.586 +
   1.587 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
   1.588 +
   1.589 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
   1.590 +
   1.591 +
   1.592 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
   1.593 +
   1.594 +Demonstrate reuse of language logic:
   1.595 +All the languages have copied the singleton, atomic, critical-section, and transaction constructs. VOMP took the task code from VSS; VSS took the send and receive code from SSR; for DKU, we took the code almost verbatim from an earlier incarnation of these ideas and welded it into SSR, and we took VSs tasks and put them into SSR. Thus the circle completes: VSs took from SSR, and now SSR takes from VSs. Pieces and parts are being borrowed all over the place and welded in where they are needed.
   1.596   
   1.597 @@ -563,7 +837,7 @@
   1.598  
   1.599  
   1.600  \end{document} 
   1.601 -
   1.602 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   1.603  Here is an example of netlist creation:
   1.604  
   1.605  The circuit has two elements, each with one input port, one output port, and a single activity-type. The elements are cross-coupled, so the output port of one connects to the input port of the other.  Each input port has the activity-type attached as its trigger.  The activity is empty, and just sends a NULL message on the output port.  The activity's duration in simulated time and the resulting communication's flight duration in simulated time are both constants.
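That netlist might be sketched in code as follows; every type and function name here is invented for illustration, not an actual simulator API.

```c
#include <assert.h>
#include <stddef.h>

struct Element;
typedef void (*ActivityFn)(struct Element *self);

typedef struct Element {
    struct Element *out_connects_to;  /* output port wired to peer's input */
    ActivityFn      trigger;          /* activity-type on the input port */
    int             messages_seen;
} Element;

/* The two constant simulated-time costs mentioned above. */
enum { ACTIVITY_DURATION = 1, FLIGHT_DURATION = 1 };

/* Deliver a (NULL) message to an element's input port, firing its trigger. */
static void deliver(Element *e) {
    e->messages_seen++;
    if (e->trigger) e->trigger(e);
}

/* The empty activity: just send NULL on the output port.  (A real
 * simulator would also advance simulated time by the two constants;
 * this sketch only follows the wiring once around the loop.) */
static void empty_activity(Element *self) {
    if (self->out_connects_to->messages_seen == 0)
        deliver(self->out_connects_to);
}

/* Cross-couple two elements: each output feeds the other's input. */
static void build_netlist(Element *a, Element *b) {
    a->out_connects_to = b;       b->out_connects_to = a;
    a->trigger = empty_activity;  b->trigger = empty_activity;
}
```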