## VMS/0__Writings/kshalle

### changeset 103:26b697944e73

dealing with switching disks -- commit and push as backup
author Sean Halle Tue, 17 Sep 2013 06:34:05 -0700 15f50e49ebb7 a42086520679 0__Papers/PRT/PRT__Tutorial_explanation_teaching/helpers/07_F_26__The_Questions__blank.txt 0__Papers/PRT/PRT__Tutorial_explanation_teaching/latex/PRT__tutorial_explanation_learning.tex 0__Papers/PRT/PRT__Tutorial_explanation_teaching/latex/Paper_Design_2.txt 3 files changed, 2702 insertions(+), 0 deletions(-) [+]
line diff
     1.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
1.2 +++ b/0__Papers/PRT/PRT__Tutorial_explanation_teaching/helpers/07_F_26__The_Questions__blank.txt	Tue Sep 17 06:34:05 2013 -0700
1.3 @@ -0,0 +1,100 @@
1.4 +
1.5 +
1.6 +1) What are the problems the authors are trying to solve?
1.7 +    When done, for each problem, how does one decide the value of a proposed solution?  Suggest a priority domain for deciding whether to use a proposed solution.
1.8 +
1.9 +The problem is
1.10 +
1.11 +A priority domain for deciding the value of some proposed solution to this problem is
1.12 +
1.13 +The value of this solution is determined by
1.14 +
1.15 +
1.16 +
1.17 +2) What "things" does the proposed solution to this problem enable?
1.18 +     What benefit to the reader is brought by each "thing", and what, related to the "thing", gives the benefit.
1.19 +     What details are unique about the proposed solution that enables the thing that gives benefit?
1.20 +     How does that uniqueness enable or achieve the thing?
1.21 +
1.22 +It enables
1.23 +
1.24 +The benefit to me is
1.25 +
1.26 +Unique details of the solution that enable the thing that gives the benefit are
1.27 +
1.28 +The uniqueness enables the thing that gives benefit by
1.29 +
1.30 +
1.31 +
1.32 +3) What are the fundamentals underlying the problem?
1.33 +     What makes this problem hard?
1.34 +     What are the basic elements and forces of the problem that the proposed solution has to be in terms of, avoid, or use to advantage? e.g.: gravity, invariant relationships, market forces, human capacity (average level of real programmers, hubris, legacy being held onto, barriers to adoption), and so on
1.35 +How does the proposed solution work within/relate to/address/take advantage of/deal with the fundamentals underlying the problem?
1.36 +
1.37 +The fundamentals are
1.38 +
1.39 +The hard part is
1.40 +
1.41 +The basic elements are
1.42 +
1.43 +The proposed solution
1.44 +
1.45 +
1.46 +
1.47 +4) What are other approaches and conventional wisdom to solving these problems?
1.48 +    What benefits enabled by the proposed solution are not enabled by other work, and vice versa?
1.49 +    How does each approach address something the others miss?
1.50 +    Try to suggest groupings or categories for the various approaches.
1.51 +    Try to suggest ways multiple approaches may be combined to get more pros with fewer cons.
1.52 +
1.53 +Other approaches are
1.54 +
1.55 +A benefit enabled by the proposed solution that is not enabled by other work is
1.56 +
1.57 +Categories:
1.58 +
1.59 +Combining:
1.60 +
1.61 +
1.62 +
1.63 +5) What is/are the unique main "things" that enable what the proposed solution does?
1.64 +    Sketch the details of each of these "things".
1.65 +    Did you detect any drawbacks, not stated in the paper, from the details?
1.66 +    Did you see any really cool techniques?
1.67 +
1.68 +Unique main "things" are
1.69 +
1.70 +Drawbacks from details:
1.71 +
1.72 +Idea of
1.73 +
1.74 +
1.75 +
1.76 +6) What aspects of the implementation/proof/design need results given in order to convince you that the proposed solution delivers the stated benefits?
1.77 +
1.78 +They have to show
1.79 +
1.80 +
1.81 +
1.82 +7) What results did they show?
1.83 +       Did they show results in all the needed aspects (which were left out)?
1.84 +       Were the testing method and results shown good enough to convince you?
1.85 +       Did you detect any cons, not stated in the paper, from the results?
1.86 +
1.87 +They showed
1.88 +
1.89 +Con..
1.90 +
1.91 +
1.92 +
1.93 +8) How do you think this work may provide some value to you in your future research?
1.94 +
1.95 +The work may provide value for me by
1.96 +
1.97 +
1.98 +
1.99 +3 or more comments/questions:  (Pick out the most important things to you from the discussion you gave above, or add things that were not brought out by the above questions.  I am asking for these as things to bring up during class.)
1.100 +
1.101 +1)
1.102 +
1.103 +
1.104 \ No newline at end of file

     2.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
2.2 +++ b/0__Papers/PRT/PRT__Tutorial_explanation_teaching/latex/PRT__tutorial_explanation_learning.tex	Tue Sep 17 06:34:05 2013 -0700
2.3 @@ -0,0 +1,2570 @@
2.4 +%-----------------------------------------------------------------------------
2.5 +%
2.6 +%               Template for sigplanconf LaTeX Class
2.7 +%
2.8 +% Name:         sigplanconf-template.tex
2.9 +%
2.10 +% Purpose:      A template for sigplanconf.cls, which is a LaTeX 2e class
2.11 +%               file for SIGPLAN conference proceedings.
2.12 +%
2.13 +% Guide:        Refer to "Author's Guide to the ACM SIGPLAN Class,"
2.14 +%               sigplanconf-guide.pdf
2.15 +%
2.16 +% Author:       Paul C. Anagnostopoulos
2.17 +%               Windfall Software
2.18 +%               978 371-2316
2.19 +%               paul@windfall.com
2.20 +%
2.21 +% Created:      15 February 2005
2.22 +%
2.23 +%-----------------------------------------------------------------------------
2.24 +
2.25 +
2.26 +\documentclass[preprint]{sigplanconf}
2.27 +
2.28 +% The following \documentclass options may be useful:
2.29 +%
2.30 +% 10pt          To set in 10-point type instead of 9-point.
2.31 +% 11pt          To set in 11-point type instead of 9-point.
2.32 +% authoryear    To obtain author/year citation style instead of numeric.
2.33 +\usepackage{amssymb,graphicx,calc,ifthen,subfig,dblfloatfix,fixltx2e}
2.34 +
2.35 +
2.36 +% correct bad hyphenation here
2.37 +\hyphenation{op-tical net-works semi-conduc-tor}
2.38 +
2.39 +\usepackage{wasysym}
2.40 +\usepackage{amstext}
2.41 +
2.42 +\begin{document}
2.43 +
2.44 +\bibliographystyle{plain}
2.45 +%
2.46 +
2.47 +\conferenceinfo{WXYZ '05}{date, City.}
2.49 +\copyrightdata{[to be supplied]}
2.50 +
2.51 +\titlebanner{banner above paper title}        % These are ignored unless
2.52 +\preprintfooter{short description of paper}   % 'preprint' option specified.
2.53 +
2.54 +
2.55 +\title{ The Proto-Runtime Abstraction Applied to the
2.56 +Construction
2.57 +of Parallel Domain Specific Languages}
2.58 +
2.59 +
2.60 +\authorinfo{Sean Halle}
2.61 +           {Open Source Research Institute, INRIA,
2.62 +           and TU Berlin}
2.63 +           {seanhalle@opensourceresearchinstitute.org}
2.64 +\authorinfo{Merten Sach}
2.65 +           {TU Berlin}
2.66 +           {msach@mailbox.tu-berlin.de}
2.67 +\authorinfo{Albert Cohen}
2.68 +           {Ecole Normale Superieure, and INRIA}
2.69 +           {albert.cohen@inria.fr}
2.70 +
2.71 +\maketitle
2.72 +
2.73 +
2.74 +\begin{abstract}
2.75 +
2.76 +
2.77 +
2.78 +Domain Specific Languages that are embedded into a base language hold promise to provide productivity, performance portability, and wide adoption for parallel programming. However, such languages have too few users to support the large effort required to create them and port them across hardware platforms, resulting in low adoption of the method.
2.79 +As one step to ameliorate this, we apply the proto-runtime approach, which reduces the effort to create and port the runtime systems of parallel languages. It modularizes the creation of runtime systems and the parallelism constructs they implement, by providing an interface
2.80 +that separates the language-construct and scheduling logic from the low-level runtime details, including concurrency, memory consistency, and runtime-performance aspects.
2.81 +As a result, new parallel constructs are written using sequential reasoning, multiple languages can be mixed within
2.82 +the same program, and reusable services such as performance
2.83 +tuning and debugging
2.84 +support are available. In addition, scheduling of work onto hardware is under language and application control, without interference from an underlying thread package scheduler. This enables higher quality scheduling decisions for higher application performance.
2.85 +We present measurements of the time taken to develop runtimes for new languages, as well as the time to re-implement existing ones, which average a few days each.  In addition, we measure the performance of implementations
2.86 +based on proto-runtime, going head-to-head with the standard distributions of Cilk, StarSs (OmpSs), and POSIX threads, showing that the proto-runtime matches or outperforms them on large servers in all cases.
2.87 +
2.88 +
2.89 +
2.90 +
2.91 +In short, the approach replaces language-specific runtime internals with a common interface, centralizes services, minimizes the effort to create a language, and gives each language control over hardware assignment, with side benefits of multi-language mixing, performance tuning, and debugging support.\end{abstract}
2.92 +
2.93 +
2.94 +
2.95 +
2.96 +
2.97 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2.98 +\section{Background and Motivation}
2.99 +\label{sec:intro}
2.100 +
2.101 +[Note to reviewers: this paper's style and structure follow the official PPoPP guide to writing style, which is linked from the PPoPP website. We are taking on faith that the approach has been communicated effectively to reviewers and that we won't be penalized for following its recommended structure and approach.]
2.102 +
2.103 +As  hardware  becomes increasingly parallel, programming must also
2.104 +become parallel.  However,  the transition from sequential to parallel programming has been slow due to  the difficulty of the traditional parallel programming methods.
2.105 +
2.106 +The main difficulties with parallel programming are: 1) a difficult mental model, which reduces productivity, 2) the additional effort to rewrite the code for each hardware target to get acceptable performance, and 3) disruption to existing practices, including a steep learning curve, changes to the tools used, and changes in work practices.
2.107 +
2.108 +Many believe that these difficulties can be overcome with the use of embedded-style parallel Domain-Specific Languages (epDSLs) []. epDSL language
2.109 +constructs match the mental model of the domain, while
2.110 +they internally imply parallelism. For example, a simulation
2.111 +epDSL called HWSim[] has only 10 constructs, which match
2.112 +the actions taken during a simulation
2.113 +of interacting objects.  They are mixed into sequential C code and take
2.114 +only a couple of hours to learn.  Yet they encapsulate subtle
2.115 +and complex dependencies that relate simulated time
2.116 +to the physical time in the machine. They encapsulate the parallelism
2.117 +present, while simultaneously making the implementation
2.118 +simpler to think about than a purely sequential implementation.
2.119 +
2.120 +
2.121 +
2.122 + Despite this, the adoption of such languages has been slow, which we believe is due to the cost to create them and to port them across hardware targets. The small number of users of each language, which is specific to a narrow domain, makes this cost impractical.
2.123 +
2.124 +We propose that a method that makes epDSLs lower cost to produce as well as to port across hardware targets will allow them to fulfill their promise. We  show
2.125 +how to apply the proto-runtime approach to help towards this goal.
2.126 +
2.127 +In this approach, a language's runtime system is built
2.128 +as a plugin that is connected to a pre-existing proto-runtime  instance installed on given hardware. Together, the plugin
2.129 +plus proto-runtime instance form the runtime system
2.130 +of the language. The proto-runtime instance itself acts as the infrastructure of a runtime system, and
2.131 +encapsulates most of the hardware-specific details,
2.132 +while providing a number of services for use by the
2.133 +plugged in language module.
2.134 +
2.135 +A proto-runtime instance is essentially a full runtime, but with two key pieces replaced by an interface. One  piece replaced is the logic of language constructs, and the other is logic for choosing which core to assign work onto. The proto-runtime instance then supplies
2.136 +the rest of the runtime system.
2.137 +
2.138 +The decomposition, into a proto-runtime plus  plugged-in  language behaviors, modularizes the construction of runtimes.  The proto-runtime is one module, which  embodies runtime internals, which are hardware oriented and independent of language. The plugged-in portions form the two other modules, which are language specific. The interface between them   occurs at a natural boundary, which separates   the hardware oriented portion of a runtime from the language oriented portion.
2.139 +
2.140 +We claim the following benefits of the proto-runtime approach, each of which is  supported in the indicated section of  the paper:
2.141 +
2.142 +\begin{itemize}
2.143 +
2.144 +\item The proto-runtime approach modularizes the runtime (\S\ref{sec:Proposal}).
2.145 +
2.146 +%\item The modularization  is consistent with patterns that appear to be fundamental to parallel computation and runtimes (\S\ ).
2.147 +
2.148 +\item The modularization  cleanly separates hardware
2.149 +related runtime internals from the language-specific logic (\S\ref{sec:Proposal},
2.150 +\S\ref{subsec:Example}).
2.151 +
2.152 +\item The modularization gives the language control
2.153 +over timing and placement of executing work (\S\ref{sec:Proposal}).
2.154 +
2.155 +
2.156 +\item
2.157 +
2.158 +The modularization  selectively exposes hardware aspects relevant to placement of work. If the language takes advantage of this, it  can result in reduced communication between cores and increased application performance  (\S\ ).
2.159 +
2.160 +\begin{itemize}
2.161 +
2.162 +\item Similar control over hardware is not possible when the language is   built on top of a package like Posix threads or TBB, which has its own work-to-hardware assignment   (\S\ref{sec:Related}).
2.163 +
2.164 +\end{itemize}
2.165 +
2.166 +
2.167 +\item The modularization results in reduced time to implement a new language's behavior, and in reduced time to port a language to new hardware (\S\ref{sec:Proposal},
2.168 +\S\ref{subsec:ImplTimeMeas}).
2.169 +
2.170 +\begin{itemize}
2.171 +
2.172 +
2.173 +\item  Part of the time reduction is due to the proto-runtime providing common services for all languages to (re)use.  Such services include debugging facilities, automated verification, concurrency handling, dynamic performance measurements for use in assignment and auto-tuning, and so on  (\S\ ).
2.174 +
2.175 +\item Part  is due to hiding the low
2.176 +level hardware aspects inside the proto-runtime module,
2.177 +independent from language (\S \ref{sec:intro}).
2.178 +
2.179 +\item Part  is due to  reuse of the effort of performance-tuning  the runtime internals (\S ).
2.180 +
2.181 +\item  Part is due to using sequential thinking when implementing the language logic, enabled by  the proto-runtime protecting shared internal runtime state and exporting an interface that presents a sequential model  (\S\ref{subsec:Example}).
2.182 +
2.183 +
2.184 +\end{itemize}
2.185 +
2.186 +\item Modularization with similar benefits does not appear possible when using a package such as Posix threads or TBB,  unless the package itself is modified and then used  according to the proto-runtime pattern  (\S\ref{sec:Related}).
2.187 +
2.188 +
2.189 +\item The proto-runtime approach appears to future-proof language
2.190 +runtime
2.191 +construction,  because the patterns underlying proto-runtime appear to be fundamental (\S\ref{subsec:TiePoints},
2.192 +\S\ref{subsec:Example}), and so should hold for future architectures. Plugins can be reused on those architectures, although performance-related updates to the
2.193 +plugins may be desired.
2.194 +
2.195 +\end{itemize}
2.196 +
2.197 +The paper is organized as follows: We first expand on the value of embedded style parallel DSLs (epDSLs), and where the effort goes when creating one (\S\ref{subsec:eDSLEffort}). We focus on the role that  runtime implementation effort plays in the adoption of epDSLs, which motivates the value of the  savings provided by the proto-runtime approach. We then move on to the details of the proto-runtime approach (\S\ref{sec:Proposal}), and tie them to how a runtime is modularized (\S\ref{subsec:Modules}), covering how each claimed benefit is provided.
2.198 +We then show overhead measurements (\S\ref{subsec:OverheadMeas}) and implementation time measurements (\S\ref{subsec:ImplTimeMeas} ), which indicate that the proto-runtime approach is performance competitive while significantly reducing implementation and porting effort.
2.199 +With that  understanding in hand, we then discuss  how the approach compares to related work (\S\ref{sec:Related}), and finally, we highlight the main conclusions drawn from the research (\S\ref{sec:Conclusion}).
2.200 +
2.201 +
2.202 +
2.203 +
2.204 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2.205 +%
2.206 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2.207 +\section{Background: The epDSL Hypothesis}
2.208 +
2.209 +%[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
2.210 +
2.211 +%[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
2.212 +
2.213 +%[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
2.214 +
2.215 +%[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
2.216 +
2.217 +Domain Specific Languages have been around for a while [], and recently have been suggested as a good approach for parallel programming[][stanford PPL].
2.218 +
2.219 +In essence, a DSL, or just Domain Language, captures patterns that are common in a particular domain of expertise, such as user interfaces, simulations of physical systems, bio-informatics,  and so on.  Each domain has a particular set of mental models, common types of computation, and common kinds of data structures. A  DSL captures these common elements in custom syntax.
2.220 +
2.221 +
2.222 +The custom syntax can capture parallelism information while simultaneously being natural to think about. In practice, multiple aspects of domains provide opportunities for parallelism. For example, the custom data structures seen by the coder can be internally implemented with distributed algorithms; common operations in the domain can be internally implemented with parallel algorithms; and, the domain constructs often imply dependencies. All of these are gained without the programmer being aware of this implied parallelism; they just follow simple language usage rules.
2.223 +
2.224 +
2.225 +
2.226 +\subsection{Embedding a DSL into a base language}
2.227 +
2.228 +A style of domain language, which we feel has good adoption potential, is the so-called \textit{embedded} style of DSL (eDSL) [] [metaborg][stanford ppl]. In this variation, a program is written in a mix of a base sequential language plus domain language constructs. The syntax of the two is intermixed. A preprocessing step then translates the domain syntax into the base syntax, and includes calls to the domain language's runtime.
2.229 +
2.230 +
2.231 +For example, consider using C (or Java) as the base language for an application, then mixing in custom syntax from a user-interface eDSL.  To test the code, the developer modifies the build process to first perform the translation step, then pass the resulting source through the normal compiler. The resulting executable contains calls to a runtime library that is linked, at run time, to an implementation that has been tuned to the hardware.
2.232 +
2.233 +As with HWSim, the number of such embedded
2.234 +constructs tends to be low, the constructs are easy to learn, and they significantly
2.235 +reduce the complexity of the code written, all while
2.236 +implicitly specifying parallelism.
2.237 +
2.238 +Additionally, parallel versions, or epDSLs, have more than just a syntactic advantage over libraries.  The language has a toolchain that provides build-time optimization and can take advantage of relationships among distinct constructs within the code.  The relationship information allows derivation of communication patterns that inform the choice of placement of work, which is critical to performance on parallel hardware.
2.239 +\subsection{Low learning curve, high productivity, and portability}
2.240 + eDSLs tend to have a low learning curve because domain experts are already familiar with the concepts behind the language constructs, and there are relatively few constructs
2.241 +for an embedded DSL. This is especially valuable for those who are \textit{not} expert programmers. Embedded-style DSLs further reduce the learning curve because they require no new development tools nor development procedures. Together, these address the goal of a low learning curve for switching to parallel software development.
2.242 +
2.243 +Productivity has been shown to be enhanced by a well designed DSL, with studies  measuring
2.244 +10x reduction in development time [][][].  Factors
2.245 +behind this include simplifying the application code, modularizing it, and encapsulating performance aspects inside the language.  Simplifying reduces the amount of code and the amount of mental effort. Modularizing separates concerns within the code and isolates aspects, which improves productivity. Encapsulating performance aspects inside the DSL constructs removes them from the application programmer's concerns, which also improves productivity.
2.246 +
2.247 +Perhaps the most important productivity enhancement comes from hiding parallelism aspects inside the DSL constructs. The language takes advantage of the domain patterns to present a familiar mental model, and then attaches synchronization, work-division, and communication implications to those constructs, without the programmer having to be aware of them.  The simplicity, modularization, performance encapsulation, and parallelism hiding, combined with congruence with the mental model of the domain, together work towards the goal of high productivity.
2.248 +
2.249 +Portability is aided by the encapsulation of performance aspects inside the DSL constructs. The aspects   that require large amounts of computation are often pulled into the language, so only the language implementation must adapt to new hardware. Although fully achieving such isolation isn't always possible, epDSLs hold promise for making significant strides towards it.
2.250 +
2.251 +\subsection{Low disruption and easy adoption}
2.252 +
2.253 +Using an epDSL tends to have low disruption because the base language remains the same, along with most of the development tools and practices.
2.254 + Constructs from the epDSL can be mixed into existing sequential code, incrementally replacing the high computation sections, while continuing with the same development  practices.
2.255 +
2.256 + \subsection{ Few users means the effort of eDSLs must be low} \label{subsec:eDSLEffort}
2.257 +
2.258 +What appears to be holding epDSLs back from widespread
2.259 +adoption is mainly the time, expertise, and cost to develop an epDSL.  The effort to create a usable epDSL needs to be reduced to the point that it is viable for a user base of only a few hundred.
2.260 +
2.261 +The effort  falls into three categories:
2.262 +
2.263 +\begin{enumerate}
2.264 +\item effort to explore  language design and create the epDSL syntax
2.265 +\item effort to create the runtime that produces the epDSL behavior
2.266 +\item effort to performance tune the epDSL on particular hardware
2.267 +\end{enumerate}
2.268 +
2.269 +
2.270 +\subsection{The big picture}
2.271 +
2.272 +Across the industry as a whole, when epDSLs become successful, there may be thousands of epDSLs, each of which
2.273 +must be mapped onto hundreds of different hardware platforms.  That multiplicative effect must be reduced in order to make the epDSL approach economically viable.
2.274 +
2.275 +The first category of eDSL effort is creating the front-end translation of custom syntax into the base language. This is a one-time effort that does not repeat when new hardware is added.
2.276 +
2.277 +The effort that has to be expended on each platform is the runtime implementation and toolchain optimizations.
2.278 +Runtime implementation includes hardware-specific low-level tuning and modification of mapping of work onto cores.
2.279 +
2.280 +This is where leveraging the proto-runtime approach
2.281 +pays off. Hardware platforms cluster into groups with similar performance-related features.  Proto-runtime
2.282 +presents a common abstraction for all hardware
2.283 +platforms, but a portion of the interface supplies performance related
2.284 +information specific to the hardware. This portion is  specialized for each
2.285 +cluster. Examples of clusters include:
2.286 +
2.287 +\begin{itemize}
2.288 +\item single chip shared coherent memory
2.289 +\item multi-chip shared coherent memory (NUMA)
2.290 +\item coprocessor with independent address space (GPGPU)
2.291 +\item a network among nodes of the above categories (Distributed)
2.292 +\item a hierarchy of sub-networks
2.293 +\end{itemize}
2.294 +
2.295 +
2.296 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2.297 +%
2.298 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2.299 +\section{Our Proposal} \label{sec:Proposal}
2.300 +
2.301 +We propose addressing the runtime effort portion of creating
2.302 +an epDSL by defining a modularization of runtimes, as seen in Fig. \ref{fig:PR_three_pieces}.  The low-level hardware details are collected into one module, which presents a common interface, called the \textit{proto-runtime
2.303 +instance}. The language supplies
2.304 +the top two modules, which plug in via the interface. The hardware specific module  (proto-runtime instance) presents the same interface
2.305 +for all platforms, with a specialization for each category
2.306 +of platform sharing similar performance-related features.  The proto-runtime module only has to be implemented once for a given platform, and is then reused by all the languages.
2.307 +
2.308 +\begin{figure}[ht]
2.309 +  \centering
2.310 +  \includegraphics[width = 1.5in, height = 1.1in]{../figures/proto-runtime__modules.pdf}
2.311 +  \caption{Shows how the proto-runtime approach modularizes the implementation of a runtime. The three pieces are the proto-runtime implementation, an implementation of the language construct behaviors, and an implementation of the portion of  scheduling that chooses which work is assigned to which processor. }
2.312 +  \label{fig:PR_three_pieces}
2.313 +\end{figure}
2.314 +
2.315 +
2.316 +Because of the modularization, a language has a much lower effort requirement: it implements its modules just once for each category of platform.
2.317 +
2.318 +The higher level of abstraction simplifies the task for the language implementer.
2.319 +The language implementation doesn't consider the low-level details of making the runtime itself run fast. It only has to consider the level of hardware features exposed by the interface.
2.320 +
2.321 +One additional benefit is that the assignment module
2.322 +gives control to the language, to choose when and where it wishes work to execute.
2.323 +This  simplifies implementation of language  features related to scheduling behavior.
2.324 +It also enables the language implementor to use sophisticated
2.325 +methods for choosing placement of work, which can significantly impact
2.326 +application performance.
2.327 +
2.328 +In this paper, we present work that applies to coherent
2.329 +shared memory machines, both single chip and multiple chip. Extensions beyond this are currently in progress, to address multiple-address-space machines and hierarchical
2.330 +heterogeneous collections of processors, which will appear in future papers.
2.331 +
2.332 +\subsection{Breakdown of the modules} \label{subsec:Modules}
2.333 +
2.334 +The language is broken into two parts, as seen in Fig.
2.335 +\ref{fig:langBreakdown}. One is a thin wrapper library that
2.336 +invokes the runtime and the other is a set of modules that are part of that invoked runtime. These are called
2.337 +the \textit{language plugin} or just plugin.
2.338 +
2.339 +
2.340 +\begin{figure}[ht]
2.341 +  \centering
2.342 +  \includegraphics[width = 2.8in, height = 1.1in]{../figures/proto-runtime__modules_lang_breakdown.pdf}
2.343 +  \caption{Shows how the code of the language implementation
2.344 +  is broken into two pieces. The first is a thin wrapper
2.345 +  that invokes the runtime, the other is a dynamic
2.346 +  library that plugs into the runtime.}
2.347 +  \label{fig:langBreakdown}
2.348 +\end{figure}
2.349 +
2.350 +
2.351 +
2.352 +Thus, a non-changing application executable is able to invoke hardware specific plugin code, which changes between machines. The plugin collects the two language modules into a dynamic library. The library is implemented, compiled,  distributed and installed separately from  applications.  The application executable contains only symbols of plugin functions, and during the run those are dynamically linked to machine-specific implementations.
2.353 +
2.354 +
2.355 +In order to provide such modularization, we rely upon a model for specifying synchronization constructs that we call the tie-point model. The low-level nature of tie-points places them below the level of constructs,
2.356 +even a simple  mutex. Instead, a mutex is specified in terms
2.357 +of the primitives in the tie-point model. In turn,
2.358 +the  tie-point primitives are implemented
2.359 +by proto-runtime.
2.360 +
2.361 + This places all parallel constructs on the same level in the software stack, be they complex like the AND-OR parallelism of Prolog, or the wild-card matching
2.362 +channels in coordination languages,  or ultra-simple acquire and release mutex constructs. All are implemented in terms of the same tie-point primitives provided by the proto-runtime instance.
2.363 +
2.364 +At this point in the paper, the order of explanation can take one of two paths: either
2.365 +start with the abstract model of tie-points and explain how it affects the modularization of the runtime, or start with implementation details and work upwards towards the abstract model of tie-points.  We have chosen to start with the abstract tie-point model, but the reader is invited to skip to the section after it, which starts with code examples and ties code details to the abstract tie-point model.
2.366 +
2.367 +
2.368 +
2.369 +\section{The Tie-Point Model}\label{subsec:TiePoints}
2.370 +
2.371 +
2.372 +\subsection{Timelines}
2.373 +A tie-point relates timelines, so we first talk a bit about timelines. A timeline is the common element in parallelism.  Any parallel language involves a number of independent timelines, and it controls which timelines are actively progressing relative to the others.
2.374 +
2.375 +For example, take a thread library, which we consider
2.376 +a parallel language.  It provides a command to create a thread, where that thread represents an independent timeline. The library also provides the mutex acquire and release commands, which control which of those timelines advance relative to each other. When an acquire executes, it can cause the thread to block, which means the associated timeline suspends; it stops
2.377 +making forward progress. The release in a different thread clears the block, which resumes the timeline. That linkage between suspend and resume of different timelines is the control the language exerts over which timelines are actively progressing.
2.378 +
2.379 +To build up to tie-points, we look at the nature of points on
2.380 +a single timeline, by reviewing mutex behavior in detail. See the timeline shown in Fig.~\ref{fig:singleTimeline}.  Thread A, which is timeline A, tries to acquire the mutex, M,
2.381 +by executing the acquire command. Timeline A stops, at point 1.S, then something external to it happens, and the timeline starts again at point 1.R.  The gap between is not seen by the code executed within the thread.  Rather, from the code-execution viewpoint, the acquire command is a single command, and hence the gap between 1.S and 1.R collapses to a single point on the timeline.
2.382 +
2.383 +
2.384 +\begin{figure}[ht]
2.385 +  \centering
2.386 +  \includegraphics[width = 2.8in, height = 0.8in]
2.387 +  {../figures/PR__timeline_single.pdf}
2.388 +  \caption{The timeline suspends at 1.S and resumes
2.389 +  at 1.R. From the viewpoint of the timeline, the gap collapses into a single point.}
2.390 +  \label{fig:singleTimeline}
2.391 +\end{figure}
2.392 +
2.393 +
 Fig.~\ref{fig:dualTimeline} shows two timelines: timeline A executing acquire and timeline B executing release. The release also suspends its timeline, but
that timeline quickly resumes because it is not blocked.
The release then causes timeline A to resume as well. The fact
of the release on one timeline has caused the end of the acquire on the other. This causal link makes
the two collapsed points become what we term \textit{tied together} into a \textit{tie-point}.
2.399 +
2.400 +\begin{figure}[ht]
2.401 +  \centering
2.402 +  \includegraphics[width = 2.8in, height = 1.2in]
2.403 +  {../figures/PR__timeline_dual.pdf}
  \caption{Two timelines with ``tied together'' collapsed
points.
2.406 +Point 1 on timeline A forms a tie-point with point
2.407 +2 on timeline B.
2.408 +It is hidden activity that takes place inside the gaps that
2.409 +establishes a causal relationship that ties them together.}
2.410 +  \label{fig:dualTimeline}
2.411 +\end{figure}
2.412 +
Fig.~\ref{fig:dualTimelineWHidden} shows how the release goes about causing the end of the block
on the acquire. It reveals
a hidden timeline, which is what performs the behavior of the
acquire and release constructs.  As shown, acquire starts
with a suspend, which is accompanied by a communication
sent to the hidden timeline.  The hidden timeline then
checks whether the mutex is free, sees that it isn't,
and leaves timeline A suspended. Later, timeline
B performs release, which suspends it and sends a communication
to the same hidden timeline. The hidden timeline then sees that timeline
A is waiting for the release and performs a special
control action that resumes timeline A, followed by
the same control action to resume timeline B.
It is inside the hidden timeline that the acquire
gets linked to the release, tying the constructs together.
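The mechanism just described can be sketched in executable form. The following is our own illustrative model, not proto-runtime code: all mutex state lives on a single hidden service thread, application timelines suspend themselves and send a request to it, and it performs the force-resume-other control action:

```python
import threading, queue

requests = queue.Queue()              # communication into the hidden timeline

class Timeline:
    """An application-visible timeline that can suspend itself."""
    def __init__(self, name):
        self.name = name
        self._resume = threading.Event()
    def suspend_and_send(self, op, mutex):
        self._resume.clear()
        requests.put((op, mutex, self))   # the send accompanies the suspend
        self._resume.wait()               # the gap between 1.S and 1.R
    def force_resume(self):               # the special control action
        self._resume.set()

def hidden_timeline(mutexes):
    """Services acquire/release requests; owns all mutex state."""
    while True:
        op, name, tl = requests.get()
        if op == "stop":
            return
        m = mutexes[name]                 # {"owner": ..., "waiters": [...]}
        if op == "acquire":
            if m["owner"] is None:
                m["owner"] = tl
                tl.force_resume()         # mutex free: resume immediately
            else:
                m["waiters"].append(tl)   # mutex held: leave suspended
        else:                             # release
            if m["waiters"]:
                m["owner"] = m["waiters"].pop(0)
                m["owner"].force_resume() # resume the waiting timeline
            else:
                m["owner"] = None
            tl.force_resume()             # releaser resumes quickly

# Demo: A acquires and releases M, then B does the same.
mutexes = {"M": {"owner": None, "waiters": []}}
threading.Thread(target=hidden_timeline, args=(mutexes,), daemon=True).start()

trace = []
A, B = Timeline("A"), Timeline("B")

def critical(tl):
    tl.suspend_and_send("acquire", "M")
    trace.append(tl.name + " holds M")
    tl.suspend_and_send("release", "M")

ta = threading.Thread(target=critical, args=(A,))
tb = threading.Thread(target=critical, args=(B,))
ta.start(); ta.join()      # run sequentially for a deterministic trace
tb.start(); tb.join()
requests.put(("stop", None, None))
print(trace)
```

Note that the application timelines never touch the mutex state directly; it is the convergence of their requests on the one hidden timeline that links the acquire to the release.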
2.429 +
2.430 +
2.431 +\begin{figure}[ht]
2.432 +  \centering
2.433 +  \includegraphics[width = 2.8in, height = 1.9in]
2.434 +  {../figures/PR__timeline_dual_w_hidden.pdf}
  \caption{Two timelines with ``tied together'' collapsed
points, showing the detail of a hidden timeline that
2.437 +performs the behavior that ties the points together.
2.438 +Vertical dashed lines represent communication sent
2.439 +as part of the suspend action, and the curvy arrows
2.440 +represent special control that causes resume of the
2.441 +target timelines. During the gaps in timelines A and
2.442 +B, activity takes place in the hidden timeline, which
2.443 +calculates that the timelines should be resumed, then
2.444 +exercises control to make resume happen.}
2.445 +  \label{fig:dualTimelineWHidden}
2.446 +\end{figure}
2.447 +
2.448 +
2.449 +
2.450 +We show in \S\ref{sec:FormalTiePoint} that the pattern
2.451 +of communications to and from the hidden timeline establishes
2.452 +an ordering relationship between events before and
2.453 +after the tied points. That implies a relation on
2.454 +the visibility of events.
2.455 +
Fig.~\ref{fig:tie-pointGuarantees} shows the ordering relationship and the implied visibility of operations between
the timelines. Operations that execute in
the first timeline before the tie-point are visible
in the second after the tie-point, and vice versa. Likewise, operations that execute in one timeline after the tie-point are not visible in the other timeline before the tie-point. Such an ordering satisfies
the requirements
of a synchronization construct.
2.462 +
2.463 +
2.464 +
2.465 +\begin{figure}[ht]
2.466 +  \centering
2.467 +  \includegraphics[width = 2.8in, height = 1.25in]
2.468 +  {../figures/PR__timeline_tie_point_ordering.pdf}
2.469 +  \caption{The
2.470 +visibility guarantees that result from a tie-point. Shows which
2.471 + operations, such as writes,  performed on one timeline can be seen by the other
2.472 +timeline. These visibilities are equivalent to establishing
2.473 +an order between events before the tied points versus those after the tied
2.474 +points.  Both timelines agree on what events are before
2.475 +versus after the tied point.  }
2.476 +  \label{fig:tie-pointGuarantees}
2.477 +\end{figure}
2.478 +
2.479 +
2.480 +\subsection{Formal definition of tie-point} \label{sec:FormalTiePoint}
In a moment we will show how any and all synchronization constructs
can be defined in terms of tie-points. Before getting
there, we must choose an (unavoidably arguable) definition of synchronization
construct. We then provide a formal definition of tie-point
and use it to show that a tie-point
satisfies the conditions of any
such synchronization
construct.
2.489 +
2.490 +Our formalism defines timelines, communication between
2.491 +timelines, and suspend and resume of a timeline. It then shows a particular pattern, which is the characteristic pattern that defines a tie-point. We then show that when that characteristic pattern exists, then relations exist between timelines that have certain properties.
We conclude by presenting a few classical definitions
of synchronization and showing that those definitions
are upheld when the tie-point pattern is present. Hence, those classical definitions can be satisfied via the creation of a tie-point.
2.495 +
\subsubsection{Definitions}
2.497 +
2.498 +\begin{description}
2.499 +\item[timeline:]
2.500 +$$T = E \times\mathbb{N}, (E, <)$$.  A timeline is an ordered
2.501 +sequence of events. Given two events $e_\alpha, e_\beta \in E$ from a timeline, the events are ordered by the
2.502 +subscripts, so: $e_\alpha < e_\beta$ iff $\alpha < \beta$,
2.503 +and vice versa.
2.504 + Any and all memory locations in a system are part
2.505 + of, or local to, exactly one timeline.  Only that
2.506 +timeline can modify the locations (hence, side-effects require shared memory to have its own timeline that
2.507 +is separate
2.508 +from any timeline that code executes in).
2.509 +
2.510 +\item[event:]
$$E = \{c_{0,t}, c_{1,t}, \ldots\} \cup \{s_{n,\alpha,t}\} \cup \{r_{n,\beta,t}\} \cup \{z_{\gamma,t}\}$$. There are four kinds of event
2.513 +that can happen on a timeline, namely $c$, a step of computation,
2.514 +which modifies the memory local to the timeline; $s$, a
2.515 +send of a communication which pushes out contents from
2.516 +the timeline's local memory; $r$, a receive of a communication
2.517 +which modifies the timeline's local memory; and $z$,
2.518 +a synchronization
2.519 +construct which suspends then resumes the timeline in such a way
2.520 +as to establish a relation between events on this timeline
2.521 +versus events on a remote timeline. Suspend is denoted
$z\_s_{\gamma,t}$ while resume is denoted $z\_r_{\gamma,t}$, where $s$
2.524 +and $r$ are literal while $\gamma$ denotes the position
2.525 +on the timeline and $t$ is the timeline that executes
2.526 +the synchronization construct.
2.527 +\item[communication:]
2.528 +$$C = \{s,r\}, s < r$$.  A communication is a set of
2.529 +one send event from one timeline plus one or more receive events
2.530 +from different timelines, with the send
event ordered before the receive event(s), denoted $s_{n,\alpha,t} \mapsto r_{n,\beta,t}$, where $n$ distinguishes the communication
2.533 +set, $\alpha$ and $\beta$ are the ordering upon the
2.534 +timeline and $t$ denotes the timeline the event is on.  A communication
2.535 +orders events on one timeline relative to events on another.
2.536 +However, the ordering is only between two points. In
2.537 +particular for two sends from timeline 1 to timeline
2.538 +2, if $$s_{1,\_,1} < s_{2,\_,1}$$ on timeline 1, then on
timeline 2, both $$r_{1,\_,2} < r_{2,\_,2}$$ and $$r_{2,\_,2} < r_{1,\_,2}$$ are valid, where ``$\_$'' in the position
2.540 +of the ordering integer represents a wild
2.541 +card. However, $s_{1,\_,1} \mapsto r_{1,\_,2}$
followed by $s_{2,\_,2} \mapsto r_{2,\_,1}$ where $r_{1,\_,2} < s_{2,\_,2}$
2.544 +  implies that $s_{1,\_,1} < r_{2,\_,1}$ always.
2.545 +
\item[hidden timeline:] We define a special kind of ``hidden'' timeline that is not
2.547 +seen by application code. It has an additional
2.548 +kind of event available, which ends a synchronization
2.549 +event on a different timeline.
 We denote this $fro_{\delta,h}$, where $fro$ is literal,
 standing for ``force resume other (timeline)'', $\delta$ is the position
2.552 + on the timeline and $h$ is the (hidden) timeline the
2.553 +event is on. Additionally, a suspend event on an application
2.554 +visible timeline implies a send from that timeline
to a hidden timeline. Hence $z\_s_{\gamma,t} \Rightarrow s_{n,\gamma,t} \mapsto r_{n,\_,h}$
2.557 +
2.558 +\item[tie-point:] Now, we define a tie-point as a set of two or more
2.559 +synchronization points from different timelines which
2.560 +are related by a particular pattern of communications.
2.561 +As a result of the pattern, the set satisfies particular criteria. The pattern is that communications from the suspend synchronization events must converge on a common hidden timeline and that timeline must then emit a subsequent resume event for each of the suspended timelines,
2.562 +as shown back in Fig. \ref{fig:dualTimelineWHidden}.
2.563 +
2.564 +\end{description}
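To make the characteristic pattern concrete, the following sketch (our own hypothetical encoding of the definitions above, not part of the formalism) identifies events by (timeline, position) pairs and checks whether two suspend events form a tie-point: their sends must converge on one hidden timeline, which must later emit a resume for each suspended timeline.

```python
def is_tie_point(hidden, comms, resumes, z1, z2):
    """Check the characteristic tie-point pattern.

    hidden:  name of the hidden timeline h
    comms:   {(timeline, pos) of a send: (timeline, pos) of its receive}
    resumes: {(hidden, pos) of a fro event: name of the timeline it resumes}
    z1, z2:  (timeline, pos) of the two suspend events being tested
    """
    r1, r2 = comms.get(z1), comms.get(z2)
    # both suspends' communications must converge on the hidden timeline
    if r1 is None or r2 is None or r1[0] != hidden or r2[0] != hidden:
        return False
    last_receive = max(r1[1], r2[1])
    # the hidden timeline must emit a subsequent resume for each timeline
    resumed = {target for (t, pos), target in resumes.items()
               if t == hidden and pos > last_receive}
    return {z1[0], z2[0]} <= resumed

# Suspends on timelines A and B converge on hidden timeline h,
# which then resumes both: the pattern holds.
comms   = {("A", 3): ("h", 1), ("B", 5): ("h", 4)}
resumes = {("h", 6): "A", ("h", 7): "B"}
print(is_tie_point("h", comms, resumes, ("A", 3), ("B", 5)))   # True
# A resume that precedes one of the receives does not form a tie-point.
print(is_tie_point("h", comms, {("h", 2): "A"}, ("A", 3), ("B", 5)))  # False
```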
2.565 +
We now show that from these definitions it follows that
$$e_{\alpha,2} < z_{\delta,2} \;\wedge\; z_{\gamma,1} < e_{\beta,1} \;\Rightarrow\; e_{\alpha,2} < e_{\beta,1}$$
where $z_{\gamma,1}$ and $z_{\delta,2}$ are tied; which says that any event that comes after a tie-point on one timeline is ordered after any event on a different timeline that precedes the tie-point on that timeline (note that the same tie-point is common to both timelines).  The dual also holds true.
2.568 +
2.569 +We take the event immediately preceding and the event
2.570 +immediately following two synchronization events on
two timelines.  Each synchronization event begins with
a suspend half-event and ends with a resume half-event.
2.573 +The suspend half-event is accompanied by a send to
2.574 +a hidden timeline.  That hidden timeline has a receive,
2.575 +and later in its sequence it has a receive for the
2.576 +synchronization event from the second timeline. The
2.577 +hidden timeline then performs resume of both timelines.
2.578 +
2.579 +From that, we get the following relations:
2.580 +
These relations show that the event following the tie-point on timeline 1 comes after the event preceding it on timeline 2, and vice versa.
2.582 +
2.583 +This property of ordering events on two timelines in this way is the key requirement for several classical definitions of synchronization.  Hence, any implementation that exhibits this pattern of synchronization communications converging on a common hidden timeline, which subsequently resumes the synchronizations, in turn satisfies the conditions for a synchronization.
2.584 +
2.585 +\subsubsection{What is different about tie-point?}
Many readers will be wondering ``so, how is implementing
a synchronization construct this way any different
from how they're currently implemented?''  The answer
2.589 +is that currently, synchronization constructs are
2.590 +implemented on top of other synchronization constructs,
2.591 +where we consider an atomic Compare and Swap instruction
2.592 +to be a synchronization construct.  It is only in the
2.593 +hardware that a synchronization construct is assembled
from pieces.  We further claim that the hardware implements them
according to the tie-point pattern described in our formal definition.
2.596 +
2.597 +What we consider to be a tie-point is any point that
2.598 +has this pattern, independent of the semantics added.
2.599 +For example, for the Compare And Swap (CAS) instruction,
2.600 +the comparison and swap are the semantics of what the
2.601 +instruction does, while the atomicity, or exclusive
2.602 +access is the part that provides the ordering relations.
2.603 +So, the presence of the ordering relations is the tie-point
2.604 +portion, while the comparison and swap are the plugged-in
2.605 +semantics portion associated with the tie point.
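This split can be illustrated with a sketch (ours, not how hardware implements CAS; a lock stands in for the exclusive-access portion):

```python
import threading

# A lock stands in for the exclusive-access/ordering portion (the
# tie-point part); the compare and the swap are the plugged-in semantics.
_order = threading.Lock()

def cas(cell, expected, new):
    with _order:            # tie-point portion: atomicity / ordering
        old = cell[0]       # plugged-in semantics: the compare...
        if old == expected:
            cell[0] = new   # ...and the swap
        return old

cell = [0]
print(cas(cell, 0, 1))      # 0  (compare matched, swap performed)
print(cell[0])              # 1
print(cas(cell, 0, 2))      # 1  (compare failed, no swap)
print(cell[0])              # 1
```

Swapping in different code between the `with _order:` lines changes the semantics while leaving the tie-point portion, and hence the ordering relations, untouched.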
2.606 +
In that way, tie-point can be considered to simply
say ``has the ordering relation of a synchronization
construct''. Tie-point is nothing new, when viewed that way.  However, a tie-point is not a given, but rather
2.610 +has to be constructed.  To get a tie-point, one must
2.611 +create a construction from which the givens for a synchronization
can be derived.  Further, tie-points can be constructed
2.613 +for things that most would not readily consider a synchronization
2.614 +construct.  For example, any asynchronous communication
2.615 +establishes a half tie-point, because ordering can
2.616 +be derived.  This is useful, for example, in defining
2.617 +memory consistency models.
2.618 +
2.619 +The key here is the elements of the model within which
2.620 +tie-point is defined.  In particular, memory does not
2.621 +exist outside a timeline, the points on a timeline
2.622 +have no ordering relative to points on another timeline,
2.623 +ordering between timelines is only established by a communication, and timelines can suspend themselves
2.624 +(or be suspended by a different timeline),
2.625 +and be resumed by a different timeline.
2.626 +
2.627 +Within this model, the characteristics of a synchronization
2.628 +can be derived.  That is the key difference, as usually
2.629 +one states as a \textit{given}  that a construct exists that has the synchronization properties.  Tie-point
2.630 +is derived, versus synchronization is given.
2.631 +
True, the two are equally powerful. But tie-points
are more low level: less machinery piled on top, more
efficiency, and more control. With a synchronization
construct like threads, the thread library has its own
scheduler, so one has no control over where and when
work happens.

Tie-points are different in that they directly provide
only half the behavior, the time half.
2.641 +
2.642 +The claim is that from a theory standpoint, tie-point
2.643 +is not more powerful -- proto-runtime can implement
2.644 +synchronization constructs, and sync constructs can
2.645 +implement other sync constructs..
2.646 +
2.647 +But, sync constructs CANNOT implement all of proto-runtime!  They can't do the communications nor the hidden timeline nor create VPs
2.648 +nor scheduling..   also, proto-runtime can do distributed
2.649 +memory things that sync constructs cannot.
2.650 +
2.651 +The sync constructs can be used together with shared
2.652 +memory-based communication in order to make more complex
2.653 +sync constructs..  but they can't be used in a distributed
2.654 +memory system to make distributed memory things.
2.655 +
2.656 +Unless use communication to implement shared memory
2.657 +on top of distributed memory.. things like that.. It's
2.658 +a question of what's fair game in the comparison --
2.659 +proto-runtime the behavior is in the hidden timeline,
2.660 +which is "inside" the construct, in a sense..  but using sync constructs to implement others, you lose
2.661 +that "inside" notion..  it just becomes application
2.662 +code that uses sync constructs..  with the app code
2.663 +running in an application timeline..  so..  need to
2.664 +get at that notion of animator, which has the "hidden"
2.665 +timeline, versus function call..
2.666 +
2.667 +What about this.. it's a matter of constructing from
2.668 +equally powerful versus from less powerful.. mmmm want
2.669 +that notion of animator in there..  and want to get
2.670 +at when an arrangement qualifies as having "switched
2.671 +over to the animator" -- does implementing mutex from
2.672 +just memory ops qualify as switching over to the animator
2.673 +just by entering the code that implements the mutex?
2.674 +Say, place that code in-line in the application code
2.675 +everywhere it's used..
2.676 +
2.677 +Hmmmm.. could use the relation model to show that the
2.678 +pure memory based implementation contains a tie-point,
2.679 +which is how the more-primitive operations are able
2.680 +to construct the more powerful mutex.     That might
2.681 +be a more fruitful, easier to gain acceptance, approach..
2.682 +show that things that have no time-related semantics,
2.683 +only simple one-way communication, are able to construct
2.684 +the time-related semantics.. and it is the presence
2.685 +of the tie-point convergence pattern that does it.
2.686 +
In fact, might take Dijkstra's original mutex-from-just-memory
implementation and show the tie-point pattern
2.689 +within it..  then also show the tie-point pattern within lock-free implementations..  the point being that all
2.690 +you have to show is the presence of the tie-point pattern,
2.691 +in order to prove synchronization properties..  where
2.692 +"synchronization properties" is the existence of the ordering relation.. which is equivalent to agreement of before vs after.. which is equivalent to the visibility
2.693 +relation, which is what a programmer cares about..
2.694 +the visibility is what a programmer requires in a "mutual
2.695 +exclusion".
2.696 +
These visibility guarantees are how it can be guaranteed that
2.698 +those that are still "before" the mutex cannot influence
2.699 +the one "after" the mutex, which is inside the critical section.  And also require vice versa,
2.700 +that the one "after" the mutex, inside the critical
2.701 +section, cannot take actions
2.702 +that influence any "before" it..  similarly at the
2.703 +end of the critical section, need the same isolation.
2.704 +
2.705 +
2.706 +Let's see..  the relation model said that something
2.707 +with synchronization constraints can be created from
2.708 +just communication plus hidden timeline..  as long
2.709 +as get the convergence on that hidden timeline.
2.710 +
2.711 +What Henning was saying was that sync is defined as
2.712 +the end-constraints.  So, the end-constraints IS what
2.713 +a synchronization construct is.  It doesn't matter
how one is implemented; only the end-constraints matter.
2.715 +
2.716 +So, what the relation thing showed was how to construct
2.717 +a synchronization.  What need to show is that the relation
2.718 +thing can also construct stuff that cannot be constructed
2.719 +with a synchronization construct.
2.720 +
2.721 +I guess the question would be: if one starts with a
2.722 +synchronization construct existing within a distributed
2.723 +system..  well, then one can construct other sync constructs
2.724 +from that one..
2.725 +
2.726 +For them, the question of "more primitive" is: can the more primitive
2.727 +thing do stuff the "full" one cannot?
2.728 +
For me, the question of "more primitive" is: can one
of them be constructed from the other, which ONLY
2.731 +has simpler pieces?  Constructing one from itself says nothing..
2.732 +but being able to construct one from something that
2.733 +is NOT one, whose individual components all have less
2.734 +than one..  that thing's pieces are all less powerful..
2.735 +then it is a particular combination that brings the extra
2.736 +time-related behavior of a sync construct into existence.
2.737 +It is recognizing the particular pattern that brings
2.738 +that extra into existence that is of value.
2.739 +
2.740 +It is that pattern that tells you how to get one from
2.741 +simpler pieces.
2.742 +
2.743 +So, the story is: using only pieces that lack the "special"
2.744 +synchronization construct property, construct something
2.745 +that does have the synchronization property.  That,
2.746 +is building something more powerful from pieces that
2.747 +are less powerful.
2.748 +
2.749 +The other part of the story is: the proto-runtime cannot
2.750 +be used by itself.  It requires addition before it
2.751 +can be used.  That is, have to add the $M\mapsto M$, to arrive
at the $T\times M\mapsto M$, then can use the $T\times M\mapsto M$..  but can't
2.754 +use just the $T\times$ by itself -- that's non-sensical.
2.755 +So, provides a $(M\mapsto M, f)$ that is used to get the $T\times M\mapsto M$,
2.756 +but can't use the $f$ inside an application.. it doesn't
2.757 +do anything other than add the Tx..  so it doesn't
2.758 +accomplish any steps of computation, nor does it provide
2.759 +$T\times$ to any application code..  the $(M\mapsto M, f)$ is outside
2.760 +of any language -- that's what CREATES a language.
2.761 +
2.762 +*****Can't define $(M\mapsto M, f)$ as part of its own language,
2.763 +because it doesn't do anything.  No computation is
2.764 +performed by it. ****  (so, what's the definition of
2.765 +computation, then?)
2.766 +
2.767 +The other part of the story is the HWSim time behavior
2.768 +-- those aren't sync constructs..  rather that is a
2.769 +particular set of constraints on time..  constructed
out of primitives none of which have sync nor time
2.771 +behavior by themselves beyond "comes after" of comm.
2.772 +
2.773 +Another part of the story is the singleton thing, constructed
2.774 +directly..  Q: can that be built from sync constructs
2.775 +in distributed system?  Does using sync constructs
2.776 +do something that using primitives doesn't?  Does it
2.777 +add something, fundamentally?  Well, it is in terms
2.778 +of something that already has the property being constructed..
2.779 +that's the issue..  in one case, taking something that
2.780 +has the property and building something else that has
2.781 +it..  in other case taking something that doesn't and
2.782 +building something that does.
2.783 +
2.784 +So..  in the consistency model, just using the comes-after
2.785 +property of communication to derive compound communication,
2.786 +of particular write to particular read, via memory
2.787 +locations.
2.788 +
2.789 +So, what is a tie-point in that consistency model?  It is the pattern that allows deriving an ordering, between different computation timelines.  There, the
2.790 +tie-point was tying a write on one to a read on the
2.791 +other, and thereby establishing a half-ordering between
2.792 +the two timelines.
2.793 +
2.794 +Right..  so that should be it..  that a chain of communications results in an ordering between the end-points.  And that a synchronization is nothing more than two communication chains that are tied together.. where the tie equals the chains SHARING one link, on some intermediate timeline.
2.795 +
2.796 +Right.. thinking about mutex acquire and release..
2.797 +the release is asynch..  the sending timeline resumes before
2.798 +the hidden timeline receives notice.. but that just
2.799 +establishes a half tie-point, no?
2.800 +
2.801 +In the async case, operations after the construct can be seen BEFORE the construct in the other timeline. Right.  So that's a half tie-point.  A full tie-point is that nothing after in either can be seen before by the other.
2.802 +
2.803 +Right.. so one distinction is this: a half tie-point
2.804 +cannot be created using sync constructs "directly".
2.805 + A sync construct is a full tie-point.
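The half tie-point of an asynchronous send can be sketched directly (our illustration): writes made before the send are visible after the receive, while the sender never waits on the receiver, so nothing constrains what the receiver's later operations look like from the sender's side.

```python
import threading, queue

chan = queue.Queue()     # asynchronous channel: the send does not block
data = {}                # stands in for memory communicated by side-effect
seen = []

def sender():
    data["x"] = 42       # write that happens before the send
    chan.put("ready")    # async send: establishes the half tie-point

def receiver():
    chan.get()           # receive: ordered after the send
    seen.append(data["x"])   # the earlier write is guaranteed visible

t1 = threading.Thread(target=sender)
t2 = threading.Thread(target=receiver)
t2.start(); t1.start()   # receiver may start first; it simply waits
t1.join(); t2.join()
print(seen)              # [42]
```

Only the forward direction is guaranteed here, which is exactly the half tie-point; a full tie-point would additionally forbid anything after the construct on either side being visible before it on the other.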
2.806 +
2.807 +
2.808 +================================================
2.809 +
2.810 +
2.811 +\subsubsection{Lifeline, Timeline, and Projection}
2.812 +We define a formal entity that we call a lifeline,
2.813 +where a timeline is a type of lifeline.
2.814 +We define event-types and specific occurrences of event-types, and show how multiple lifelines can observe the same occurrence. A projection between
2.815 +lifelines is defined as an event initiated upon one lifeline being observed on a different lifeline.  The projection is from initiator to observer.
2.816 +
2.817 +\begin{description}
2.818 +\item[event:]
2.819 +$$E$$ represents an event, which is something that
2.820 +can be initiated or observed.
2.821 +\item[occurrence:]
$$O \subseteq E \times\mathbb{N}$$ is the set of occurrences, where each occurrence associates a specific event with a unique identifier. A particular occurrence is denoted by subscripting with the value of the associated
2.823 +integer, for example: $$O_{7}$$
2.824 +
2.825 +\item[clock:]
2.826 +$$t:I\rightarrow\mathbb{R}^{+}$$ maps each integer
onto a real number, such that $$I_{1}<I_{2}\Rightarrow R_{1}<R_{2}$$. In general, different clocks have no relation to each other, but elements associated with a clock have a sequence defined by the integer
2.829 +sequence of the clock.
2.830 +\item[lifeline:]
2.831 +$$l = <\alpha , t>$$
2.832 + is a lifeline, where $$\alpha$$
2.833 +is a sequence over $$Dom(t)$$ and each element of $$\alpha$$ is either an initiation of an occurrence, or an observation
2.834 +of one. A \textit{beat} of the lifeline is one tuple, denoted $$l(i)$$, while the occurrence associated
2.835 +to the beat is denoted $$O(l(i))$$ or equivalently $$O(\alpha(i)).$$ The real value
2.836 +associated with the beat is denoted $$t(l(i))$$.  For a given lifeline, not every element of $$t$$ must have an associated
2.837 +$$\alpha$$, but every $$\alpha$$ must have a unique associated
2.838 +$$I$$ from the clock $$t$$.  Note that $$\forall i , t(l(i)) < t(l(i+1))$$.  At most one beat from one
2.839 +lifeline can initiate an occurrence.  However, multiple
2.840 +beats
2.841 +from a given lifeline can observe the same occurrence,
2.842 +including one initiated earlier in the sequence of
2.843 +the lifeline,
2.844 +and multiple lifelines may observe the same occurrence,
2.845 +each multiple times.
2.846 +
2.847 +\item[projection:]
2.848 +Given $$l_{1} = <\alpha , t_{1}>$$, $$l_{2} = <\beta , t_{2}>$$ then a projection from $$l_{1}$$ to $$l_{2}$$
 is denoted $$l_{1}(i) \uparrow l_{2}(j)$$, where $$l_{1}(i) \uparrow l_{2}(j) \equiv O(l_{1}(i)) = O(l_{2}(j))$$.
2.851 + This says that the occurrence initiated by the ith beat of the first lifeline is observed by the jth beat
2.852 +of the second lifeline.
2.853 +
2.854 +\item[ordering tuple:] $$OT_{}$$ is a tuple consisting
2.855 +of a set of two  beats from two different lifelines, which do
2.856 +not participate in projections, plus a set of projections
2.857 +that cross the two beats in the forward direction.
2.858 +Given $$OT =<[l_{1}(x) , l_{2}(y)], [projections]>$$ then $$OT$$ is an
ordering tuple iff $$[projections] \neq \emptyset \;\wedge\; \nexists\, p(i,j) \in [projections] \mid i<x \wedge j>y$$
2.861 +\item[program run:] $$\mathcal{R}$$ is a particular set of lifelines.
2.862 +The program run begins with the creation of any lifeline, and
2.863 +ends with the end of all lifelines.
2.864 +
2.865 +\item[equivalent positions in different  sequences:] a partial ordering is defined.
2.866 +Given two positions within different sequences, if
one or both can be
2.868 +validly rearranged, by using the partial ordering to
2.869 +define valid rearrangements,  so they occupy
2.870 + the same position in their rearranged sequences, then
2.871 +they are equivalent positions.
2.872 +
2.873 +\item[equivalent occurrences:] two occurrences are
2.874 +equivalent if their event instances cannot be distinguished, given the observation
2.875 +measurements of interest. If the observation measurement
2.876 +involves sequences, then the two events must lie at
2.877 +equivalent positions within their respective sequences.
2.878 +
2.879 +\item[equivalent lifelines:] two lifelines whose beats
2.880 +can be paired, such that every beat in one lifeline
2.881 +has an equivalent beat in the other.  The beats do
2.882 +not have to occur in the same order in both lifelines.
2.883 +Beats associated to occurrences that are not of interest can be dropped.
2.884 +
2.885 +\item[equivalent program runs:] two runs such that
2.886 +their lifelines can be paired one-to-one, with every lifeline in one paired to an equivalent
2.887 +lifeline in the other. The projections between lifelines
2.888 +in one run can be different from the projections in
2.889 +the other run.
2.890 +
2.891 +\item[tie-point:] a set of beats, one from each of two lifelines, such that this set of beats forms a separation set in all equivalent program runs.
2.892 +\end{description}
2.893 +
2.894 +
2.895 +Some things to note: A particular occurrence
2.896 +can be associated to at most one beat from a given
2.897 +lifeline, but that same occurrence can also be associated
2.898 +to beats from multiple other lifelines.  Also, an occurrence may
2.899 +be initiated by a lifeline but never observed by any.
2.900 +Every $$O$$ has a set of projections associated with it.
2.901 +
2.902 +For example, the event could
2.903 +be writing a value into a variable.  Two separate
2.904 +write events are considered equivalent occurrences if
2.905 +they both write the same particular value into whatever memory location
2.906 +is associated to the same particular
2.907 +variable, and happen within valid partial orderings
2.908 +relative to the other occurrences.  This is normally
2.909 +compared across re-creations of the "universe" that
2.910 +provides the context for the orderings of events instances.
2.911 +
2.912 +=========
2.913 +
2.914 + Okay, talked it over with Sung -- what about making distinguished beats
-- as Sung poked around for, make the PR "suspend" be the
2.916 +distinguished beat.  Then, as we worked out talking it
2.917 +through, make the code that happens on the hidden timeline be the linkage between the beats -- so a tie-point is any number of distinguished beats such that the hidden calculation on one of the beats executed the resume for all of the other beats in the tie point.  That establishes how a tie point gets created..  separately, need a universal statement of what is guaranteed by a tie point.
2.918 +
So, one thing, is that the hidden calc is normally chosen such that every equivalent program run reproduces equivalent tie-points -- but defining equivalent relies upon defining the "meaning" of the constructs..  but maybe that thing above about equivalent in terms of partial order can be used, by saying all constructs
2.920 +are associated with a partial ordering -- but, still can have truly non-deterministic behavior being the correct behavior.. hmmm, but that should still have a partial ordering!
2.921 +
  What I really want to do is define tie-point in terms of the write-to-read.  A half tie-point says what's before the pre is visible after in the post timeline.  And a full tie-point says that goes both ways.  So, acquire-release is only a half tie-point, because what's after the release in its timeline can be seen before the acquire in its timeline. That makes it a half tie-point.  Also, what's before the acquire in its timeline does not necessarily have to be seen after the release in its timeline.. that also makes it a half tie-point.
2.923 +
So, use the projection definition, and the crossing definition, to say which crossing projections are allowed by a half tie-point, and which of those must be eliminated to make it a full tie-point.  Then THAT defines the behaviors of a tie-point, independently from how it is created.
2.925 +
2.926 +The full definition of tie-point, in terms of proto-runtime value, has both those -- the hidden timeline "math" thing along with the causality, gives the "creation" aspect of tie-point, and the allowed projections gives the "behavior" aspect of tie-point.
2.927 +
2.928 +From the projection "behavior" I can simply state "this
2.929 +defines what all synchronization constructs do" --
2.930 +the projection behavior is the whole purpose of a sync construct -- to ensure particular communication pattern when comm is via side-effect
2.931 +
2.932 +=======
2.933 +
2.934 +From first model, have the real-value constraints for  slide of suspend and resume relative to each other..
2.935 +
2.936 +The behavior of full tie-point is no back-cross projections, and there is a set of forward-crossing projections, which may be empty, and any of the tied timelines may
2.937 +be the initiating timeline.  For a half tie-point, have the origin lifeline. There is a set of forward-crossing projections with initiation on the origin lifeline,
2.938 +and backward crossings are allowed whose initiation
2.939 +is on a non-origin lifeline.
2.940 +
2.941 +But a tie-point is more than just the behavior it defines.
2.942 +In order for a pair of special beats to form a tie-point, they must be causally linked on their internal lifelines. This means that a sequence of changes of the internal state links the internal activity of one special beat to the internal activity that executes the resume ending the other special beat. All special beats that are resumed inside the same internal activity will have the behavior of a full tie-point. Half tie-points can have the two halves resumed in different internal activities.
2.949 +
2.950 +A special beat has a variable-length span, as measured in the real-number of the clock. A special beat is associated to an isolated atomic span on a hidden lifeline. The only way to end the span of a special beat is via a "resume" beat on the hidden lifeline, which names the special beat to be ended.
2.951 +
2.952 +The internal activity on the hidden lifeline enforces some behavioral description.
2.953 +
2.954 +For
2.955 +example, send-receive descriptions are: send = if paired
2.956 +receiver is in shared context then resume both else place self into shared context. receive: if paired send is in shared context then resume both else place self into shared context.
2.957 +
2.958 +For acquire-release.. acquire: if lock-owner inside shared
2.959 +context is empty then place self-name into lock-owner
2.960 +and resume self else place self onto end of sequence
2.961 +of special beats. release: remove self from lock-owner
2.962 +and place the next in sequence of special beats into
2.963 +lock-owner. If non-empty then resume the new lock-owner.
2.964 +In every case, resume self. Note, acquire-release can
2.965 +form either a half tie-point or a full tie-point.
2.966 +?
2.967 +
2.968 +====
2.969 +
2.970 +Note to the reader.  This is a first pass at a formal description of tie-point. It likely contains more constraints than necessary. It should not be taken as the final formalism, nor is it implied to be elegant in any way, but simply an existence proof for a formal description
2.971 +of a useful subset of what the intuition of tie-point associates to.
2.972 +
2.973 +
2.974 +
2.975 +
2.976 +
2.977 +\subsection{How a synchronization construct relates
2.978 +to tie-points}
2.979 +
2.980 +To prepare for stating how the tie-point model can be used to
2.981 +specify a synchronization construct, we first state
2.982 +clearly what we mean by a ``synchronization construct''.
2.983 +
2.984 +The top of Fig \ref{fig:PRSyncConstrDef} shows two
2.985 +independent timelines, both performing reads and writes
2.986 +within a machine that has coherent shared memory. The
2.987 +timelines have no relative ordering defined, so any
2.988 +write on Timeline A can be received by any read of
2.989 +the same address on
2.990 +Timeline B, and vice versa.  This means that, in general,
2.991 +the use of a variable that is read and written by both will result in non-deterministic behavior.
2.992 +
2.993 +
2.994 +\begin{figure}[ht]
2.995 +  \centering
2.996 +  \includegraphics[width = 2.0in, height = 2.8in]
2.997 +  {../figures/PR__timeline_sync_def.pdf}
2.998 +  \caption{Depicts the meaning we adopt for ``synchronization construct''. Such a construct controls communication between timelines
2.999 +by controlling the slide of timelines relative to each
2.1000 +other. It implies certain visibility between writes and reads on different timelines.}
2.1001 +  \label{fig:PRSyncConstrDef}
2.1002 +\end{figure}
2.1003 +
2.1004 +
2.1005 +
2.1006 +To control the behavior of writes and reads to the
2.1007 +same addresses, a common point must be established, which
2.1008 +limits the ``sliding'' of the timelines relative to
2.1009 +each other. A synchronization construct is used for
2.1010 +this.
2.1011 +The net effect of such a construct is to establish
2.1012 +a common point that both timelines agree on.  This
2.1013 +point separates reads and writes before it from reads
2.1014 +and writes after it.
2.1015 +
2.1016 +For example, consider a simple lock used to protect a critical section.  The lock is acquired by one timeline
2.1017 +before entering the critical section. Any writes performed
2.1018 +on other timelines before the lock was granted must be complete before the critical section starts, so that reads performed inside the critical section see them. This is illustrated in the middle of Fig \ref{fig:PRSyncConstrDef}.
2.1019 +
2.1020 +The critical section ends by releasing the lock, which allows a different timeline to acquire and enter the critical section.  As seen in the bottom of Fig \ref{fig:PRSyncConstrDef},
2.1021 +any writes performed by that new
2.1022 +timeline after it acquires the lock must not be visible
2.1023 +to reads performed by the old timeline before it released
2.1024 +the lock.
2.1025 +
2.1026 +With this intuition, we define a synchronization construct
2.1027 +as an operation performed on a timeline, which has
2.1028 +the property that it creates
2.1029 +a tie-point together with an operation performed on a different
2.1030 +timeline.  Such operations that establish a tie-point
2.1031 +fit our definition of synchronization constructs.
2.1032 +
2.1033 +
2.1034 +\subsection{More on tie-points}
2.1035 +
2.1036 +Fig \ref{fig:dualTimeline} showed how a tie-point can be generated. The establishment was accomplished by
2.1037 +a combination of primitive mechanisms. These include: 1) suspend; 2) an `invisible' timeline that executes
2.1038 +behavior in the gaps; 3) resume
2.1039 +called from that invisible timeline; and 4) enforcement
2.1040 +of instruction completion relative to resume.
2.1041 +
2.1042 +What an established tie-point provides is the notion that the tied points are ``the same instant'' for both tied timelines.  What that means is that both timelines see events ordered relative to that point in the same way.
2.1043 +
2.1044 +
2.1045 +Notice that the primitives that establish a tie-point
2.1046 +do not involve any notion of  dependency or constraint
2.1047 +on order of execution.  It is the behavior code that runs on the invisible
2.1048 + timeline that embodies notions such as dependency
2.1049 + between units of work, mutual exclusion,
2.1050 + partial ordering of work, and so on.  However, the
2.1051 + primitives do provide the notion of causality, the ordering implied by causality, and the enforcement of instruction completion relative to resume.
2.1053 +
2.1054 +It is up to the language to supply the behavior that happens inside
2.1055 +the gaps, which executes on the invisible timeline. This behavior is what decides which timelines end up
2.1056 +sharing a tie point. It is that decision making, of which timelines to tie together, that implements the
2.1057 +semantics of a synchronization construct.
2.1058 +
2.1059 +A workshop paper also discusses tie points
2.1060 +[]. A formal treatment of tie-points is beyond the scope of this paper. However, a formal framework has been substantially completed and
2.1061 +will be published in a future paper.
2.1062 +
2.1063 +
2.1064 +
2.1065 +\subsection{Tie-points within a proto-runtime}
2.1066 +
2.1067 + Fig \ref{fig:dualTimeline} didn't say what entity owns the hidden timeline that executes the behavior that takes place in the gaps. That entity is the proto-runtime: an instance of the
2.1068 +proto-runtime executes the language plugin behavior,
2.1069 +acting as the hidden timeline.
2.1070 +
2.1071 + The proto-runtime code module also supplies implementations
2.1072 +of the primitives that are used to establish a tie-point, including these:
2.1073 +
2.1074 + %It provides the primitive that suspends a timeline and then causes language plugin behavior to execute in the gap.
2.1075 +
2.1076 +%The plugin behavior that runs in the proto-runtime when one timeline suspends is what chooses another timeline to resume as a consequence. That choice establishes causality between the suspensions of the two timelines,  and in the process ensures that a valid tie will exist between the two collapsed timeline points. The code of the primitives  is provided as part of the proto-runtime  code module, while the plugin behavior is executed by an   instance of a running proto-runtime.
2.1077 +
2.1078 +%The running proto-runtime instance is also known as the Master, while the application timelines are known as Slaves.  The behavior of the language constructs executes within the Master's timeline, while the behavior of application code executes within Slave timelines.
2.1079 +
2.1080 +%\subsection{More about the proto-runtime}
2.1081 +
2.1082 +\begin{itemize}
2.1083 +\item create a virtual processor (which has a suspendible timeline)
2.1084 +\item create a task (which has an atomic timeline that runs to completion)
2.1085 +
2.1086 +\item suspend a timeline, then invoke a function to handle the suspension -- the handler is supplied with
2.1087 +parameters from the application
2.1088 +\item resume a timeline, which makes it ready for execution
2.1089 +\item end a timeline
2.1090 +\item trigger choosing which virtual processor or task to begin execution on an offered
2.1091 +core
2.1092 +
2.1093 +\end{itemize}
2.1094 +
2.1095 +Both virtual processors and tasks have associated timelines. The reason for having both is a practical one, as tasks are simpler, with less overhead,
2.1096 +and many languages have the semantics of short, atomic, units of work that
2.1097 +are not intended to suspend. Thus, tasks are treated differently inside the
2.1098 +proto-runtime, and incur less overhead to create and run.
2.1099 +
2.1100 +A special feature of the proto-runtime is that if a task happens to execute
2.1101 +a language command that causes suspension, then the proto-runtime automatically
2.1102 +converts that task to a suspendible virtual processor. This helps support the mixing of different
2.1103 +languages within the same program.
2.1104 +
2.1105 +
2.1106 +The proto-runtime provides a mechanism for communicating information from the application code to the plugin function that was invoked to handle suspension. For example, the identity of a particular mutex a thread wishes to acquire
2.1107 +can be communicated from the wrapper library to the plugin.
2.1108 +
2.1109 +
2.1110 +Because the proto-runtime tracks all the timelines, the end of a timeline has to be explicitly stated in the application code, by calling a wrapper library function. That then invokes the proto-runtime primitive,
2.1111 +which informs the proto-runtime instance. The proto-runtime  performs internal bookkeeping related to the ending of the timeline, and notes that the core is now free and offers it to the plugin's Assigner function.
2.1112 +
2.1113 +The proto-runtime involves the language in the process of choosing which core a given task
2.1114 +or virtual processor executes on. The proto-runtime maintains control, but offers free cores to the Assigner
2.1115 +portion of the plugin, which responds by assigning a task or virtual processor to the core. The proto-runtime just offers; it is up to the language to decide what work that core should receive at that point in time.
2.1116 +
2.1117 +
2.1118 +
2.1119 +\subsection{Concrete Example}\label{subsec:Example}
2.1120 +
2.1121 +To make this concrete, consider the example of implementing
2.1122 +acquire mutex and release mutex. The semantics are:
2.1123 +
2.1124 +\begin{itemize}
2.1125 +\item Acquire Mutex: A thread  calls the construct,
2.1126 +and
2.1127 +provides the name of the mutex. If no thread owns the
2.1128 +mutex, the calling thread is given ownership and it
2.1129 +continues to make progress. However, if a different thread
2.1130 +already owns the mutex, the calling thread is put into a queue
2.1131 +of waiting threads, and stops making progress.
2.1132 +\item Release Mutex: A thread calls the construct and
2.1133 +provides the name of the mutex. If the mutex has waiting threads in its queue, then the next thread is taken out and given ownership of the mutex. That thread is resumed, to once again make progress, as is the thread
2.1134 +that called the release construct.
2.1135 +\end{itemize}
2.1136 +
2.1137 +This calls for a data structure that has two fields:
2.1138 +one holds the thread that currently owns the mutex,
2.1139 +the other holds a queue of threads waiting to acquire
2.1140 +the mutex. The semantics of a construct involve multiple
2.1142 +reads and writes of the data structure. Hence, the
2.1143 + structure must be protected
2.1144 +from races between different threads.
2.1145 +
2.1146 +The protection
2.1147 +is where the difficulty comes into the implementation,
2.1148 +and where performance issues come into the picture.
2.1149 +It could be accomplished with a single global lock
2.1150 +  that uses hardware primitives, or accomplished
2.1151 +with wait-free data structures that only rely upon the coherence
2.1152 +mechanism of the memory system, or even by message passing plus
2.1153 +quorum techniques.
2.1154 +
2.1155 +However, the implementation of the semantics  is independent
2.1156 +of the implementation of the protection. They are orthogonal,
2.1157 +and an interface can be placed between them. One side
2.1158 +of the interface implements checking and updating the fields of
2.1159 +the data structure, while the other side implements
2.1160 +protecting the first side from interference.
2.1161 +
2.1162 +The side that provides protection requires fields,
2.1163 +for its use, to be placed into the data structure used
2.1164 +to represent a thread. To hide those details,
2.1165 +the protection side should also provide
2.1166 +primitives to create and destroy threads, as well as suspend
2.1167 +and resume them.
2.1168 +
2.1169 +This interface that separates the semantic side from
2.1170 +the protection
2.1171 +side is the proto-runtime interface. It is what enables
2.1172 +the modularization of runtime system implementations.
2.1173 +
2.1174 +The tie-point concept provides a model for thinking
2.1175 +about how the semantic side controls ordering among multiple threads, without exposing any details of the protection side. The tie-point model involves thinking only about actions taken during suspension of timelines (threads). It assumes that those actions are protected from interference, and that suspend and resume of timelines are primitive operations made available. The model remains constant regardless of  implementation details.
2.1176 + That provides a cross-hardware way of specifying synchronization
2.1177 +behavior using just sequential thinking. The proto-runtime primitives implement the elements of the tie-point model.
2.1178 +
2.1179 + %Currently, these constructs are either implemented directly in terms of hardware level synchronization constructs such as the atomic Compare And Swap (CAS) instruction, or else are a thin wrapper that invokes operating system behavior. However, the behavior of the OS\ kernel's threading primitives are themselves implemented in terms of hardware level synchronization
2.1180 +%constructs. Either way,  developing the behavior proves
2.1181 +%time consuming due to the difficulty of debugging hardware level synchronization behavior, and due to the difficulty of performance tuning such low level code across the full spectrum of patterns caused by applications.
2.1182 +
2.1183 +
2.1184 +
2.1185 +
2.1186 +
2.1187 +\section{Concrete Details}
2.1188 +Now that we have seen the concepts of how to modularize
2.1189 +a runtime system, using the tie-point model, it is
2.1190 +time to make the concepts concrete by showing code
2.1191 +segments that implement each of the concepts, and code
2.1192 +segments that use the concepts.  We will start with
2.1193 +the big picture and work down.
2.1194 +
2.1195 +The first stop will be the development process, showing
2.1196 +how it is fractured into three separate and independent
2.1197 +development activities.  Next, we will show examples
2.1198 +of how application
2.1199 +code invokes constructs, and follow the path of calls
2.1200 +down to the point it switches over to the runtime system. Lastly,
2.1201 +we will look at the flow of control inside the runtime,
2.1202 +where we will focus on the interaction between plugin
2.1203 +code and proto-runtime code.
2.1204 +
2.1205 +In this last portion, we will show how the
2.1206 +interface supplies the plugin with a consistent ``inside
2.1207 +the runtime'' environment.  Along with that, we will
2.1208 +show how providing
2.1209 +a consistent environment
2.1210 + is an implementation of the ``single hidden timeline'' portion
2.1211 + of the tie-point model. We will also show how it is
2.1212 + the existence of a \textit{single} hidden timeline
2.1213 + that allows the semantic portion of the language constructs
2.1214 +to be written in a sequential style, without regard to concurrency issues.
2.1215 +
2.1216 +
2.1217 +\subsection{Three independent development efforts}
2.1218 +
2.1219 +To get a handle on the big picture,  we describe the
2.1220 +three independent paths that development takes:
2.1221 +one for development of proto-runtime code, one for
2.1222 +development of language implementation, and one for
2.1223 +application development. Each of these produces a separate
2.1224 +installable artifact.
2.1225 +The proto-runtime development produces a dynamic library, for each machine. The language development produces a dynamic library to plug into whichever proto-runtime library is installed on a given machine. It may also produce development tools that are used during compilation, distribution, and even installation and during the run. The application development produces a single source, which the language tools may then turn into multiple executables.
2.1226 +
2.1227 +The proto-runtime code is developed separately from
2.1228 +both language and application code, and packaged as a dynamic library. This library has multiple implementations. Each kind of hardware platform has a proto-runtime implemented specifically for it, and is tuned for low overhead on that hardware. The administrator of a particular machine chooses the proto-runtime implementation best suited to that hardware, and installs that.
2.1229 +
2.1230 +The language code is likewise developed separately from both proto-runtime and application code. Although multiple versions of a language may be implemented, there are significantly fewer versions than the number of proto-runtime versions. That is because most of the hardware details are abstracted away by the proto-runtime interface.
2.1231 +
2.1232 +However, the interface does expose hardware features related to placement of work onto cores, so some variations may exist for the same interface. Again, the administrator chooses which language implementation best suits their machine and installs the corresponding dynamic library.
2.1233 +
2.1234 +The wrapper library, however, is not
2.1235 +installed on the machine where code runs. Rather, it
2.1236 +is only used during development of an application,
2.1237 +and remains independent of hardware.
2.1238 +
2.1239 +Ideally the application is developed only once. It makes calls to the wrapper library, which in turn invokes the dynamic libraries of the language and proto-runtime.
2.1240 +When an application is executed, the loader binds the
2.1241 +dynamic libraries, connecting them to the application.
2.1242 + In this way, a single,
2.1243 +unchanging, executable gains access to machine-specific implementations of language and proto-runtime.
2.1244 +
2.1245 +However, the success of the compile-once approach has
2.1246 +limits in practice. Each machine's characteristics determine the size of unit of work that gives the best performance. When too small, the overhead in the runtime system that is required to create the work, manage constraints, and perform assignment becomes larger than the work
2.1247 +itself. When work-unit size is too large, then not enough units exist to keep all the cores busy. Thankfully, the range between is wide enough, for most applications, that neither limit is hit, on most machines. As machines evolve, though, this happy circumstance is likely to change, necessitating recompiling and possibly hand modifying the application code or some meta-form.
2.1248 +
2.1249 +\subsection{Walk through of activity during execution}
2.1250 +
2.1251 +At this point, we present a picture of the flow of control on each
2.1252 +of two cores, as the core is switched between application
2.1253 +code and runtime code.  It is too early to understand
2.1254 +the details, but this figure can be referred back to
2.1255 +as each portion is discussed in the coming sub-sections.
2.1256 +Each portion of the figure is labelled with the sub-section that describes that portion of activity.
2.1257 +
2.1258 +At the top is the main program, which starts the proto-runtime,
2.1259 +and creates a proto-runtime process.  Below that is
2.1260 +depicted the creation of proto-runtime virtual processors,
2.1261 +along with the animation of application code by those virtual
2.1262 +processors.
2.1263 +
2.1264 +?
2.1265 +
2.1266 +The application  passes information to a wrapper library
2.1267 +call,
2.1268 +such as the ID of the mutex to acquire. The library function packages the
2.1269 +information into a request data structure, then invokes a proto-runtime
2.1270 +primitive. That suspends the virtual processor (timeline) that is executing
2.1271 +that code.  The call to the primitive passes as arguments the request structure and a pointer
2.1272 +to the plugin function that will handle the request.
2.1273 +The handler runs inside the Master and chooses which
2.1274 +other timelines to resume as a consequence of the wrapper-library
2.1275 +call. Those timelines will then resume, returning from
2.1276 +whatever wrapper-library call caused them to suspend.  In this way, the request handler implements the behavior of a
2.1277 +synchronization construct.
2.1278 +
2.1279 +However, there is one last step between the request
2.1280 +handler marking a timeline as ready to resume
2.1281 +and it becoming re-animated. That step is where the
2.1282 +assignment half of the language plugin comes into play.
2.1283 +The request handlers stack up work that is free to
2.1284 +be executed, but it is the assigner that chooses which
2.1285 +of those to place onto an offered core.
2.1286 +
2.1287 +
2.1288 +
2.1289 +
2.1290 +
2.1291 +\begin{figure*}[ht]
2.1292 +  \centering
2.1293 +  \includegraphics[width = 7.0in, height = 4.5in]
2.1294 +  {../figures/Proto-Runtime__modules_plus_plugin_plus_code.pdf}
2.1295 +  \caption{Illustration of the physical time sequence of the timelines of multiple virtual processors executing on multiple
2.1296 +cores. The timelines run top to bottom, while calls
2.1297 +between modules and returns run horizontally. The colors of Fn names indicate whether the
2.1298 +code is part of the application (green), the proto-runtime module (blue), or the language  (red). The top two timelines are animated
2.1299 +by core 1, while the bottom 2 are animated by core
2.1300 +2. The boxes
2.1301 +represent virtual processors, each with its associated
2.1302 +timeline next to it. The timelines have no relative
2.1303 +ordering, except at tie-points established by the Request
2.1304 +Handlers.   Gaps in the timelines are caused by suspension,
2.1305 +which is effected by primitives within the proto-runtime
2.1306 +code module.}
2.1307 +  \label{fig:physTimeSeq}
2.1308 +\end{figure*}
2.1309 +
2.1310 +
2.1311 +
2.1312 +\subsection{Using language constructs}
2.1313 +In the simple form of an eDSL, the language constructs
2.1314 +take the form of function calls. The reader familiar
2.1315 +with posix threads will have used function calls to
2.1316 +perform mutex acquire commands and mutex release commands.
2.1317 +Here, we illustrate invoking language commands in the
2.1318 +same way.
2.1319 +
2.1320 +We use posix threads for our example because it is
2.1321 +a familiar language that the reader already knows well.
2.1322 +It allows us to illustrate the concepts new to proto-runtime without introducing potential confusion about what the language semantics are.
2.1323 +
2.1324 +\subsubsection{Main and startup}
2.1325 +Before using a proto-runtime based language, the proto-runtime
2.1326 +system must be started, and a proto-runtime process
2.1327 +must be created.  Fig X shows this. Notice that the
2.1328 +create-process call was given a pointer
2.1329 +to a function.  This function is the seed of the proto-runtime
2.1330 +based application code.  This seed must start all proto-runtime
2.1331 +based languages that will be used in the application,
2.1332 +and must create the virtual processors and tasks that
2.1333 +perform the work and may in turn create more VPs and/or tasks that perform work.
2.1334 +
2.1335 +==main, with PR\_\_start and PR\_\_create\_process ==
2.1336 +
2.1337 +\subsubsection{Seed birth function and thread birth
2.1338 +function}
2.1339 +Fig X shows our example seed function. It first starts
2.1340 +the language that will be used, which is Vthread. It
2.1341 +is an implementation of posix threads built on top of the proto-runtime.
2.1342 +Next, the seed uses Vthread commands to create two
2.1343 +threads, and then uses Vthread join to wait for both
2.1344 +threads to die.  Lastly it ``dissipates'', which is the
2.1345 +command that kills the virtual processor that is animating
2.1346 +the function.
2.1347 +
2.1350 +
2.1351 +Notice the signature
2.1352 +of the seed birth function. It returns void, and takes a pointer
2.1353 +to void plus a pointer to a SlaveVP struct. This is
2.1354 +the standard signature that must be used for all birth functions for
2.1355 +proto-runtime created virtual processors or tasks.
2.1356 +
2.1357 +
2.1358 +Also, notice that the standard signature includes a
2.1359 +pointer to a SlaveVP struct. This is a proto-runtime
2.1360 +defined structure, which holds the meta-information
2.1361 +about a virtual processor. The birth function is handed
2.1362 +the structure of the virtual processor that is animating
2.1363 +it.
2.1364 +
2.1365 +An illuminating aside is that the birth function for
2.1366 +a posix thread doesn't need
2.1367 +to be handed the structure representing the animating thread.
2.1368 +That is because the operating system tracks which thread
2.1369 +is assigned to which core.  Posix thread constructs work by executing
2.1370 +an instruction that suspends the code executing on
2.1371 +the core and switches
2.1372 +the core over to animating the OS kernel code. The OS kernel
2.1373 +then looks up the data structure that is assigned to
2.1374 +the core.
2.1375 +
2.1376 +That lookup is how the OS kernel gains the
2.1377 +pointer to the thread that was animating the application
2.1378 +code that called the posix construct.  But the implementation
2.1379 +of proto-runtime illustrated in this paper doesn't
2.1380 +have such a hardware based suspend instruction available,
2.1381 +and so proto-runtime-based application code must explicitly pass around the pointer to the data
2.1382 +structure of the virtual processor performing the animation.
2.1383 +
2.1384 +Fig X shows the birth function of the threads created
2.1385 +by the seed birth function.  It uses the Vthread equivalent
2.1386 +of mutex acquire and release to protect access to
2.1387 +a critical section. Notice that the signature
2.1388 +is the same as the signature of the seed birth function.
2.1389 +Also notice that the SlaveVP structure is handed to
2.1390 +each invocation of a Vthread construct.  In the next
2.1391 +several subsections we will track how this SlaveVP structure
2.1392 +is used.
2.1393 +
2.1394 +==thread birth function.. uses Vthread acquire and
2.1395 +release to protect a counter plus print of count value==
2.1396 +
2.1397 +
2.1398 +\subsection{Language Wrapper Library}
2.1399 +
2.1400 +Looking at the implementation of the Vthread calls
2.1401 +reveals code such as in Fig X.
2.1402 +
2.1403 +==wrapper lib code for mutex acquire==
2.1404 +
2.1405 +There's nothing much to it.  It just creates a data
2.1406 +structure, fills it, then hands it to a proto-runtime
2.1407 +call.  This is a standard form for wrapper library
2.1408 +calls. The data structure is used to carry information
2.1409 +into the proto-runtime (the proto-runtime that was
2.1410 +started by the PR\_\_start command).  The PR call is
2.1411 +the equivalent of the hardware instruction that suspends
2.1412 +application code and switches to the kernel.  For the
2.1413 +implementation of PR illustrated in this paper, this
2.1414 +call is implemented with assembly instructions.
2.1415 +
2.1416 +This wrapper library code is placed on the machine
2.1417 +used during development of the application, and is
2.1418 +compiled into the application executable.  However,
2.1419 +the proto-runtime call is a link to a dynamic library,
2.1420 +and is not part of the application executable.
2.1421 +
2.1422 +Notice that the PR\ primitive is given a pointer to
2.1423 +a function.  This is called the handler function, and
2.1424 +is part of the language plugin.  The proto-runtime
2.1425 +will actually perform the call to the handler function, but in a carefully controlled
2.1426 +way. It will provide the handler function with a carefully controlled environment
2.1427 +to use while it handles this wrapper-library call.
2.1428 +We will see in a moment how proto-runtime invokes the
2.1429 +handler function, and what such a handler function
2.1430 +looks like.
2.1431 +
2.1432 +First, here's the assembly that suspends the application code and
2.1433 +switches to the proto-runtime code, as seen in Fig X.
2.1434 +
2.1435 +==assembly of suspend and switch==
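Pending the figure, the switch can be sketched in pseudo-assembly (x86-64 flavor). The label, register choices, and the `SLAVE_*`/`MASTER_*` field offsets into the SlaveVP struct are assumptions; only the save-PC-and-SP, load-PC-and-SP shape mirrors the description below:

```
; pseudo-assembly sketch; SLAVE_*/MASTER_* offsets into the SlaveVP
; struct are assumptions
PR__suspend_and_switch:            ; rdi holds the SlaveVP pointer
    mov  [rdi + SLAVE_SP], rsp     ; save the slave's stack pointer
    lea  rax, [resume_pt]
    mov  [rdi + SLAVE_PC], rax     ; save the address the slave resumes at
    mov  rsp, [rdi + MASTER_SP]    ; load the proto-runtime's saved stack pointer
    jmp  [rdi + MASTER_PC]         ; jump to its saved program counter
resume_pt:
    ret                            ; slave continues here when resumed
```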
2.1436 +
2.1437 +All it does is save the program counter and stack pointer
2.1438 +into the SlaveVP structure, then load in the program
2.1439 +counter and stack pointer of the proto-runtime code,
2.1440 +which was previously saved in different fields of that same SlaveVP structure.
2.1441 +
2.1442 +\subsubsection{proto-runtime code that is switched
2.1443 +to}
2.1444 +
2.1445 +The PR assembly code switches the core to executing
2.1446 +the (pseudo) code seen in Fig X.
2.1447 +
2.1448 +==animation master code, which calls plugin fns==
2.1449 +
2.1450 +All this does is invoke the handler function named
2.1451 +in the wrapper library, and hands it an environment.
2.1452 +This is the hidden environment referred to in the tie-point
2.1453 +model.  It must be accessed in an isolated, atomic,
2.1454 +fashion.  The proto-runtime code seen here happens
2.1455 +to use a global lock for each language's environment.
2.1456 + However other implementations are possible.  In order
2.1457 + to keep overhead low, it uses the Compare And Swap
2.1458 + instruction to acquire the lock, and an exponential random
2.1459 + backoff scheme when contention for the lock arises.
2.1460 +
The handler function is the hidden behavior that executes
on the hidden timeline mentioned in the tie-point model.
The suspend primitive is what begins a special beat on the
lifeline of the virtual processor that executed the
wrapper-library call. It is this handler code that then
establishes the causal connections between such special
beats, and so ties them together. The causal connection is
via the changes made to the language environment.
2.1470 +
So, in summary: the proto-runtime is the hidden timeline.
The suspend primitive is what starts a special beat and
starts the behavior on the hidden timeline. The lock is
what isolates and sequentializes the behavior on the
hidden timeline. The language environment is the hidden
state used to establish causal connections between
special beats.
2.1478 +
2.1479 +
2.1480 +
This is not the plugin code; this is the library that the application executable includes. It is equivalent to the pthread library. When you look at the source of the pthread library, it is just a wrapper that invokes the OS; it doesn't do anything itself. The language libraries are the same thing: just wrappers that invoke the proto-runtime primitives. Those suspend the VP and send a message to the proto-runtime. When the message arrives, the proto-runtime invokes the plugin to handle the request.
2.1482 +
2.1483 +Here's how the wrapper library connects a request to the request handler: via this function pointer, right here Fig X, given to the proto-runtime "suspend and send" primitive. The pointed-to function is part of the plugin. That runs inside the proto-runtime, and is what handles the message created in the wrapper library.
2.1484 +
2.1485 +
If we go and look at that handler function, Fig X, we see that it has a standard prototype, so it takes a standard set of arguments. One of those, here in Fig X, is a language environment. This is the special sauce: it is the thing shared among all the cores. This language environment is where tasks are placed that are not yet ready to execute, and where suspended virtual processors are placed that are not yet ready to resume.
2.1487 +
Here, Fig X, you can see there's a hash table. The language environment contains that hash table. The tasks get parked in this hash table. Each time a task completes, it looks in the hash table, finds all tasks waiting for its completion, and updates the status of those waiting tasks. If this was the last task being waited for, the waiter is taken out of the hash table and put into the queue of ready-to-execute tasks.
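The dependency-tracking step above can be sketched as follows. This is a simplified model with hypothetical names: a fixed-size array stands in for the real hash table, and the counters stand in for whatever status the real plugin records per waiter.

```c
#define MAX_TASKS 16

/* Simplified model of the dependency table in the language
 * environment (names hypothetical). pending[w] counts how many
 * tasks waiter w still waits on; waiting_on[w][t] marks that
 * waiter w waits on task t. */
typedef struct {
    int pending[MAX_TASKS];
    int waiting_on[MAX_TASKS][MAX_TASKS];
    int ready_q[MAX_TASKS];   /* queue of ready-to-execute waiters */
    int ready_count;
} LangEnv;

void add_dependency(LangEnv *env, int waiter, int task)
{
    env->waiting_on[waiter][task] = 1;
    env->pending[waiter]++;
}

/* Called when `task` completes: find every waiter on it, update
 * its status, and if this was the last task being waited for,
 * move the waiter to the ready queue. */
void on_task_complete(LangEnv *env, int task)
{
    for (int w = 0; w < MAX_TASKS; w++) {
        if (env->waiting_on[w][task]) {
            env->waiting_on[w][task] = 0;
            if (--env->pending[w] == 0)
                env->ready_q[env->ready_count++] = w;
        }
    }
}
```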
2.1489 +
This is the semantics of the language: it is how the language defines what dependencies are, and when a task's dependencies have been satisfied. The implementation is just a data structure in the shared language environment. It is the proto-runtime that takes care of creating the tasks and the virtual processors, executing them, and suspending and resuming them. The proto-runtime handles the mechanics of all that; the language just figures out the constraints on making work ready.
2.1491 +
2.1492 +?
2.1493 +
2.1494 +Separately, the proto-runtime calls the Assigner function, which is also part of the plugin dynamic library. Each time a task completes or a virtual processor suspends, the wrapper library invokes a proto-runtime primitive. Among other things, that primitive informs the proto-runtime about the completion of that work, which tells the proto-runtime that hardware resources have just been freed up.
2.1495 +
The proto-runtime then invokes the Assigner function, passing it information about the hardware that was just freed. The assigner is implemented by the language and uses some language-specific way to choose which of the ready work-units to execute on that hardware (a work-unit is either a ready-to-execute task or a ready-to-resume virtual processor). This is how the language is given control over placement of work onto cores.
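An assigner can be sketched as below. The shapes are hypothetical (the real proto-runtime interface differs), and the affinity heuristic is only an illustration of the kind of language-specific choice an assigner makes; it is not a policy any particular language is claimed to use.

```c
/* Hypothetical work-unit record: either a ready-to-execute task
 * or a ready-to-resume virtual processor. */
typedef struct {
    int id;
    int last_core;   /* core this unit last ran on, for affinity */
} WorkUnit;

/* Language-specific assigner sketch: given a core that just freed
 * up, pick a ready work-unit for it. This example prefers a unit
 * that last ran on that core (cache affinity), else falls back to
 * the oldest ready unit. Returns an index, or -1 if none ready. */
int assigner_pick(const WorkUnit *ready, int n, int freed_core)
{
    for (int i = 0; i < n; i++)
        if (ready[i].last_core == freed_core)
            return i;              /* affinity match */
    return n > 0 ? 0 : -1;         /* oldest ready unit, or none */
}
```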
2.1497 +
2.1498 +===================
2.1499 +
2.1500 +
2.1501 +\subsection{not sure}
A task is an atomic unit of work. It runs to completion, without suspending. That characteristic allows the proto-runtime to internally treat a task differently than a virtual processor. Because it never suspends, a task doesn't need a stack and needs less bookkeeping, which makes it faster to create and faster to assign, for lower overhead.
2.1503 +
However, a task may optionally choose at some point to execute a language command that causes it to suspend. At that point, the proto-runtime internally converts the task into a virtual processor. That allows the task to suspend and later resume, at the cost of gaining the normal virtual processor overhead. The virtual processor the task is converted to comes from a recycle pool and returns to the pool when the task completes.
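The recycle pool can be sketched as a simple stack. Names and sizes here are hypothetical; the sketch only shows the take-on-suspend, return-on-complete discipline described above.

```c
#define POOL_SIZE 8

/* Hypothetical recycle pool of virtual processors. When a task
 * executes a suspending construct, the proto-runtime takes a VP
 * from the pool; the VP returns when the task completes. */
typedef struct {
    int vp_ids[POOL_SIZE];
    int top;
} VPPool;

/* Take a VP from the pool, or -1 if the pool is empty. */
int pool_take(VPPool *p)
{
    return p->top > 0 ? p->vp_ids[--p->top] : -1;
}

/* Return a VP to the pool when its task completes. */
void pool_return(VPPool *p, int vp)
{
    if (p->top < POOL_SIZE)
        p->vp_ids[p->top++] = vp;
}
```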
2.1505 +
As an application programmer, you can create processes directly with an OS-like language built on top of the proto-runtime. But you use a programming language to create tasks or virtual processors. For example, VSs has a way to create tasks. VSs then internally uses a proto-runtime command to have the proto-runtime create a task for it. Then VSs decorates the task with its own meta-data, which it uses to track when the task should be executed.
2.1507 +
2.1508 +?
2.1509 +
2.1510 +The only thing you're allowed to do outside a language is create the environment in which you start a language.
2.1511 +
2.1512 +?
2.1513 +
The implementation of the language behavior is the plugin. The plugin has two parts: request handlers, which handle the messages that come when a VP suspends, and an assigner, which picks the core where a particular VP resumes or a task runs. With VSs, the plugin provides the behavior of "submit task".
The request handlers plus the assigner together provide the two halves of what people normally call a scheduler.
2.1516 +
2.1517 +=================
2.1518 +
2.1519 +\subsection{more on tie-points}
Any event visible before the tie-point in one timeline is visible in both timelines after it. The guarantee runs from before, in one timeline, to after, in both.
2.1521 +
From the program's point of view, that acquire statement is one instant. The entire gap in physical time is seen as a single instant by the code.
2.1523 +
However, the tie-point is just one instant in the timelines. After the point, one of the timelines could perform an event that interferes with an event from before the tie-point, and no guarantees are given about what the other timeline sees. However, if another tie-point is created between them, then they are both guaranteed to see that second, interfering event after the second tie-point.
2.1525 +
Take the example of a mutex, M. The purpose of the only-one semantics of a mutex is to isolate the read and write operations done by the owning thread from those done by other threads, which own it before or after.
2.1527 +
The mutex behavior is illustrated in Fig X. Timeline 1 writes to variable A at point 1, then releases M at point 2. Timeline 2 acquires M at the tied point 2 and reads A at point 3. For M to provide isolation, it must guarantee that the write of A at point 1 is seen by the other timeline's read at point 3. Likewise, it has to guarantee that nothing that happens in timeline 2 after the acquire of M, at point 2, will be seen by timeline 1 before its release, also at point 2.
2.1529 +
That ordering guarantee is what we think of when we imagine the behavior of a mutex acquire-release pair. All writes done by the releasing thread are seen as completed by reads performed in the acquiring thread, and no writes in the acquiring thread are seen by the releasing thread before the release. That is required for the semantics of only one thread owning the mutex at any point to have value: the purpose of only-one is to isolate the read and write operations done by the owning thread from those done by the threads that own it before or after.
2.1531 +
2.1532 +
The behavior is implemented in terms of a data structure that lives inside the controlling entity's environment. The controlling entity looks up the data structure for the mutex being requested. This data structure has a field that contains the name of the thread that currently owns the mutex, plus a queue of threads waiting to acquire it. So, the controlling entity first looks at the field that holds the current owner, sees that it is occupied, and then puts the requesting thread's name into the queue of waiting threads.
2.1534 +
At some point later, the waiting thread reaches the top of the queue. When the owning thread executes the release operation, it also suspends; the controlling entity sees that suspend, and sees that the thread wants to perform the release behavior. It looks up the release behavior and performs it. This behavior looks up the mutex data structure in the controlling entity's environment, clears the releasing thread from the owner field, takes the top thread off the waiters, writes its name into the current-owner field, then marks both threads as ready to resume their timelines.
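The mutex record and the acquire and release behaviors just described can be sketched in C. Names and the fixed-size FIFO are hypothetical; the real controlling entity's data structure differs, and the return values here stand in for marking threads ready to resume.

```c
#define NO_OWNER -1
#define MAX_WAIT 8

/* Hypothetical mutex record in the controlling entity's
 * environment: current owner plus a FIFO of waiting threads.
 * (Bounded ring buffer; a sketch, not overflow-safe.) */
typedef struct {
    int owner;
    int waiters[MAX_WAIT];
    int head, tail;
} MutexRec;

/* Acquire behavior: take ownership if free, else join the queue.
 * Returns 1 if the thread may resume at once, 0 if it stays
 * suspended in the queue. */
int mutex_acquire(MutexRec *m, int thread)
{
    if (m->owner == NO_OWNER) {
        m->owner = thread;
        return 1;
    }
    m->waiters[m->tail++ % MAX_WAIT] = thread;
    return 0;
}

/* Release behavior: hand ownership to the head waiter, if any.
 * Returns the thread now ready to resume, or NO_OWNER. */
int mutex_release(MutexRec *m)
{
    if (m->head == m->tail) {          /* no waiters */
        m->owner = NO_OWNER;
        return NO_OWNER;
    }
    m->owner = m->waiters[m->head++ % MAX_WAIT];
    return m->owner;
}
```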
2.1536 +
2.1537 +The proto-runtime is the controlling entity, which looks up the behaviors and performs them.  It also manages the environment that holds the data structures used by the behaviors.
2.1538 +
2.1539 +===========
2.1540 +
The purpose of M is to guarantee that what gets written to A in one timeline is seen over in the other timeline.
2.1542 +
So, to turn this simple mechanism into a synchronization construct, you add semantics on top, which determine when the suspension ends in the two timelines. The timelines voluntarily place themselves into suspension, and it is up to the controlling entity to decide at what point to end it. It is this choice of when to end suspension that ties events in one timeline to events in another. The semantics of deciding that end of suspension is the semantics of the synchronization construct.
2.1544 +
For example, take mutual exclusion with threads. One thread executes a construct that asks to acquire the mutex. At the point of executing it, that thread suspends, so its timeline ceases advancing. At some point later, the controlling entity sees that suspend, and sees that the timeline is attempting the acquire-mutex activity. It looks up the behavior for acquire-mutex, which is then performed inside that controlling entity.
2.1546 +
2.1547 +============
2.1548 +
2.1549 +
2.1550 +\subsection{More on eDSLs}
2.1551 +%======================================
2.1552 +
%We expand on the hypothesis that an embedded style Domain Specific Language (eDSL) provides high programmer productivity, with a low learning curve. We also show (\S ) that when an application is written in a well designed eDSL, porting it to new hardware becomes simpler, because often only the language needs to be ported.  That is because the elements of the problem being solved that require large amounts of computation are often pulled into the language. Lastly (\S ),  we hypothesize that switching from sequential programming to using an eDSL is low disruption because the base language remains the same, along with most of the development tools and practices.
2.1554 +
2.1555 +%In \S \ref{sec:DSLHypothesis} we show that the small number of users of an eDSL means that the eDSL must be very low effort to create, and also low effort to port to new hardware.  At the same time, the eDSL must remain very high performance across hardware targets.
2.1556 +
2.1557 +%In \S we analyze where the effort of creating an eDSL is expended. It turns out that in the traditional approach, it is mainly expended in creating the runtime, and in performance tuning the major domain-specific constructs. We use this to support the case that speeding up runtime creation makes eDSLs more viable.
2.1558 +
2.1559 +%In \S we take a step back and examine what the industry-wide picture would be if the eDSL approach were adopted. A large number of eDSLs will come into existence, each with its own set of runtimes, one runtime for each hardware target.  That causes a multiplicative effect: the number of runtimes will equal the number of eDSLs times the number of hardware targets.  Unless the effort of implementing runtimes reduces, this multiplicative effect could dominate, which would retard the uptake of eDSLs.
2.1560 +
2.1561 +
2.1562 +% ==============
2.1563 +
2.1564 +%Further, in \S we show that when an application is written in a well designed eDSL, porting it to new hardware becomes simpler because often only the language needs to be ported.  That is because the elements of the problem being solved that require large amounts of computation are often pulled into the language. Lastly, in \S we hypothesize that switching from sequential programming to using an eDSL is low disruption because the base language remains the same, along with most of the development tools and practices.  Hence, we cover how the three issues currently making parallel programming unattractive are addressed by embedded-style DSLs.
2.1565 +
2.1566 +%We next show what the blocks to eDSLs are, and where the main effort in implementing an eDSL lies. Specifically, in \S \ref{sec:DSLHypothesis} we show that the small number of users of an eDSL means that the eDSL must be very low effort to create, and also low effort to port to new hardware.  At the same time, the eDSL must remain very high performance across hardware targets.
2.1567 +
2.1568 +%In \S we analyze where the effort of creating an eDSL is expended. It turns out that in the traditional approach, it is expended in creating the translator for the custom DSL syntax, in creating the runtime, and in performance tuning the major domain-specific constructs. We propose that the MetaBorg[] or Rose[] translation approaches cover creating translators for custom syntax, and that tuning constructs is inescapable, leaving the question of runtime implementation time.
2.1569 +
2.1570 +%In \S we explore the effects of runtime implementation time by taking a step back and examine what the industry-wide picture would be if the eDSL approach were adopted. A large number of eDSLs will come into existence, each with its own set of runtimes, one runtime for each hardware target.  That causes a multiplicative effect: the number of runtimes will equal the number of eDSLs times the number of hardware targets.  Unless the effort of implementing runtimes reduces, this multiplicative effect could dominate, which would retard the uptake of eDSLs. Thus, showing that an approach that mitigates this multiplicative effect is valuable, and is the role that the proto-runtime plays.
2.1571 +
2.1572 +
2.1573 +
2.1574 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2.1575 +\subsection{Details}
2.1576 +\label{subsec:Details}
2.1577 +
We describe what responsibilities are encapsulated in which modules, and what the interfaces between them look like.
2.1579 +
The modularization and its interfaces are what make the proto-runtime reusable by all languages on given hardware, and the low-level tuning of the proto-runtime for specific hardware automatically benefits all the languages on that hardware.
2.1581 +
2.1582 +?
2.1583 +
2.1584 +
2.1585 +
2.1586 + overhead measurements
2.1587 +
2.1588 +implementation time measurements
2.1589 +
2.1590 + discuss why equivalent user-level M to N thread packages haven't been pursued, leaving no viable user-level libraries to compare against.
2.1591 +
2.1592 + give numbers that indicate that the proto-runtime approach is also competitive with Cilk, and OMPSs, on large multi-core servers.
2.1593 +
summary of development time of the various embedded languages created so far. Unfortunately, no control is available to compare against, but we provide estimates based on anecdotal evidence of the time taken to develop the versions compared against for overhead. At the least, the same effort that we expended on performance-tuning our proto-runtime would have to be expended on each and every language.
2.1598 +
We continue with a bigger-picture discussion of the difference in design methods between traditional approaches and the proto-runtime implementations (\S ). We discuss OpenMP versus the equivalent proto-runtime version called VOMP (\S ). Then (\S ) we discuss Cilk 5.4 vs the proto-runtime VCilk. Next we discuss pthread vs Vthread (\S ), and OMPSs vs VSs (\S ). These discussions present the two design philosophies and paint a picture of the development process in the two competing approaches. The goal is to illustrate how the proto-runtime approach maintains many of the features, through its centralized services, while significantly reducing implementation time, through reuse of the services, elimination of concurrency concerns in design and debugging, the simplifications in design and implementation caused by the clean modularization of the proto-runtime approach, and the regularization of implementation from one language to another.
2.1602 +
2.1603 +Then, with the full understanding of the proto-runtime approach in hand, we discuss  how it compares to related work (\S ).
2.1604 +
2.1605 +Finally, we highlight the main conclusions drawn from the work (\S ).
2.1606 +
2.1607 +
2.1608 +
2.1609 +?
2.1610 +
2.1611 +
2.1612 +
2.1613 +
The behavior module creates work and determines when work is free to execute; it tracks constraints on work imposed by language semantics, and constraints due to data dependencies.
2.1616 +
A copy of the proto-runtime with language modules runs separately on each core, and the copies communicate via shared variables in a shared language environment. The proto-runtime protects access to the shared language environment so that language modules can be written in sequential style.
2.1618 +
2.1619 +?
2.1620 +
The proto-runtime also implements "centralized" services that it makes available to all languages. Hardware-specific functions include communicating between processors and protecting the internal state used by the language modules.
2.1622 +
2.1623 +
2.1624 +
This lets the proto-runtime be reused by all languages on given hardware, and the low-level tuning of the proto-runtime for specific hardware automatically benefits all the languages that run on that hardware.
2.1626 +
2.1627 + implementing language logic,
2.1628 +
2.1629 +show how the proto-runtime interface allows it to use sequential thinking.
2.1630 +
2.1631 +give similar detail on the implementation of the assigner,
2.1632 +we discuss how that has the potential to improve application performance by reducing communication between cores and reducing idle time of cores.
2.1633 +
2.1634 +support  belief that the patterns we followed when modularizing are indeed fundamental and will remain valid for future languages and hardware.
2.1635 +
2.1636 + discuss some of the centralized services provided by the current proto-runtime implementation, as well as planned future ones.
2.1637 +
2.1638 +reusing language logic from one language implementation to another.
2.1639 +
2.1640 +
2.1641 +%%%%%%%%%%%%%%%%%%%%%%%%
2.1642 +%%
2.1643 +%%%%%%%%%%%%%%%%%%%%%%%%
2.1644 +\section{Measurements}
With the background on eDSLs and description of the proto-runtime approach behind us, we then provide overhead measurements in \S\ref{subsec:OverheadMeas} and implementation time measurements in \S\ref{subsec:ImplTimeMeas}.
2.1646 +
For the following, we use a 4-core, single-socket 2.4GHz laptop, and a 4-socket server with 10 cores per socket.
2.1649 +
2.1650 +For runtime performance:
2.1651 +
2.1652 +-- Vthread vs pthread: laptop and server on exe vs task (and fibonacci?)
2.1653 +
2.1654 +-- VCilk vs Cilk: laptop and server on fibonacci (from Albert)
2.1655 +
2.1656 +-- VOMP vs OpenMP: laptop and server on exe vs task and fibonacci
2.1657 +
2.1658 +-- VSs vs OMPSs: laptop and server on fibonacci and jpeg
2.1659 +
2.1660 +\begin{tabular}{|c|c|c|c|c|c|c|}\hline
2.1661 +a & 2 & a & a & a & a & a \\\hline
2.1662 +a & 2 & a & a & a & a & a \\\hline
2.1663 +a & a & a & a & a & a & a \\\hline
2.1664 +a & a & a & a & a & a & a \\\hline
2.1665 +\end{tabular}
2.1666 +\caption{}
2.1667 +\label{tab}
2.1668 +
2.1669 +As seen, we didn't include application performance because we have not yet taken advantage of the opportunity to use language information to predict locality.  That research is in progress and will be reported in future papers.
2.1670 +
2.1671 +
2.1672 +\subsubsection{Vthread Versus Highly Tuned Posix Threads}
Measurements indicate that the proto-runtime approach has far lower overhead than even the current highly tuned Linux thread implementation. We also discuss why equivalent user-level M-to-N thread packages haven't been pursued, leaving no viable user-level libraries to compare against.
2.1675 +\subsubsection{VCilk Versus Cilk 5.4}
In \S we give numbers that indicate that the proto-runtime approach is also competitive with Cilk.
2.1677 +\subsubsection{VSs Versus StarSs (OMPSs)}
2.1678 +OMPSs
2.1679 +\subsubsection{VOMP Versus OpenMP}
2.1680 +VOMP
2.1681 +
2.1682 +
2.1683 +%%%%%%%%%%%%%%%%%%%%%%%%
2.1684 +%%
2.1685 +%%%%%%%%%%%%%%%%%%%%%%%%
2.1686 +\subsection{Development Time Measurements}\label{subsec:ImplTimeMeas}
Here we summarize the time to develop each of the eDSLs and each copy-cat language created so far. As a control, we estimate, based on anecdotal evidence, the time required to create the equivalent functionality using the traditional approach.
2.1688 +
Table \ref{tabPersonHoursLang} summarizes measurements of the time we spent to design, code, and debug an initial working version of each of the languages we created. The results are shown in the order we created them, with SSR first. As we gained experience, design and coding became more efficient. These are hours spent at the keyboard or with pen and paper, and don't include think time during other activities in the day.
2.1691 +
2.1692 +
2.1693 +\begin{centering}
2.1694 +\begin{tabular}{|l|r|r|r|r|r|r|r|}
2.1695 +  \cline{2-8}
2.1696 +  \multicolumn{1}{r|}{} & SSR & Vthread & VCilk & HWSim & VOMP & VSs & Reo\\
2.1697 +  \cline{2-8}
2.1698 +  \noalign{\vskip2pt}
2.1699 +  \hline
2.1700 +  Design & 19 & 6 & 3 & 52 & 18& 6 & 14\\
2.1701 +  Code & 13 & 3 & 3& 32 & 9& 12 & 18\\
2.1702 +  Test & 7 & 2 & 2& 12 & 8& 5 & 10\\
2.1703 +  L.O.C. & 470 & 290 & 310& 3000 & 690 & 780 & 920\\
2.1704 +  \hline
2.1705 +\end{tabular}
2.1706 +\caption
2.1707 +{Hours to design, code, and test each embedded language. L.O.C. is lines of (original) C code, excluding libraries and comments.
2.1708 +}
2.1709 +\end{centering}
2.1710 +\label{tabPersonHoursLang}
2.1711 +
2.1712 +%\subsubsection{Comparison of Design Approaches}
2.1713 +%We give the bigger picture of the difference in  approach for each language, between the proto-runtime implementation and the distributed implementation.  The goal is to illustrate how the proto-runtime  centralized services, while significantly reducing implementation time, through reuse of the services, elimination of concurrency concerns in design and debugging, and in the simplifications in design and implementation caused by the clean modularization of the proto-runtime approach, and the regularization of implementation from one language to another.
2.1714 +
2.1715 +
2.1716 +%%%%%%%%%%%%%%%%%%%%%%%%
2.1717 +%%
2.1718 +%%%%%%%%%%%%%%%%%%%%%%%%
2.1719 +\section{Related Work} \label{sec:Related}
2.1720 +
2.1721 +We discuss  how proto-runtime compares to other approaches to implementing the runtimes of domain specific languages.  The criteria for comparison are: level of effort to implement the runtime, effort to port the runtime, runtime performance, and support for application performance. The main alternative implementation approaches are: posix threads, user-level threads, TBB, modifying libGomp, and using hardware primitives to make a custom runtime.
2.1722 +
2.1723 +We  summarize the conclusions in Table \ref{tab:CriteriaVsApproach}.
2.1724 +
2.1725 +
2.1726 +\begin{center}
\caption{Table \ref{tab:CriteriaVsApproach} shows how well each approach scores on the measures important to implementors of runtimes for DSLs. On the left are the implementation approaches; at the top are the measures. Each cell holds the approach's score on that measure: one plus is the lowest score, indicating the implementation approach is undesirable, and five plusses indicate the highest desirability. The reasons for the scores are discussed in the text. } \label{tab:CriteriaVsApproach}
2.1729 +
2.1730 +\begin{tabular}{|c|c|c|c|c|}\hline
2.1731 +Runtime Creation  & \textbf{impl.}& \textbf{porting} & \textbf{runtime} & \textbf{application} \\
2.1732 +\textbf{} & \textbf{ease} & \textbf{ease} & \textbf{perf.} & \textbf{perf.}\\\hline
2.1733 +\textbf{OS Threads} & ++ & ++ & + & + \\\hline
2.1734 +%\textbf{User Threads} & ++& ++ & ++ & + \\\hline
2.1735 +\textbf{TBB} & ++ & ++ & ++ & + \\\hline
2.1736 +\textbf{libGomp} & +++ & ++ & +++ & ++++ \\\hline
2.1737 +\textbf{HW primitives} & + & + & +++++ & +++++ \\\hline
2.1738 +\textbf{Proto-runtime} & +++++ & +++++ & ++++ & +++++\\\hline
2.1739 +\end{tabular}
2.1740 +\end{center}
2.1741 +
2.1742 +
2.1743 +
The first two methods have poor runtime and application performance. They involve building the DSL runtime on top of OS threads or TBB, both of which have runtimes in their own right, so the DSL runtime runs on top of the lower-level runtime. This places control of work placement inside the lower-level runtime, out of reach of the DSL runtime, which hurts application-code performance due to the inability to exploit data locality. In addition, OS threads have operating-system overhead and OS-imposed fairness requirements, which keep runtime performance poor, as seen in Section \ref{sec:VthreadVsPthread}.
2.1746 +
2.1747 +Both also force the DSL implementation to manage concurrency explicitly, using lower-level runtime constructs such as locks.  TBB may have a slight advantage due to its task-scheduling commands, but only for task-based languages. Hence, implementation effort is poor for these approaches.
2.1748 +
For the same reason, porting is poor for these two approaches. The DSL's runtime code needs to be rewritten and tuned for each hardware platform, or else some form of hardware abstraction must be placed into the runtime. But putting in a hardware abstraction is essentially an alternative way of implementing half of the proto-runtime approach, without the centralization, reuse, and modularization benefits.
2.1751 +
Moving on to libGomp: some language researchers use it (based on informal discussions) because of its very simple structure, which makes it relatively easy to modify, especially for simple languages. However, it provides no services such as debugging or performance tuning, and it has no modularization or cross-language reuse benefits. As the price of the simplicity, performance suffers, as seen in the experiments []. Also, rewrites of the DSL runtime are required for each platform in order to tune it to hardware characteristics. However, because the runtime is directly modified, the language gains control over placement of work, enabling good application performance if the extra effort is expended to take advantage of it.
2.1754 +
Lastly, we consider the alternative of writing a custom runtime from scratch, using hardware primitives such as the Compare And Swap (CAS) instruction, or similar atomic read-modify-write instructions. This approach requires the highest degree of implementation effort, and has the worst portability across hardware. However, if sufficient effort is expended on tuning, it can achieve the best runtime performance and equal the best performance of application code. So far, the gap has proven small between highly tuned language-specific custom runtime performance and that of our proto-runtime, but we only have the Cilk implementation as a comparison point.
2.1756 +
2.1757 +Putting this all together, Table \ref{tab:CriteriaVsApproach} shows that the proto-runtime approach is the only one that scores high in all of the measures. It makes initial language implementation fast, as well as reduces porting effort, while keeping runtime performance high and enabling high application performance.
2.1758 +
2.1759 +
2.1760 +
2.1761 +%%%%%%%%%%%%%%%%%%%%%%%%
2.1762 +%%
2.1763 +%%%%%%%%%%%%%%%%%%%%%%%%
2.1764 +\section{Conclusions and Future Work}
The main takeaways from the paper are, first, the potential for embedded-style Domain Specific Languages (eDSLs) to address the issues that are holding back parallel programming, and second, the role that the proto-runtime approach can play in making eDSLs practical, by simplifying the runtime aspect of implementing a large number of eDSLs across the many hardware targets.
2.1766 +%The proto-runtime approach does this by modularizing the runtimes, providing reuse of centralized services, and reuse of the hardware-specific performance tuning, which is performed once per hardware, on the proto-runtime, then enjoyed by all the eDSLs.  Hence, the proto-runtime approach provides a significant piece of the puzzle of providing eDSLs, to bring parallel programming into the mainstream.
2.1767 +
2.1768 +
2.1769 +%[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]]
2.1770 +
2.1771 +
2.1772 +Specifically, we have shown how the approach modularizes runtime code, in a way that appears applicable to any language or execution model. It isolates the hardware-specific portion  from language behavior as well as from the language-driven placement of work onto resources, providing interfaces between them.
2.1773 +
The modularization reduces the effort of implementing a new language, especially for an embedded-style one where runtime creation is a significant portion of total effort. It causes the low-level hardware portion to be reused by each language. And the behavior implementation is simplified by handling shared state inside the proto-runtime and exporting a sequential interface for the behavior module to use. The simplification reduces effort, as does reuse of the hardware-specific portion, reuse of behavior code from one language to another, reuse of assignment code, and implementors' familiarity with the modular structure. The overall effort reduction is supported by measurements of implementation effort.
2.1775 +
2.1776 +The proto-runtime approach makes it practical to maintain high overall runtime performance, with low effort for the language implementor. It is practical because high effort is put into performance-tuning the hardware-specific proto-runtime, which is then reused by each language. In this way the performance derived from the high tuning effort is inherited without extra effort by the language creators, thus amortizing the cost.
2.1777 +
2.1778 +Centralized services were implemented inside the proto-runtime portion, such as debugging facilities, automated verification, concurrency handling, hardware performance information gathering, and so on. We showed how they are reused by the languages.
2.1779 +
Although we didn't measure it, we indicated how application performance can be increased by giving the language direct control over placement of work, to take advantage of data affinity or application-generated communication patterns. This ability is due to the assignment module, which gives the language implementor control over which core work is assigned to, and over the order of executing each work unit.
2.1781 +
2.1782 +Work on the proto-runtime approach is in its infancy, and much remains to be done, including:
2.1783 +\begin{itemize}
2.1784 +\item  Creating related interfaces for use with distributed memory hardware, and interfaces for hierarchical runtimes, to improve performance on many-level hardware such as high-performance computers, and to tie together runtimes for different types of architecture, to cover heterogeneous architectures and machines.
2.1785 +\item Extending the proto-runtime interface to present hardware information that a work-assigner will need, but in a generic way that remains constant across many hardware configurations yet exposes all relevant information.
2.1786 +\item Exploring work assignment implementations that take advantage of language and application knowledge to improve placement of work to gain higher application performance.
2.1787 +\item Applying the proto-runtime approach to support a portability software stack, and supply OS services to applications via the proto-runtime, to further increase application-code portability.
2.1788 +\end{itemize}
2.1789 +
2.1790 +
2.1791 +\end{document}
2.1792 +=============================================
2.1793 +==
2.1794 +==
2.1795 +==
2.1796 +==
2.1797 +==
2.1798 +=============================================
2.1799 +
2.1800 +\section{The Problem}
2.1801 +
2.1802 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
2.1803 +
2.1804 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
2.1805 +
2.1806 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
2.1807 +
2.1808 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
2.1809 +
2.1810 +
2.1811 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
2.1812 +
2.1813 +While discussing the problems encountered by Domain Specific Languages (DSLs), we focus on the implications for the runtime system, due to its central role in the claims. At the same time we support the hypothesis that embedded-style DSLs are high-productivity for application programmers, have a low learning curve, and cause little disruption to current programming practices. In doing so we lay the groundwork for the next section, where we show that the main effort of implementing an embedded-style DSL is creating the runtime, and that with the proto-runtime approach, embedded-style DSLs are low-effort to create and port, moving the effort of porting for high performance out of the application and into the language.
2.1814 +
2.1815 +To give the needed depth, we'll first talk about a way to classify parallel languages  according to the structure of their runtime (subsection \ref{subsec:ClassifyingLangs}).  Then we'll talk about the sub-class of domain specific parallel languages, what sets them apart, and the implications for their runtime implementations (subsection \ref{subsec:DomSpecLangs}). That segues into the embedded style of language, and how the work of implementing them is mainly the work of implementing their runtime (subsection \ref{subsec:EmbeddedDSLs}).
2.1816 +
2.1817 +Once that reduction from parallel languages in general to embedded style domain specific ones in particular is done, we'll give more on what embedded style DSLs look like from an application programmer's view (subsection \ref{subsec:AppProgViewOfDSL}). We will include depth on a particular embedded-style language, showing sample code that uses the constructs, then delving into needs within the implementation of that language, and behavior of the constructs during a run (subsection []).
2.1818 +
2.1819 +The main implications for runtime systems, which were uncovered within the section, are summarized at the end (subsection []).
2.1820 +
2.1821 +\subsection{Classifying parallel languages: virtual-processor based vs.\ task based}
2.1822 +\label{subsec:ClassifyingLangs}
2.1833 +
2.1834 +One major axis for classifying parallel languages is whether they are virtual processor based or task based, which has implications for the structure of the runtime.
2.1835 +
2.1836 +A virtual processor is long-lived and has a context that persists across suspend and resume, while a task has no preceding context to fit into and leaves no implied context when done. POSIX threads is a standard example of a virtual-processor based parallel language, as are UPC, Charm, and TBB. All of these create virtual processors (a.k.a.\ threads), which suspend when they invoke synchronization and other parallel-language constructs, then resume after the construct completes. Such virtual processors have their own private stack to save the information needed upon resume.
2.1837 +
2.1838 +In contrast, dataflow is a standard example of a task based language, as is CnC. For these languages, a task is passed all the information it needs at the point of creation, and is expected to run to completion.  If a task needs to invoke a parallelism construct, that invocation normally ends the task, while information needed by following tasks is saved explicitly in shared variables, or passed to the runtime as a continuation that is then handed to the task created when the construct completes.
2.1839 +
2.1840 +Hybrids of the two also exist, such as OpenMP, which implies thread creation via the parallel pragma but also creates tasks via the for pragma. Likewise, StarSs (OmpSs) mixes the two, with a main thread that creates meta-tasks that must resolve their dependencies before being turned into executable tasks. Those tasks are also able to invoke barriers and other synchronization constructs, then resume.
2.1841 +
2.1842 +The runtime implementations of the two different types of execution model differ markedly.  Virtual processor (VP) based runtimes have to create a stack for each VP created, and manage the interleaving of the CPU's hardware stack.  They also require a mechanism to suspend and resume the VPs, and save them in internal structures while suspended.
2.1843 +
2.1844 +In contrast, task based runtimes need ultra-fast creation of tasks, and fast linkage from the end of one to the start of the next.  They tend to keep the task-structures in a queue and discard them when complete.
2.1845 +
2.1846 +Hence, VP based runtimes revolve around storing suspended VPs inside structures that embody the constraints on when each VP can resume, while task based runtimes revolve around the conditions upon which to create new tasks and the organization of the inputs to them. The runtimes for hybrid languages have characteristics of both.
2.1847 +
2.1848 +
2.1849 +\subsection{Domain specific parallel languages}
2.1850 +\label{subsec:DomSpecLangs}
2.1861 +
2.1862 +Now we'll talk about the sub-class of Domain Specific Languages (DSLs): what sets them apart from other parallel languages, how they potentially solve the issues with parallel programming, and the implications for their runtime implementations.
2.1863 +
2.1864 +DSLs can be any of the three basic language types (VP based, task based, or hybrid), but they are distinguished by having constructs that correspond to features of one narrow domain of applications. For example, we have implemented a DSL that is just for use in building hardware simulators [cite the HWSim wiki]. Its constructs embody the structure of simulators, and make building one faster and even simpler than when using a sequential language, as will be shown in Subsection []. The programmer doesn't think about concurrency, nor even about control flow; they simply define the behavior of individual hardware elements and connect them to each other.
2.1865 +
2.1866 +It is this fit between language constructs and the mental model of the application that makes DSLs highly productive and easy to learn; at the same time, it is also what makes applications written in them more portable. Application patterns that have a strong impact on parallel performance are captured as language constructs. The rest of the source code has less impact on parallel performance, so porting the language alone is enough to get high performance on each hardware target.
2.1867 +
2.1868 +In practice, designing such a language is an art, and for some hardware targets the language can become intrusive. For example, when porting to GPGPUs, performance is driven by decomposition into many small, simple kernels that access memory in contiguous chunks. Fitting into this pattern forces rearrangement of the base sequential code, and even constrains the choice of algorithm. Hence, a DSL that is portable to standard architectures as well as GPUs would place the GPU restrictions onto the code for all machines. However, much excellent work [polyhedral, others] is being done on automated tools to transform standard code to GPU form, which would lift these restrictions. Also, constructs such as the DKU pattern [] map well onto GPUs as well as standard hardware.
2.1869 +
2.1870 +\subsection{The embedded style of DSL}
2.1871 +\label{subsec:EmbeddedDSLs}
2.1882 +
2.1883 +We segue now into the embedded style of language, and show how the work of implementing such languages is mainly the work of implementing their runtime plus their more complex domain constructs. We focus on embedded-style domain specific languages because this is the form of DSL that requires the least effort to create, and making DSLs practical requires that they be low effort both to create and to port to various hardware targets.
2.1884 +
2.1885 +
2.1886 +An embedded-style language is one that uses the syntax of a base language, like C or Java, and adds constructs that are specific to the domain. An added construct may be expressed in custom syntax that is translated into a library call, or else invoked directly by making a library call, as illustrated in Figure \ref{fig:EmbeddedEx}. Inside the library call, a primitive is used to escape the base language and enter the embedded language's runtime, which then performs the behavior of the construct.
2.1887 +
2.1888 +
2.1889 +\begin{figure}[h!tb]
2.1890 +{\noindent
2.1891 +{\footnotesize
2.1892 +{\normalsize Creating a new virtual processor (VP):}
2.1893 +\begin{verbatim}
2.1894 +newVP = SSR__create_VP( &top_VP_fn, paramsPtr, animatingVP );
2.1895 +\end{verbatim}
2.1896 +
2.1897 +{\noindent {\normalsize sending a message between VPs:}}
2.1898 +\begin{verbatim}
2.1899 +SSR__send_from_to( messagePtr, sendingVP, receivingVP );
2.1900 +\end{verbatim}
2.1901 +
2.1902 +{\noindent {\normalsize receiving the message (executed in a different VP):}}
2.1903 +\begin{verbatim}
2.1904 +messagePtr = SSR__receive_from_to( sendingVP, receivingVP );
2.1905 +\end{verbatim}
2.1906 +}
2.1907 +}
2.1908 +
2.1909 +\caption
2.1910 +{Examples of invoking embedded-style  constructs.
2.1911 +}
2.1912 +\label{fig:EmbeddedEx}
2.1913 +\end{figure}
2.1914 +An embedded-style language differs from a library in that it has a runtime system and a way to switch from the behavior of the base language to the behavior inside the runtime. In contrast, libraries never leave the base language. Notice that by this criterion a POSIX threads library is not a library at all, but an embedded language.
2.1915 +
2.1916 +As a practical matter, embedded-style constructs normally have a thin wrapper that invokes the runtime. However, some DSLs perform significant work inside the library before switching to the runtime, or after returning from it. These look more like traditional libraries, but still involve an escape from the base language and, more importantly, are designed to work in concert with the parallel aspects of the language. They concentrate key performance-critical aspects of the application inside the language, such as dividing work up, or, for example, implementing a solver for differential equations that accepts structures created by the divider.
2.1917 +
2.1918 +It is the appearance of constructs as library calls that brings the low-disruption benefit of embedded-style DSLs. The syntax is that of the base language, so existing development tools and work flows remain intact when moving to an embedded-style DSL. In addition, the fit between domain concepts and language constructs minimizes mental-model disruption when switching, and makes the learning curve for adopting the DSL very low.
2.1919 +
2.1920 +\subsection{Application programmer's view of embedded-style DSLs}
2.1921 +\label{subsec:AppProgViewOfDSL}
2.1932 +
2.1933 +Well designed DSLs have very few constructs, yet capture the most performance-critical domain patterns, in a way that feels natural to the application programmer.  This often means that data structures and usage patterns are part of the language.
2.1934 +
2.1935 +For example, a linear-equation-solving language would define a standard data structure for the coefficients of the equations, and supply a construct by which the language is asked to perform the work of solving them. This feels very much like a library, but the runtime system dynamically divides the work according to the hardware, implements communication between cores, and provides a scheduler that load balances and tries to take advantage of data affinity and even computational accelerators. All of this puts performance in the hands of the runtime while remaining simple to use.
2.1936 +
2.1937 +An example of a DSL that we created using the proto-runtime approach is HWSim [], which is designed to be used for writing architectural simulators.
2.1938 +
2.1939 +When using HWSim, a simulator application is composed of just three things: a netlist, behavior functions, and timing functions. These are all sequential code that call HWSim constructs at boundaries, such as the end of a behavior, and use HWSim-supplied data structures. To use HWSim, one creates a netlist composed of elements and communication paths that connect them. A communication path connects an outport of the sending element to an inport of the receiving element. An action is then attached to the inport, and is triggered when a communication arrives. The action has a behavior function, which changes the state of the element, and a timing function, which calculates how much simulated time the behavior takes.
2.1940 +
2.1941 +The language itself consists of only a few standard data structures, such as \texttt{Netlist}, \texttt{Inport}, and \texttt{Outport}, and a small number of constructs, such as \texttt{send\_comm} and \texttt{end\_behavior}. The advancement of simulated time is performed by each triggered action, and so is implied. The parallelism is also implied: the only constraint on the order in which actions execute is consistency.
2.1942 +
2.1943 +The only parallelism-related restriction is that a behavior function may only use data local to the element it is attached to. Parallel work arises within the system from outports that connect to multiple destination inports, so that one output triggers multiple actions, and from behavior functions that each generate multiple output communications.
2.1944 +
2.1945 +Overall, simulator writers have fewer issues to deal with, because time-related code has been brought inside the language, where it is reused across simulators, and because parallelism issues reduce to being restricted to data local to the attached element. Both of these increase the productivity of simulator writers, despite their using a parallel language. The language has so few commands that it takes only a matter of days to become proficient (as demonstrated informally by new users of HWSim). Also, the parallelism-related constructs in the language are generic across hardware, eliminating the need to modify application code when porting to new hardware (if the language is used according to the recommended coding style).
2.1946 +
2.1947 +\subsection{Implementation of Embedded-style DSLs}
2.1958 +
2.1959 +When it comes to implementing an embedded-style of DSL, the bulk of the effort is in the runtime and the more complex domain specific constructs.
2.1960 +
2.1961 +Examples of constructs implemented for DSLs include abstract data types (ADTs), like linked lists, hash tables, and priority queues, as well as full algorithms, like solvers for systems of equations, or even linear-algebra operations on matrices. It will be seen in subsection [] that the proto-runtime approach allows the implementation of such constructs to be reused, with high performance, across all the hardware targets in a hardware class, such as the class of shared-memory multi-core platforms.
2.1962 +
2.1963 +In addition, embedded style DSLs rely heavily on data types that are part of the language.  These are often domain-specific such as \texttt{Netlist}, \texttt{Inport}, and \texttt{Outport} in HWSim, or \texttt{Protein} in a bio-informatics DSL, but can also be common such as \texttt{SparseMatrix} in domains like data mining and scientific applications.
2.1964 +
2.1965 +
2.1966 +During language design, common patterns that consume significant development time or computation are placed into the language. Also, any patterns that expose hardware configuration, such as the number and size of pieces of work, should be pulled into the language to aid portability.
2.1967 +
2.1968 +If such design is successful then porting the application reduces to just porting the language. When the language has successfully captured the main computational patterns of the domain, then the application code encapsulates only a small portion of the performance, so it does not need to be tuned. Further, when patterns that expose hardware-motivated choices or hardware-specific commands are in the language, then the application code has nothing that needs to change when the hardware changes.
2.1969 +
2.1970 +For example, HWSim pulls hardware-specific patterns inside the language by handling all inter-core communications inside the language, and also by aggregating multiple elements together on the same core to tune work-unit size.
2.1971 +
2.1972 +The advantage of placing these into the language, instead of application code, is portability and productivity.
2.1973 +
2.1974 +
2.1975 +\subsection{Implementation Details of Embedded-style DSLs}
2.1986 +
2.1987 +?
2.1988 +
2.1989 +Figure [] shows the implementation of the wrapper library for HWSim's send\_and\_idle construct, which sends a communication on the specified outport and then causes the sending element to go idle. Of note is the packaging of information for the runtime: the information is placed into the HWSimSemReq data structure, and then the application work is ended by switching to the runtime. The switch happens via the send\_and\_suspend call, a primitive implemented in assembly that jumps out of the base C language and into the runtime.
2.1990 +
2.1991 +The switch to the runtime can be done in multiple ways. Our proto-runtime uses assembly to manipulate the stack and registers. For the POSIX threads language, as implemented on Linux, the hardware trap instruction is used to switch from the application to the OS, and the OS serves as the runtime that implements the thread behavior.
2.1992 +
2.1993 +The construct implementation uses the core differently for VP based languages than for task based languages.
2.1994 +
2.1995 +For VP based languages, once inside the runtime,  a synchronization construct performs the behavior shown abstractly in Figure []. In essence, a synchronization construct is a variable length delay, which waits for activities outside the calling code to cause specific conditions to become true.  These activities could be actions taken by other pieces of application code, such as releasing a lock, or they could be hardware related, such as waiting for a DMA transfer to complete.
2.1996 +
2.1997 +While one piece of application code (in a VP) is suspended, waiting, other pieces can use the core to perform their work, as long as the conditions for those other pieces are satisfied. Hence, the runtime's construct implementation checks whether conditions are met, and if not, stores the suspended piece (VP). If the construct can change conditions for others, it updates them; for example, the lock-release construct updates state for VPs waiting for the lock. Separately, for VPs whose conditions have been met, the runtime chooses, when a core becomes available, which VP to assign to it.
2.1998 +
2.1999 +These are the two behaviors a construct performs inside the runtime: managing conditions on which work is free, and managing assignment of free work onto cores.
2.2000 +
2.2002 +
2.2003 +
2.2004 +One survey [] discusses DSLs for a variety of domains; the following list of DSLs is taken from that paper:
2.2005 +\begin{itemize}
2.2006 +\item In Software Engineering: Financial products [12, 22, 24], behavior control and coordination [9, 10], software architectures [54], and databases [39].
2.2007 +\item Systems Software:  Description and analysis of abstract syntax trees [77, 19, 51], video device driver specifications [76], cache coherence protocols [15], data structures in C [72], and operating system specialization [63].
2.2008 +\item Multi-Media: Web computing [14, 35, 4, 33], image manipulation [73], 3D animation [29], and drawing [44].
2.2009 +\item Telecommunications: String and tree languages for model checking [48], communication protocols [6], telecommunication switches [50], and signature computing [11].
2.2010 +\item Miscellaneous: Simulation [2, 13], mobile agents [36], robot control [61], solving partial differential equations [26], and digital hardware design [41].
2.2011 +\end{itemize}
2.2012 +
2.2013 +\subsection{Summary of Section}
2.2024 +
2.2025 +This section illustrated the promise of DSLs for solving the issues with parallel programming. The HWSim example showed that well designed parallel DSLs can actually improve productivity and have a low learning curve, as well as reduce the need to touch application code when moving to new target hardware. The section showed that the effort of implementing an embedded-style DSL is mainly that of implementing its runtime and complex domain constructs, and that a well-designed DSL captures most of the performance-critical aspects of an application inside the DSL constructs. Hence, porting effort reduces to just performance-tuning the language (with caveats for some hardware). This effort is, in turn, reused by all the applications that use the DSL.
2.2026 +
2.2027 +The stumbling point of DSLs is their small number of users; after all, how many people write hardware simulators? Perhaps a few thousand people a year write or modify applications suitable for HWSim. That means the effort to implement HWSim has to be so low as to make it no more effort than writing a library, effectively a small percentage of a simulator project.
2.2028 +
2.2029 +The runtime is a major piece of the DSL implementation, so reducing the effort of implementing the runtime goes a long way to reducing the effort of implementing a new DSL.
2.2030 +
2.2031 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2.2032 +\section{Description}
2.2033 +\label{sec:idea}
2.2044 +
2.2045 +?
2.2046 +
2.2047 +
2.2048 +Now that we have made the case that embedded-style DSLs have the potential to solve many parallel programming issues, and that a major obstacle to their uptake is their implementation effort, we describe the proto-runtime concept and show how it addresses this obstacle. As shown, embedded-style DSL implementation and porting effort is mainly that of creating the runtime and implementing the more complex language constructs. We show here that the proto-runtime approach dramatically reduces the effort of creating a DSL runtime, through a number of features.
2.2049 +
2.2050 +
2.2051 +\begin{figure}[ht]
2.2052 +  \centering
2.2053 +  \includegraphics[width = 2in, height = 1.8in]{../figures/PR_three_pieces.pdf}
2.2054 +  \caption{How the proto-runtime approach modularizes the implementation of a runtime. The three pieces are the proto-runtime implementation, an implementation of the language's construct behaviors, and an implementation of the portion of a scheduler that chooses which work is assigned to which processor.}
2.2055 +  \label{fig:PR_three_pieces}
2.2056 +\end{figure}
2.2057 +
2.2058 +
2.2059 +The main feature is the proto-runtime's approach to modularizing the runtime code. As shown in Fig.\ \ref{fig:PR_three_pieces}, it breaks the runtime into three pieces: a cross-language piece, which is the proto-runtime implementation; a piece that implements the language's constructs and plugs into the proto-runtime; and a piece that assigns work onto hardware and also plugs into the proto-runtime.
2.2060 +
2.2061 +The modularization appears to remain valid across parallel languages and execution models, and we present underlying patterns that support this observation.  We analyze the basic structure of a synchronization construct, and point  out how the proto-runtime modularization is consistent with it.
2.2062 +
2.2063 +\subsection{Creating an eDSL}
2.2064 +
2.2065 +
2.2066 +\begin{figure}[ht]
2.2067 +  \centering
2.2068 +  \includegraphics[width = 2in, height = 1.8in]{../figures/eDSL_two_pieces.pdf}
2.2069 +  \caption{An embedded style DSL consists of two parts: a runtime and a wrapper library that invokes the runtime}
2.2070 +  \label{fig:eDSL_two_pieces}
2.2071 +\end{figure}
2.2072 +
2.2073 +As shown in Fig \ref{fig:eDSL_two_pieces}, creating an embedded-style DSL (eDSL) involves two things: creating the runtime, and creating a wrapper library that invokes the runtime and also implements the more complex language constructs.
2.2074 +
2.2075 +As seen in Fig X, a library call that invokes a language construct is normally a thin wrapper that only communicates with the runtime. It places the information to be sent to the runtime into a carrier, then invokes the runtime via a primitive. The primitive suspends the base-language execution and switches the processor over to the runtime code.
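As an illustration, a wrapper call of this shape might look like the following C sketch; all names here (Request, switch\_to\_runtime, dsl\_mutex\_lock) are invented for illustration and are not the proto-runtime's actual interface:

```c
#include <assert.h>

/* Hypothetical sketch of a thin wrapper-library call.  All names are
 * invented for illustration; the real proto-runtime interface differs.
 * The wrapper only gathers the construct's arguments into a request
 * carrier, then invokes a primitive that suspends the base-language
 * code and switches the processor over to the runtime. */

typedef enum { REQ_MUTEX_LOCK, REQ_MUTEX_UNLOCK } ReqType;

typedef struct {              /* carrier sent to the runtime */
    ReqType type;
    void   *construct_data;   /* e.g. pointer to the mutex */
} Request;

/* Stub standing in for the real switch-to-runtime primitive; here it
 * just records the request so the sketch is self-contained. */
static Request last_request;
static void switch_to_runtime(Request *req) { last_request = *req; }

/* The wrapper a DSL user would call: thin, sequential, no locking. */
void dsl_mutex_lock(void *mutex) {
    Request req = { REQ_MUTEX_LOCK, mutex };
    switch_to_runtime(&req);  /* real primitive suspends this VP until
                                 the runtime grants the lock */
}
```

The point of the shape is that the wrapper itself contains no synchronization logic; everything interesting happens after the switch.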
2.2076 +
2.2077 +\subsection{The Proto-Runtime Modularization}
2.2078 +
2.2079 +\subsubsection{Dispatch pattern}
2.2080 +-- standardizes runtime code
2.2081 +-- makes familiar going from one lang to another
2.2082 +-- makes reuse realistic, as demonstrated by VSs taking SSR constructs
2.2083 +
2.2084 +-- show the enums, and the switch table
2.2085 +
2.2086 +-- point out how the handler receives critical info -- the semEnv, req struct and calling slave
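A minimal C sketch of the dispatch pattern called for in these notes; the enum values, struct fields, and handler names are invented stand-ins for the real semEnv, request struct, and calling-slave parameters:

```c
#include <assert.h>

/* Invented stand-ins for the real proto-runtime types. */
typedef enum { REQ_SEND, REQ_RECEIVE, REQ_MUTEX_LOCK } ReqType;
typedef struct { ReqType type; void *data; } Request;  /* req struct    */
typedef struct { int placeholder; } SemEnv;            /* lang state    */
typedef struct { int id; } SlaveVP;                    /* calling slave */

static int sends_handled, locks_handled;
static void handle_send(Request *r, SlaveVP *vp, SemEnv *env)
    { (void)r; (void)vp; (void)env; sends_handled++; }
static void handle_lock(Request *r, SlaveVP *vp, SemEnv *env)
    { (void)r; (void)vp; (void)env; locks_handled++; }

/* The dispatch function the language plugs into the proto-runtime:
 * one switch over the request-type enum routes each request to its
 * handler, passing the critical info -- the request, the calling
 * slave VP, and the language's semantic environment. */
void request_dispatcher(Request *req, SlaveVP *caller, SemEnv *env) {
    switch (req->type) {
      case REQ_SEND:       handle_send(req, caller, env); break;
      case REQ_MUTEX_LOCK: handle_lock(req, caller, env); break;
      default:             break;   /* unknown construct: ignore */
    }
}
```

Adding a construct then reduces to adding an enum value, a case, and a handler, which is what makes handlers easy to borrow from one language to the next.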
2.2087 +
2.2088 +\subsubsection{The Request Handler}
2.2089 +-- cover what a request handler does.. connect it to the wrapper lib, and the info loaded into a request struct.
2.2090 +
2.2091 +-- give code of a request handler.. within on-going example of implementing pthreads, or possibly HWSim, or pick a new DSL
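Pending the worked example chosen above, here is a hedged C sketch of what a request handler for a mutex-style construct could look like; the types and names are assumptions, but it shows the key property that the handler is plain sequential logic (the proto-runtime serializes handler execution, so no locking appears):

```c
#include <stddef.h>

/* Invented types for illustration; real plugin signatures differ. */
typedef struct SlaveVP { struct SlaveVP *next; } SlaveVP;

typedef struct {
    int      is_owned;
    SlaveVP *owner;
    SlaveVP *wait_head, *wait_tail;   /* FIFO of VPs waiting for lock */
} Mutex;

static SlaveVP *resumed;              /* stub: records who was resumed */
static void resume_vp(SlaveVP *vp) { resumed = vp; }

/* Handler for the acquire-lock request: plain sequential logic. */
void handle_mutex_lock(Mutex *m, SlaveVP *requester) {
    if (!m->is_owned) {               /* free: grant immediately */
        m->is_owned = 1;
        m->owner    = requester;
        resume_vp(requester);
    } else {                          /* busy: enqueue; requester
                                         stays suspended */
        requester->next = NULL;
        if (m->wait_tail) m->wait_tail->next = requester;
        else              m->wait_head       = requester;
        m->wait_tail = requester;
    }
}

/* Handler for the release-lock request: hand lock to next waiter. */
void handle_mutex_unlock(Mutex *m) {
    SlaveVP *next = m->wait_head;
    if (next) {
        m->wait_head = next->next;
        if (!m->wait_head) m->wait_tail = NULL;
        m->owner = next;
        resume_vp(next);              /* waiter becomes free to animate */
    } else {
        m->is_owned = 0;
        m->owner    = NULL;
    }
}
```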
2.2092 +
2.2093 +\subsection{Exporting a performance-oriented machine view }
2.2094 +The proto-runtime interface exports a view of the machine that shows performance-critical aspects.  Machines that share the same architectural approach have the same performance-critical aspects, and differ only in the values.
2.2095 +
2.2096 +For example, the interface models cache-coherent shared-memory architectures  as a collection of memory pools connected by networks.  The essential variations among processor-chips are the sizes of the pools, the connections between them, such as which cores share the same L2 cache, and the latency and bandwidth between them.
2.2097 +
2.2098 +Hence, a single plugin can be written that gathers this information from the proto-runtime and uses it when deciding which work to assign to which core.  Such a plugin will then be efficient across all machines that share the same basic architecture.
2.2099 +
2.2100 +This saves significant effort by allowing the same plugin to be reused for all the machines in the category.
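For illustration, a machine view of this shape and an assigner plugin that consults it might be sketched as follows in C; the field and function names are invented, not the actual exported interface:

```c
#include <stddef.h>

/* Invented sketch of a performance-oriented machine view for
 * cache-coherent shared-memory machines: pools, groupings, and
 * latencies, with only the values differing between machines. */
typedef struct {
    int     num_cores;
    int     num_sockets;
    size_t  l2_cache_bytes;          /* pool size                 */
    int    *l2_group_of_core;        /* which cores share an L2   */
    double  intersocket_latency_ns;  /* latency between pools     */
} MachineView;

/* An assigner plugin can prefer a core that shares a cache with the
 * core where the work's data already lives. */
int pick_core_near(const MachineView *mv, int data_home_core) {
    int home_group = mv->l2_group_of_core[data_home_core];
    for (int c = 0; c < mv->num_cores; c++)
        if (mv->l2_group_of_core[c] == home_group)
            return c;                /* first core in the same group */
    return 0;
}
```

Because only the values in the view change across machines in a category, the same plugin logic carries over unmodified.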
2.2101 +
2.2102 +\subsection{Services Provided by the Proto-runtime}
2.2103 +
2.2104 +-- Put services into the low-level piece..  plugins have those available, and inherit lang independent such as debugging, perf counters..  provides effort reduction because lang doesn't have to implement these services.
2.2105 +
2.2106 +-- -- examples of inherited lang services inside current proto-runtime: debugging and perf-tuning..  verification, playback have been started (?)
2.2107 +
2.2108 +-- -- examples of plugin services: creation of base VP, the switch primitives, the dispatch pattern (which reduces effort by cleanly separating code for each construct), handling consistency model (?), handling concurrency
2.2109 +
2.2110 +\subsection{eDSLs talking to each other}
2.2111 +-- show how VSs is an example of three different DSLs, and the H264 code is three different languages interacting (pthreads, OpenMP, StarSs)
2.2112 +
2.2113 +-- make case that proto-runtime is what makes this practical !  Their point of interaction is the common proto-runtime innards, which provides the interaction services.. they all use the same proto-runtime, and all have common proto-runtime objects, which is how the interaction becomes possible.
2.2114 +
2.2115 +\subsection{The Proto-runtime Approach Within the Big Picture}
2.2116 +
2.2117 +-- Give background on industry-wide, how have langs times machines..
2.2118 +-- say that proto-runtime has synergistic advantages within this context. -- repeat that eDSLs talk to each other.
2.2119 +-- give subsubsection on MetaBorg for rewriting eDSL syntax into base lang syntax.
2.2120 +-- bring up the tools issue with custom syntax -- compiling is covered by metaborg re-writing..  can address debugging with eclipse.. should be possible in straight forward way that covers ALL eDSLs.. their custom syntax being stepped through in one window, and stepping through what they generate in separate window (by integrating generation step into eclipse).. even adding eclipse understanding of proto-runtime.. so tracks the sequence of scheduling units..  and shows the request handling in action in third window..
2.2121 +
2.2122 +Preview idea that many players will contribute, and will get people that specialize in creating new eDSLs (such as one of authors)..
2.2123 +-- For them, code-reuse is reality, as supported by VSs example,
2.2124 +-- and the uniformity of the pattern becomes familiar, also speeding up development, as also supported by VSs, HWSim, VOMP, and DKU examples.
2.2125 +-- for those who only create a single eDSL, the pattern becomes a lowering of the learning curve, aiding adoption
2.2126 +
2.2127 +-- Restate and summarize the points below (covered above), showing how they combine to shrink the wide-spot where all the runtimes are.
2.2128 +
2.2129 +-- The low-level part implemented on each machine, exports a view of the machine that shows performance-critical aspects
2.2130 +
2.2131 +-- Collect machines into groups, based on performance critical aspects of hardware.. provides reduction in effort because only one plugin for entire group.
2.2132 +
2.2133 +-- Put services into the low-level piece..  plugins have those available, and inherit lang independent such as debugging..  provides effort reduction because lang doesn't have to implement these services.
2.2134 +
2.2135 +
2.2136 +\section{(outline and notes)}
2.2137 +
2.2138 +-- What a plugin looks like:
2.2139 +
2.2140 +-- -- pattern of parallel constructs.. ideas of Timeline, tie-point, animation, suspension, VP states, constraints, causality, work-units, meta-units, updates in constraint states attached to the meta-units
2.2141 +
2.2142 +-- -- a sync construct is something that creates a tie between two work-units.  So, the logic of the construct simply establishes causality -- the ending of one work-unit causes the freedom to start animation of another.
2.2143 +
2.2144 +-- -- --  Examples: for a mutex, the end of the work-unit that frees the lock causes the freedom to start the work-unit that acquires the lock.  They are causally tied.  The semantics of the construct are the particular conditions existing inside the runtime (in this case the ownership condition of the mutex), what changes those conditions (in this case releasing the lock removes one owner, and acquire-lock registers one as wanting the lock), and how the changes in conditions affect what makes a work-unit free to be animated (in this case, removal of ownership must precede gaining ownership, and being given ownership of the mutex is what makes the waiting work-unit free).
2.2145 +
2.2146 +-- Hence, precisely, the parallelism model of the language defines constraints, which are implemented as state inside the runtime. Constructs provided do a number of things:  signal bringing a set of constraints into existence (create a mutex), signal update to the state of those constraints (release mutex, state desire to acquire), and trigger the runtime to propagate those changes, which results in additional changes to states, including marking meta-units as free to be animated.  cause creation of meta-units (explicitly as in VSs, or via creating entities that trigger creation as in dataflow, or via creating entities that consist of consecutive work-units as in pthreads).
2.2147 +
2.2148 +
2.2149 +-- Recipe for how to make the language plugin: time reduction is part due to simplifying the parallelism construct logic..
2.2150 +
2.2151 +
2.2152 +
2.2153 +
2.2154 +\subsection{The Cross-language Patterns Behind the Proto-runtime}
2.2155 +
2.2156 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
2.2157 +
2.2158 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
2.2159 +
2.2160 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
2.2161 +
2.2162 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
2.2163 +
2.2164 +
2.2165 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
2.2166 +
2.2167 +An application switches to the runtime, which does scheduling work then switches back to application code.
2.2168 +
2.2169 +
2.2170 +\subsection{Some Definitions}
2.2171 +
2.2172 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
2.2173 +
2.2174 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
2.2175 +
2.2176 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
2.2177 +
2.2178 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
2.2179 +
2.2180 +
2.2181 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
2.2182 +
2.2183 +We adopt the concepts of work-unit, virtual processor (VP), animation, and tie-point as discussed in a previous paper []. A work-unit is the trace of instructions executed between two successive switches to the runtime, along with the data consumed and produced during that trace.  A virtual processor is defined as being able to animate either the code of a work-unit or else another VP, and has state that it uses during animation, organized as a stack.  Animation is defined as causing the time of a virtual processor to advance, which is equivalent to causing state changes according to instructions; suspension halts animation, and consequently causes the end of a work-unit (a more complete definition of animation can be found in the dissertation of Halle []).  A tie-point connects the end of one work-unit to the beginning of one in a different VP, so a tie-point represents a causal relationship between two work-units and establishes an ordering between them, effectively tying the time-line of the VP animating one to the time-line of the VP animating the other.
2.2184 +
2.2185 +In addition, we introduce a definition of the word task, which is a single work-unit coupled to a virtual-processor that comes into existence to animate the work-unit and dissipates at completion of the work-unit.  By definition of work-unit, a task cannot suspend, but rather runs to completion.  If the language defines an entity that has a timeline that can be suspended by switching to the runtime, then such an entity is not a task. Pure Dataflow[] specifies tasks that fit our definition.
2.2186 +
2.2187 +\subsection{Handling Memory Consistency Models}
2.2188 +
2.2189 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
2.2190 +
2.2191 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
2.2192 +
2.2193 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
2.2194 +
2.2195 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
2.2196 +
2.2197 +
2.2198 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
2.2199 +
2.2200 +Weak memory models can cause undesired behavior when work-units on different cores communicate through shared variables.  Specifically, the receiving work-unit can see memory operations complete in a different order than the code of the sending work-unit specifies.
2.2201 +
2.2202 +For example, consider a proto-runtime implemented on shared memory hardware that has a weak consistency model, along with a language that implements a traditional mutex lock.  All memory operations performed in the VP that releases the lock should be seen as complete by the VP that next acquires the lock.
2.2203 +
2.2204 +It is up to the proto-runtime to enforce this, using hardware primitives.  It has to ensure that all memory operations performed by a task or VP before switching to the runtime are completed before any dependent task or VP is switched in from the runtime.  More precisely, the proto-runtime has to ensure that all memory operations performed by a work-unit are visible, in program order, to any tied work-units. In some cases the language plugin has to alert the proto-runtime of the causality between work-units.
2.2205 +
2.2206 +
2.2207 +The proto-runtime does not, however, protect application code that attempts to communicate between VPs or tasks directly, without using a parallelism construct to protect the communication.
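The ordering requirement can be illustrated with C11 atomics; this is a sketch of the guarantee only, not the proto-runtime's implementation, which uses machine-specific fence and atomic instructions:

```c
#include <stdatomic.h>

/* Sketch of the release/acquire pairing the proto-runtime provides:
 * a release fence before switching out of a work-unit, paired with an
 * acquire fence before switching into a dependent work-unit, makes
 * all prior writes of the first visible to the second. */
static atomic_int handoff;
static int shared_data;

void end_of_work_unit(void) {            /* runs on releasing core */
    shared_data = 42;                    /* plain write in the work-unit */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&handoff, 1, memory_order_relaxed);
}

int start_of_dependent_unit(void) {      /* runs on acquiring core */
    while (!atomic_load_explicit(&handoff, memory_order_relaxed))
        ;                                /* runtime would block, not spin */
    atomic_thread_fence(memory_order_acquire);
    return shared_data;                  /* guaranteed to see 42 */
}
```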
2.2208 +
2.2209 +
2.2210 +
2.2211 +=======
2.2212 +
2.2213 +  I plan to explain VMS as a universal pattern that exists in all runtimes: that is, that the application switches to runtime, which does a scheduling decision and then switches back.  I'll explain it first with just master and slaves, leaving out the core\_loop.  Explain it as a normal runtime that has had two key pieces removed and replaced with interfaces.  The language supplies the missing pieces.  Then, introduce the core\_loop stuff as a performance enhancement used when lock acquisition dominates (as it does on the 4 socket 40 core machine).
2.2214 +   Next, give HWSim as an example of a real domain specific (it's working, ref manual attached), and focus on how the modularity allowed pulling constructs from other languages (singleton and atomic), and a breakdown of implementation time vs design time, and so on.  Highlight how VMS's features for productivity and encapsulation solve the practical problems for domain-specific languages.
2.2215 +   Finally, show that VMS performance is good enough, by going head-to-head with pthreads and OpenMP (doing a VMS OpenMP implementation now).  And also StarSs if I have time.  I'll run overhead-measuring on them, and also regular benchmarks.
2.2216 +
2.2217 +=================
2.2218 +
2.2219 +\subsection{The patterns}
2.2220 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
2.2221 +
2.2222 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
2.2223 +
2.2224 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
2.2225 +
2.2226 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
2.2227 +
2.2228 +
2.2229 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
2.2230 +
2.2231 +
2.2232 +Soln: modularize runtime, to reduce part have to mess with, hide part that has low-level details, reuse low-level tuning effort, and reuse lang-spec parts.
2.2233 +
2.2234 +Benefits: lang impl doesn't have to touch low-level details, inherit centralized services, can reuse code from other languages to add features.
2.2235 +
2.2236 +Performance must be high, or the labor savings don't matter.  By isolating the low-level details inside the proto-runtime, they can be intensively tuned, then all the languages inherit the effort.
2.2237 +
2.2238 +Part of what makes this so easy is the dispatch pattern.. adding a construct reduces to adding into switch and writing handler..  borrow constructs by taking the handler from the other lang.
2.2239 +
2.2240 +By isolating the low-level details inside the proto-runtime, they can be intensively tuned, then all the languages inherit the effort.  Compare that to current practices, where the runtime code is monolithic.. each language has to separately modify the runtime, understanding and dealing with the concurrency, and then on a new machine, each language has to re-tune the low-level details, worrying about the consistency model on that machine, how its particular fence and atomic instructions work, and so on.
2.2241 +We spent 2 months performance tuning the current version, but only 18 hours implementing VSs on top of it, and VSs inherited the benefit from all that effort.  So did VOMP, and SSR, and VCilk, and so on..  each time we improved the proto-runtime, all the languages improved, with no effort on the part of the language creator.
2.2242 +
2.2243 +
2.2244 +\subsubsection{Views of synchronization constructs}
2.2245 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
2.2246 +
2.2247 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
2.2248 +
2.2249 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
2.2250 +
2.2251 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
2.2252 +
2.2253 +
2.2254 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
2.2255 +
2.2256 +One view of sync constructs is that they are calls that take a variable length of time. The
2.2257 +basic hardware achieves this by stalling the pipeline.
2.2258 +
2.2259 +Another view is that they mark the boundary of a communication made via shared read/write.  A load or store of a single location has a precise boundary enforced by the hardware, but if a pipeline needs to load, modify, then write a single location atomically, additional hardware is required to make the multiple primitive load/store operations appear as a single operation.
2.2260 +
2.2261 +Moving up to the application level, the same pattern exists: an operation the application wants to do may involve many loads and stores, but it wants the collection to appear as a single indivisible operation.  So the application-level equivalent of a load or store involves multiple memory locations but is to be treated as a single indivisible operation.  This requires the application-level equivalent of the hardware that made the read-modify-write into a single indivisible operation.  That equivalent is what a synchronization construct is.  The reason a sync construct takes a variable amount of time is that it  waits until all other indivisible operations that might conflict have completed.
2.2262 +
2.2263 +Another way to think of the sync construct is that it enforces sharp communication boundaries.  The multiple read and write operations are treated as a single communication with the shared-state.  If any other part of the application sees only part of the communication, it sees something inconsistent and thus wrong.  So the sync constructs ensure that communications are complete, so the parts of the application only see complete communications from other parts.
2.2264 +
2.2265 +\subsubsection{Universal Runtime Patterns}
2.2266 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
2.2267 +
2.2268 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
2.2269 +
2.2270 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
2.2271 +
2.2272 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
2.2273 +
2.2274 +
2.2275 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
2.2276 +
2.2277 +Unified pattern within parallel languages: create multiple timelines, then control their relative progress, and control the location where each chunk of progress takes place.
2.2278 +
2.2279 +Another universal pattern: code runs, switches to runtime, some point later switches back to code, making application run be a collection of trace segments bounded by runtime calls.
2.2280 +The runtime tracks constraints (dependencies) among units, creates and destroys units, and assigns ready units to hardware.
2.2281 +
2.2282 +Units have a life-line, which is fundamental to parallel computation, as demonstrated in a paper by some of the authors [].
2.2283 +
2.2284 +Every unit has a meta-unit that represents it in the runtime. A  unit is defined as the trace of application code that exists between two scheduling decisions. Looking at this in more detail, every runtime has some form of internal bookkeeping state for a unit, used to track constraints on it and make decisions about when and where to execute. This exists even if that state is just a pointer to a function that sits in a queue. We call this bookkeeping state for a unit the meta-unit.
2.2285 +
2.2286 +Each unit also has a life-line, which progresses so: creation of the meta-unit \pointer\ state updates that affect constraints on the unit \pointer\ the decision to animate the unit \pointer\ movement of the meta-unit plus data to the physical resources that do the animation \pointer\ animation of the unit, which does the work \pointer\ communication of the state update that the unit has completed and the hardware is free \pointer\ constraint updates within the runtime, possibly causing creation of new meta-units or freeing other meta-units to be chosen for animation.  This repeats for each unit. Each step is part of the model.
2.2287 +
2.2288 +Note a few implications: first, many activities internal to the runtime are part of a unit's life-line, and take place when only the meta-unit exists, before or after the work of the actual unit; second, communication that is internal to the runtime is part of the unit life-line, such as state updates; third, creation may be implied, such as in pthreads, or triggered such as in dataflow, or be by explicit command such as in StarSs, and once created, a meta-unit may languish before the unit it represents is free to be animated.
2.2289 +
2.2290 +\subsubsection{Putting synchronization constructs together with universal runtime patterns}
2.2291 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
2.2292 +
2.2293 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
2.2294 +
2.2295 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
2.2296 +
2.2297 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
2.2298 +
2.2299 +
2.2300 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
2.2301 +
2.2302 +Putting these together gives us that any parallelism construct with synchronization behavior causes the end of a work-unit and a switch to the runtime.  The code following the construct is a different work-unit, which will begin after the constraint implied by the construct is satisfied.
2.2303 +
2.2304 +The runtime is made up of the infrastructure for the constraints and assignment, such as communicating bookkeeping state between cores and protecting internal runtime updates of shared information, plus the logic of the constructs and the logic of choosing an assignment of work to cores.
2.2305 +
2.2306 +For large machines, the infrastructure dominates the time to execute a parallelism construct, while for smaller machines, like single-socket, the logic of constructs and assignments has a chance to be significant.
2.2307 +
2.2308 +\begin{figure}[ht]
2.2309 +  \centering
2.2310 +  \includegraphics[width = 2in, height = 1.8in]{../figures/SCG_stylized_for_expl.pdf}
2.2311 +  \caption{Something to help understanding}
2.2312 +  \label{fig:SCG_expl}
2.2313 +\end{figure}
2.2314 +
2.2315 +
2.2316 +
2.2317 +
2.2318 +%%%%%%%%%%%%%%%%%%%%%
2.2319 +\section{The Details}
2.2320 +[[Hypothesis: Embedded-style DSLs -> high productivity + low learning curve + low disruption + low app-port AND quick time to create + low effort to lang-port + high perf across targets]]
2.2321 +[[Claims: modularize runtime, mod is fund patterns, mod sep lang logic from RT internals, mod makes internal reusable & lang inherit internal perf tune & inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
2.2322 +
2.2323 +The interfaces between lang logic and proto-runtime.
2.2324 +
2.2325 +Demonstrate: modular runtime, how reduces part have to mess with, hides part that has low-level details, reuses low-level tuning effort, and reuses lang-spec parts.
2.2326 +
2.2327 +Demonstrate Benefits: lang impl doesn't touch low-level details, inherits centralized services (debug support), reuses code from other languages to add features.
2.2328 +
2.2329 +\subsection{Reuse of Language Logic}
2.2330 +[[Hypothesis: Embedded-style DSLs -\textgreater\ high productivity + low learning curve + low app-port + low disruption]]
2.2331 +
2.2332 +[[Bridge: Few users-\textgreater\ must be quick time to create + low effort to lang-port + high perf across targets]]
2.2333 +
2.2334 +[[Bridge: effort to create =  runtime + effort port = runtime + perf on new target = runtime]]
2.2335 +
2.2336 +[[Bridge: big picture = langs * runtimes -\textgreater runtime effort critical]]
2.2337 +
2.2338 +
2.2339 +[[Claims: given big picture, runtime effort minimized -\textgreater  modularize runtime, mod works across langs bec. fund patterns, mod sep lang logic from RT internals, mod makes internal reusable + lang inherit internal perf tune +inherit centralized serv, mod makes lang logic sequential, mod makes constructs reusable one lang to next, mod causes lang assigner to own HW]]
2.2340 +
2.2341 +Demonstrate reuse of language logic:
2.2342 +All the languages have copied singleton, atomic, critical section and transaction. In VOMP, took the task code from VSs; in VSs, took the send and receive code from SSR..  for DKU, took the code almost verbatim from an earlier incarnation of these ideas and welded it into SSR, and took VSs tasks and put them into SSR. Thus, the circle completes.. VSs took from SSR, now SSR takes from VSs..  pieces and parts are being borrowed all over the place and welded in where they're needed.
2.2343 +
2.2344 +Part of what makes this so easy is the dispatch pattern.. adding a construct reduces to adding into switch and writing handler..  borrow constructs by taking the handler from the other lang.
2.2345 +
2.2346 +Another part is that code for the constructs is isolated from concurrency details, which are inside the proto-runtime. All the dynamic system issues, and best way to impl locks, and need for fences, and so on is isolated from the construct logic.  This isolation is also how porting effort is lowered (or in many cases eliminated), and is how runtime performance is kept high.
2.2347 +
2.2348 +?
2.2349 +
2.2350 +Performance must be high, or the labor savings don't matter.  By isolating the low-level details inside the proto-runtime, they can be intensively tuned, then all the languages inherit the effort.  Compare that to current practices, where the runtime code is monolithic.. each language has to separately modify the runtime, understanding and dealing with the concurrency, and then on a new machine, each language has to re-tune the low-level details, worrying about the consistency model on that machine, how its particular fence and atomic instructions work, and so on.
2.2351 +We spent 2 months performance tuning the current version, but only 18 hours implementing VSs on top of it, and VSs inherited the benefit from all that effort.  So did VOMP, and SSR, and VCilk, and so on..  each time we improved the proto-runtime, all the languages improved, with no effort on the part of the language creator.
2.2352 +
2.2353 +?
2.2354 +
2.2355 +In addition to runtime performance, application level performance must be high.  The runtime's performance only affects overhead, and so is only a factor for small work-unit (task) sizes.  But data affinity affects performance for all work.
2.2356 +
2.2357 +The proto-runtime approach partially addresses this by giving the language the opportunity to directly control placement of work.  This isn't possible when building on top of threads, because the scheduling is in a separate, lower-level layer where the assignment of work to cores is made in isolation, blind to language constructs and
2.2358 +other application features.
2.2359 +
2.2360 +
2.2361 +
2.2362 +
2.2363 +%%%%%%%%%%%%%%%%%%%%%
2.2364 +\section{Measurements}
2.2365 +
2.2366 +\subsection{Implementation time}
2.2367 +
2.2368 +
2.2369 +\subsection{Runtime and Application Performance}
2.2370 +
2.2371 +
2.2372 +%%%%%%%%%%%%%%%%%%%%%
2.2373 +\section{Related Work}
2.2374 +
2.2375 +
2.2376 +%%%%%%%%%%%%%%%%%%%%%
2.2377 +\section{Conclusion and Future Work}
2.2378 +\label{sec:conclusion}
2.2379 +
2.2380 +
2.2381 +
2.2382 +\end{document}
2.2383 +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2.2384 +Here is an example of netlist creation:
2.2385 +
2.2386 +The circuit has two elements, each with one input port, one output port, and a single activity-type. The elements are cross-coupled, so output port of one connects to input port of the other.  The input port has the  activity-type attached as its trigger.  The activity is empty, and just sends a NULL message on the output port.  The activity's duration in simulated time and the resulting communication's flight duration in simulated time are both constants.
2.2387 +
2.2388 + Note that the HWSimElem data type is generic.  An element is specialized by declaring in-ports and out-ports; behavior is attached by connecting activity types to the in-ports.
2.2389 +
2.2390 +First, here is the top-level function that creates and returns the netlist structure:
2.2391 +
2.2392 +To use HWSim, one creates a netlist composed of elements and communication paths connecting them.  An element has a number of in-ports and out-ports, and a communication path connects an out-port of the source element to an in-port of the destination element.  The in-port has an activity attached, which in turn has a behavior function and a timing function, both of which are triggered by the arrival of a communication.  The behavior function has the local persistent state of the element available to use, and can generate out-going communications.  The timing function calculates how much Guest (simulated) time the behavior spanned.  In addition, communication paths have an attached function that calculates the time from sending until arrival of the communication.  Both the behavior and timing functions are provided by the application programmer.  The entire simulator application is composed of those three things: the netlist, the behavior functions, and the timing functions, and all are sequential code.
2.2393 +
2.2394 +The embedded DSL consists of standard data structures, such as netlist, in-port, and out-port, that the application must use in the language-defined way, and a small number of language calls, such as send_comm and end_behavior.  The advancement of simulated time is implied, and the parallelism is implied.  The only parallelism-related restriction is that a behavior function may only use data local to the element it is attached to.  If state in the hardware is shared, such as registers or memory, then other elements access that state by sending communications to the element that contains the state.  Parallelism is created within the system by out-ports that connect to multiple destination in-ports, and by behavior functions that each generate multiple output communications.
2.2395 +
2.2396 +
2.2397 +\begin{small}\begin{verbatim}
2.2398 +HWSimNetlist *
2.2399 +createPingPongNetlist()
2.2400 + { HWSimNetlist       *netlist;
2.2401 +   HWSimElem         **elems;
2.2402 +   HWSimActivityType **activityTypes;
2.2403 +   HWSimCommPath     **commPaths;
2.2404 +   int32               numElems, numActivityTypes, numCommPaths;
2.2405 +\end{verbatim}\end{small}
2.2406 +
2.2407 +The first thing to do is create the netlist structure, which holds three things: element structs, activity type structs, and communication path structs. It also has two collections of pointers to the traces collected during the run, but these are handled internally by HWSim.
2.2408 +\begin{small}\begin{verbatim}
2.2409 +   netlist = malloc( sizeof(HWSimNetlist) );
2.2410 +
2.2411 +   numElems         = 2;
2.2412 +   elems            = malloc( numElems * sizeof(HWSimElem *) );
2.2413 +
2.2414 +   numCommPaths     = 2;
2.2415 +   commPaths        = malloc( numCommPaths * sizeof(HWSimCommPath *) );
2.2416 +
2.2417 +   numActivityTypes = 1;
2.2418 +   activityTypes    = malloc( numActivityTypes * sizeof(HWSimActivityType *) );
2.2419 +
2.2420 +   netlist->numElems         = numElems;
2.2421 +   netlist->elems            = elems;
2.2422 +   netlist->numCommPaths     = numCommPaths;
2.2423 +   netlist->commPaths        = commPaths;
2.2424 +   netlist->numActivityTypes = numActivityTypes;
2.2425 +   netlist->activityTypes    = activityTypes;
2.2426 +\end{verbatim}\end{small}
2.2427 +
2.2428 +Now, create the activity types.  During the run, an activity instance is created each time a communication arrives on an in-port. The activity instance is a data structure that points to the activity type.  The activity type holds the pointers to the behavior and timing functions.
2.2429 +\begin{small}\begin{verbatim}
2.2430 +      //have to create activity types before create elements
2.2431 +      //PING_PONG_ACTIVITY is just a #define for readability
2.2432 +   netlist->activityTypes[PING_PONG_ACTIVITY] = createPingPongActivityType();
2.2433 +\end{verbatim}\end{small}
2.2434 +
2.2435 +Next, create the elements, and pass the netlist structure to the creator. It will take pointers to activity types out of the netlist and place them into the in-ports of the elements.
2.2436 +\begin{small}\begin{verbatim}
2.2437 +   elems[0] = createAPingPongElem( netlist ); //use activity types from netlist
2.2438 +   elems[1] = createAPingPongElem( netlist );
2.2439 +\end{verbatim}\end{small}
2.2440 +
2.2441 +Now, the reset in-port of one of the elements has to be set up to trigger an activity. Every element has a reset in-port, but normally they are set to NULL activity type. Here, we want only one of the two elements to have an activity triggered when the reset signal is sent to start the simulation.
2.2442 +
2.2443 +Note that during initialization all the elements become active, each with its own timeline, but unless an activity is triggered in them they remain idle, with their timeline suspended and making no progress.  Only those that have an activity type attached to their reset in-port will begin to do something in simulated time when the simulation starts.
2.2444 +\begin{small}\begin{verbatim}
2.2445 +      //make reset trigger an action on one of the elements
2.2446 +   elems[1]->inPorts[-1].triggeredActivityType =
2.2447 +              netlist->activityTypes[PING_PONG_ACTIVITY];
2.2448 +\end{verbatim}\end{small}
2.2449 +
2.2450 +Now, connect the elements together by creating commPath structures.  A comm path connects the out-port of one element to the in-port of another; a given port may have many comm paths attached.  However, an in-port has only one activity type attached, and all incoming communications fire that same activity.  There are multiple kinds of activity, including kinds that have no timing and so can act as a dispatcher; these end themselves with a continuation activity, chosen according to the code in the behavior function.
2.2451 +
2.2452 +This code sets fixed timing on the comm paths. It also uses a macro for setting the connections. The format is: sending elem-index, out-port, dest elem-index, in-port:
2.2453 +\begin{small}\begin{verbatim}
2.2454 +      //elem 0, out-port 0 to elem 1, in-port 0
2.2455 +   commPaths[0]= malloc(sizeof(HWSimCommPath));
2.2456 +   setCommPathValuesTo(commPaths[0],0,0,1,0);
2.2457 +   commPaths[0]->hasFixedTiming  = TRUE;
2.2458 +   commPaths[0]->fixedFlightTime = 10; //all time is stated in (integer) units
2.2459 +
2.2460 +      //elem 1, out-port 0 to elem 0, in-port 0
2.2461 +   commPaths[1]= malloc(sizeof(HWSimCommPath));
2.2462 +   setCommPathValuesTo(commPaths[1], 1,0,0,0);
2.2463 +   commPaths[1]->hasFixedTiming  = TRUE;
2.2464 +   commPaths[1]->fixedFlightTime = 10; //all time is stated in (integer) units
2.2465 +\end{verbatim}\end{small}
2.2466 +
2.2467 +Having finished building the netlist, return it:
2.2468 +\begin{small}\begin{verbatim}
2.2469 +   return netlist;
2.2470 + }
2.2471 +\end{verbatim}\end{small}
2.2472 +
2.2473 +Here is the macro that sets the connections inside a comm path struct:
2.2474 +\begin{small}\begin{verbatim}
2.2475 +#define setCommPathValuesTo( commPath, fromElIdx, outPort, toElIdx, inPort)\
2.2476 +do{\
2.2477 +   commPath->idxOfFromElem     = fromElIdx; \
2.2478 +   commPath->idxOfFromOutPort  = outPort; \
2.2479 +   commPath->idxOfToElem       = toElIdx; \
2.2480 +   commPath->idxOfToInPort     = inPort; \
2.2481 + }while(0) //do-while wrapper lets the macro be used like a single statement
2.2482 +\end{verbatim}\end{small}
2.2483 +
2.2484 +Creating an element involves creating arrays for the in-ports and out-ports, then configuring the in-ports.  The out-ports are automatically filled in during simulation start-up, by HWSim. The most interesting feature is that each in-port is assigned an activity type, which all arriving communications trigger. During the simulation, each incoming communication creates an activity instance, which points to this triggered activity type. The behavior and timing of the instance are calculated by the behavior and timing functions in the activity type. Notice that the activity type pointers are taken from the netlist, so they have to be created before creating the elements.
2.2485 +\begin{small}\begin{verbatim}
2.2486 +HWSimElem *
2.2487 +createAPingPongElem( HWSimNetlist *netlist )
2.2488 + { HWSimElem *elem;
2.2489 +   elem = malloc( sizeof(HWSimElem) );
2.2490 +   elem->numInPorts  = 1;
2.2491 +   elem->numOutPorts = 1;
2.2492 +   elem->inPorts = HWSim_ext__make_inPortsArray( elem->numInPorts );
2.2493 +   elem->inPorts[-1].triggeredActivityType = IDLE_SPAN; //reset port
2.2494 +   elem->inPorts[0].triggeredActivityType  = netlist->activityTypes[PING_PONG_ACTIVITY];
2.2495 +        return elem;
2.2496 + }
2.2497 +\end{verbatim}\end{small}
2.2498 +
2.2499 +Creating an activity type involves setting the pointers to the behavior and timing functions, which are all defined together in a separate directory.  An activity may have its behavior set to NULL, or its timing set to NULL, and may have fixed timing; the structure has flags to state which combination applies.
2.2500 +\begin{small}\begin{verbatim}
2.2501 +HWSimActivityType *
2.2502 +createPingPongActivityType( )
2.2503 + { HWSimActivityType *pingPongActivityType;
2.2504 +   pingPongActivityType = malloc( sizeof(HWSimActivityType) );
2.2505 +
2.2506 +   pingPongActivityType->hasBehavior   = TRUE;
2.2507 +   pingPongActivityType->hasTiming     = TRUE;
2.2508 +   pingPongActivityType->timingIsFixed = TRUE;
2.2509 +   pingPongActivityType->fixedTime     = 10;
2.2510 +   pingPongActivityType->behaviorFn    = &pingPongElem_PingActivity_behavior;
2.2511 +   return pingPongActivityType;
2.2512 + }
2.2513 +\end{verbatim} \end{small}
2.2514 +
2.2515 +
2.2516 +=========
2.2517 +
2.2518 +All behavior functions take a pointer to the activity instance whose behavior they are executing.  The instance contains a pointer to the element, and most behaviors will use the element's elemState field, which holds all the persistent state of the element that remains between activities.
2.2519 +
2.2520 +Here is the behavior function from the ping-pong example:
2.2521 +\begin{small}\begin{verbatim}
2.2522 +void
2.2523 +pingPongElem_PingActivity_behavior( HWSimActivityInst *activityInst )
2.2524 + {    //NO_MSG is #define'd to NULL, and PORT0 to 0
2.2525 +   HWSim__send_comm_on_port_and_idle( NO_MSG, PORT0, activityInst );
2.2526 + }
2.2527 +\end{verbatim}\end{small}
2.2528 +
2.2529 +There are four ways a behavior can end:
2.2530 +\begin{description}
2.2531 +\item end, no continuation:
2.2532 +\begin{small}\begin{verbatim} HWSim__end_activity_then_idle( HWSimActivityInst *endingActivityInstance )\end{verbatim}\end{small}
2.2533 +\item end, with continuation:
2.2534 +\begin{small}\begin{verbatim} HWSim__end_activity_then_cont( HWSimActivityInst *endingActivityInstance,
2.2535 +                                HWSimActivityType *continuationActivityType)\end{verbatim}\end{small}
2.2536 +\item end by sending a communication, with no continuation:
2.2537 +\begin{small}\begin{verbatim} HWSim__send_comm_on_port_then_idle( void *msg, int32 outPort,
2.2538 +                                HWSimActivityInst *endingActivityInstance)\end{verbatim}\end{small}
2.2539 +\item end by sending a communication, with continuation:
2.2540 +\begin{small}\begin{verbatim} HWSim__send_comm_on_port_then_cont( void *msg, int32 outPort,
2.2541 +                                HWSimActivityInst *endingActivityInstance
2.2542 +                                HWSimActivityType *continuationActivityType)\end{verbatim}\end{small}
2.2543 +
2.2544 +
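To make the continuation variants concrete, here is a sketch of a dispatcher-style behavior that ends by choosing its continuation in code.  The types and the stubbed HWSim__end_activity_then_cont below are simplified stand-ins so the sketch is self-contained; the element and activity names are invented:

```c
#include <stddef.h>

/* Simplified stand-in types -- the real HWSim definitions differ. */
typedef struct HWSimActivityType { const char *name; } HWSimActivityType;
typedef struct { void *elemState; } HWSimElem;  /* persistent element state */
typedef struct {
    HWSimElem         *elem;
    HWSimActivityType *continuation;  /* recorded by the stub below */
} HWSimActivityInst;

/* Stub of HWSim__end_activity_then_cont: just records the chosen
   continuation instead of handing it to the simulator core. */
static void HWSim__end_activity_then_cont(HWSimActivityInst *inst,
                                          HWSimActivityType *contType)
 { inst->continuation = contType; }

static HWSimActivityType readActivity  = { "read" };
static HWSimActivityType writeActivity = { "write" };

/* Dispatcher behavior: has no timing of its own; it inspects element
   state and picks the continuation activity accordingly. */
void dispatchElem_Dispatch_behavior(HWSimActivityInst *activityInst)
 { int isWrite = *(int *)activityInst->elem->elemState;
   if (isWrite)
      HWSim__end_activity_then_cont(activityInst, &writeActivity);
   else
      HWSim__end_activity_then_cont(activityInst, &readActivity);
 }
```

The same shape applies to HWSim__send_comm_on_port_then_cont, with the outgoing communication generated before the continuation is chosen.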
2.2545 +=============
2.2546 +
2.2547 +
2.2548 +\subsection{Activity Timing Functions}
2.2549 +All activity timing functions take a pointer to the activity instance whose timing they are calculating.  The instance contains a pointer to the element the activity is in.  The behavior function is free to communicate with the timing function by leaving special data inside the element state; the timing function might also simply depend on the current state of the element.
2.2550 +
2.2551 +Here's an example:
2.2552 +\begin{small}\begin{verbatim}
2.2553 +HWSimTimeSpan
2.2554 +sampleElem_sampleActivity_timing( HWSimActivityInst *activityInst )
2.2555 + {
2.2556 +   return doSomethingWithStateOfElem( activityInst->elem->elemState );
2.2557 + }
2.2558 +\end{verbatim}\end{small}
2.2559 +
2.2560 +\subsection{Calculating the time-in-flight of a communication path}
2.2561 +
2.2562 +The timing function for a communication path is similar to that of an activity, except that the timing might also depend on configuration data or state stored inside the comm path struct, so the comm path is passed to the timing function as well.
2.2563 +
2.2564 +\begin{small}\begin{verbatim}
2.2565 +HWSimTimeSpan
2.2566 +commPath_TimeSpanCalc( HWSimCommPath *commPath, HWSimActivityInst *sendingActivity )
2.2567 + { return doSomethingWithStateOfPathAndElem( commPath, sendingActivity->elem->elemState );
2.2568 + }
2.2569 +\end{verbatim}\end{small}
2.2570 +
2.2571 +
2.2572 +
2.2573 +

     3.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
3.2 +++ b/0__Papers/PRT/PRT__Tutorial_explanation_teaching/latex/Paper_Design_2.txt	Tue Sep 17 06:34:05 2013 -0700
3.3 @@ -0,0 +1,32 @@
3.4 +
3.5 +======
3.6 +
3.7 +Details of the VMS interface, details of its implementation on multi-core, and details of the differences on different machines.
3.8 +
3.9 +The wrapper-lib calls a VMS-supplied primitive that suspends the virtual processor calling the lib, and sends a request to VMS.  VMS calls a language-supplied plugin to handle requests -- this is the part of the scheduler that handles constraints -- it determines which virtual processors must remain suspended, and which are free to be re-animated.
3.10 +
3.11 +The language is implemented either as a collection of wrapper-lib calls embedded into the base language, or as custom syntax that uses the VMS-supplied primitive to suspend virtual processors and send requests to VMS.
3.12 +
3.13 +
3.14 +VMS is invisible to the application; only language constructs are visible.  From the application-programmer point of view, the embedded version looks like a function call, although the data structure of the virtual processor animating the code has to be passed as a parameter to the wrapper-lib call.
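A minimal sketch of what such a wrapper-lib call might look like.  All names here (SSR__send_to, VMS__send_and_suspend, the struct fields, the SSR__SEND tag) are illustrative assumptions rather than the actual VMS API, and the primitive is stubbed so the sketch is self-contained:

```c
#include <stddef.h>

typedef struct { int id; int isSuspended; } VirtProc;
typedef struct { int type; void *msg; VirtProc *receiver; } VMSReqst;

static VMSReqst lastReqst;   /* stub state, standing in for VMS internals */

/* Stub of the VMS-supplied primitive: suspend the calling virtual
   processor and hand the request to VMS, which will later pass it to
   the language's request handler. */
static void VMS__send_and_suspend(VirtProc *callingVP, VMSReqst *req)
 { callingVP->isSuspended = 1;
   lastReqst = *req;
 }

#define SSR__SEND 1   /* request-type tag, for illustration */

/* The wrapper-lib call: to the application it looks like an ordinary
   function call, except the virtual processor animating the code is
   passed as the final parameter. */
void SSR__send_to(void *msg, VirtProc *receiverVP, VirtProc *callingVP)
 { VMSReqst req = { SSR__SEND, msg, receiverVP };
   VMS__send_and_suspend(callingVP, &req);
 }
```

The application never touches VMS directly: it sees only the language construct, while the construct's parameters travel to the runtime inside the request.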
3.15 +
3.17 +
3.18 +The wrapper-lib call is standard library code that is loaded along with the application executable.
3.19 +
3.20 +However, VMS primitives may be hardware-implemented, loaded as OS modules, or provided as dynamic or static libraries.  They are naturally custom instructions, but may be emulated by software.
3.21 +
3.22 +The interface between application-executable and language-runtime is the VMS-primitive that sends a request to VMS.  The language-runtime receives the request under control of VMS, which calls a language-supplied request-handling function and passes the request as a parameter.  This passive behavior of the request handler leaves control-flow inside VMS, which is part of hiding concurrency from the language-runtime implementation.
3.23 +
3.24 +The interface between the runtime and VMS is VMS's plugin API.  The runtime is implemented as two functions, whose pointers are handed to VMS.  VMS then controls the flow of execution.  When a request is ready for the runtime, VMS calls the request-handler function, and when a spot on hardware is free for work, VMS calls the scheduler-assign function.  Hence, the language implements its runtime as two isolated functions.  By keeping control-flow inside VMS, the language-specific portion of the runtime is simplified.
3.25 +
3.26 +This structure is also the reason VMS encourages reuse of scheduler code.  The VMS API separates control flow from scheduling, so scheduling code is isolated, with well-defined interfaces.  Scheduling is then further sub-divided into modules: constraint-management (i.e., enforcing dependencies) and choosing the physical location at which to place work.  Each has its own well-defined interface, and they communicate with each other via VMS-managed shared state.
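The two-module structure can be sketched as follows.  All names (handleRequest, assignWork, SharedSchedState, the dependency-count convention) are our own illustrative assumptions, not the actual VMS plugin API; VMS owns control flow and calls these functions, which only read and write the shared state:

```c
#include <stddef.h>

#define MAX_READY 16

typedef struct { int id; int numUnmetDeps; } VirtProc;

typedef struct {                 /* VMS-managed state shared by the modules */
    VirtProc *readyQ[MAX_READY];
    int numReady;
} SharedSchedState;

/* Constraint-management module: called by VMS when a request arrives.
   Here the request is a dependency being satisfied; when none remain,
   the virtual processor becomes free to be re-animated. */
void handleRequest(VirtProc *requester, SharedSchedState *state)
 { if (--requester->numUnmetDeps == 0)
      state->readyQ[state->numReady++] = requester;
 }

/* Assignment module: called by VMS when a spot on hardware is free.
   It is language-agnostic -- it sees only the shared ready queue. */
VirtProc *assignWork(SharedSchedState *state)
 { if (state->numReady == 0) return NULL;  /* nothing ready, spot stays idle */
   return state->readyQ[--state->numReady];
 }
```

Because assignWork never sees language constructs directly, only the shared state the request handler writes, it can be reused unchanged across languages.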
3.27 +
3.28 +The greatest application-performance impacts due to the scheduler are the communication it causes, the management of the memory hierarchy, and the match between work characteristics and hardware characteristics (i.e., assigning to an accelerator vs a CPU).  Hence, significant work goes into implementing strategies and mechanisms for finding the best assignment choices.  Such implementations are only loosely coupled to the language, through the shared state by which the request-handler informs the assigner of what work is ready to be animated.
3.31 +
3.32 +Hence, it is straightforward to reuse the code that assigns work to physical locations.  The only language-specific influence on the assigner is the shared constraint-state.
3.33 +
3.34 +
3.35 +