# HG changeset patch
# User Some Random Person
# Date 1336134197 25200
# Node ID 95c7bc4d8cc97d3fa0839cda20f9b46700d2708c
# Parent  d66564c88e9a750e9840ddd1c8979d299bbcdd7f
Universal -- created paper for universal runtime, with abstract and intro

diff -r d66564c88e9a -r 95c7bc4d8cc9 0__Papers/VMS/Universal/figures/control_flow.pdf
Binary file 0__Papers/VMS/Universal/figures/control_flow.pdf has changed
diff -r d66564c88e9a -r 95c7bc4d8cc9 0__Papers/VMS/Universal/figures/control_flow.svg
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/0__Papers/VMS/Universal/figures/control_flow.svg Fri May 04 05:23:17 2012 -0700
@@ -0,0 +1,2966 @@
+[SVG markup omitted; the text labels and notes of the control-flow figure follow, in file order]
+save VP context
+which context switch
+constraint update hdlr
+Push work onto Q
+Take work-unit from Q
+Non-suspend end of work-unit
+Suspend at end of work-unit
+new work-unit is attached to a VP's context
+new work-unit has own local context
+load curr VP with context from new work-unit
+which has attached context
+Non-suspend end of work-unit
+purely local (no ctxt attached)
+save stack & frame ptrs
+CILK is example of this case, when leaf child finishes
+Dataflow is example of this case
+pthread is example of this case, as is Cilk when suspends on sync
+no work in Q
+(in CILK, counts child ends, and handles sync)
+(for CILK, Q filled by async spawns)
+time to chk msgs?
+scan for in-coming msgs, give them to constraint updater and Assigner
+send curr VP to pool, switch to VP of new work-unit
+constraint msg handler
+Push Work onto Q
+send constr update msg
+(in dataflow & CILK, msgs from other cores go to Assigner to ask for work and to push)
+(in dataflow & CILK, push work to other cores via msgs -- remember constraints that cross cores)
+Assigner msg handler
+send work-push msg
+send constr update msgs
+done with msgs
+no
+yes, chk
+constraint msgs
+Assigner msgs
+(in CILK, completion of child on remote core notifies parent's core. In dataflow, remote propendent sends data to dependents' cores)
+send "need work" msg
+receive work-push msg
+receive need-work msg & have work to give
+No work
+Jmp to new work-unit
+Assigner
+request hdlr
+time to chk msgs?
+yes, chk
+no
+done with msgs
+push work onto Q, and send "cancel need work"
+send constr update msgs
+Push work onto Q
+send Assigner messages
+get VP from pool (make new if none). Load it with context from new work-unit
+new work-unit has own local context, but prev VP suspended
+(in pthreads, checks mutex structures, cond var structs, etc)
+work-unit state chgs
+done with msgs
+Suspend at end of work-unit
+pthread is example of this case, as is Cilk when suspends on sync
+Assigner
+request hdlr
+time to chk msgs?
+yes, chk
+no
+done with msgs
+get semEnv lock & update state of VP
+get semEnv lock & pick a ready VP
+(in pthreads, checks mutex structures, cond var structs, etc)
+work-unit state chgs
+Jmp to new work-unit
+switch to ready VP
+save VP context
+No VPs ready
+increase backoff each repetition & update backoff state. Try to make core enter power-down idle state while waiting
+do Backoff wait
+send "need work"?
+yes, send
+got work?
+Don't send
+yes, got work
+The difference between top two paths is the way the request hdlr + assigner has been implemented -- use shared state on top path, but only local on the second. NOTE: the request hdlr and assigner are combined into a single straight-line piece of code.
+No VPs ready
+Local semantic Env holds the Q of ready work-units
+-] Msgs update the local semantic state, and put work-units into this Q
+-] Shared sem state is traditional VMS, except req hdlr and assigner are same Fn
+-] Shared sem has its own structs to decide which VP is ready, and switches to it at end of Assigner
+-] Which core a VP runs on is decided between req hdlr and assigner, based on shared structs that hold the VPs
+-] For local-only, VP is moved to the core it runs on -- assigner only moves VPs to diff cores and receives them
+-] For atomic tasks, the task-info is sent betwn cores.. for VPs, whole live portion of stack is sent.
+Okay, so this fits the standard VMS model -- except now the core-controller is gone, so the extra level of UCC is taken away. Now, the MasterVP is "reusing" whatever VP has suspended -- in a way, the core-controller plus AnimationMaster are reduced to the assembly call that the WrapperLib (WL) makes to end the current work-unit, which suspends the VP. Inside that suspend call is the opportunity to switch between different processes, call upon VMS-only helper services, and so on.
+So, the switch-over is a function call to a wrapper-lib, which then does an assembly Fn call -- the assembly saves the stack state (regs already saved when did the wrapper-lib call), and then puts localEnv into the param reg (for 64 bit convention) and jumps to the plugin-fn. Note, there are three different assembly calls, one for each kind of work-unit, to end it. Sometimes work-unit calls assembly directly, sometimes it calls a wrapper-lib that just does all the work right there, sometimes the wrapper-lib calls the appropriate assembly Fn.
+Need to modify request structure, so abstraction can supply services via request -- and perhaps library Fns that perform some of services direct from app, and other Fns for use inside the plugins.. such as Malloc-Free, create VP, create atomic-Task.. some are pure wrapper-lib, others are combo.. seeing one version of malloc for wrapper-lib, different for plugin and msg-system use.
+The msg system is details of impl of Lang Animator -- so plugin is still "completing" the lang animator by adding semantics
+Core-controller was a second level beneath the runtime (Language Animator is the runtime).. The sched slots were virtual physical animators, and the masterVP was the Language Animator, which the AnimationMaster Fn and the plugin Fns supplied the behavior of. The coreCtlr switched between levels -- The MasterVP was "outside" the framework of the scheduling slots -- in a way it was beside them (they took turns getting the actual phys animator, so same level) and in other way was above -- it controlled what went into the slots, so above them. In another way, the MasterVP was the Language Animator, while the slots were virtual physical, so they had no logical connection -- the slots were related to the same thing that animated the language animator.. a very strange arrangement.
+
+
diff -r d66564c88e9a -r 95c7bc4d8cc9 0__Papers/VMS/Universal/latex/VMS_universal.tex
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/0__Papers/VMS/Universal/latex/VMS_universal.tex Fri May 04 05:23:17 2012 -0700
@@ -0,0 +1,273 @@
+%&latex
+%% Derived from: `accept.tex' (from acmconf.dtx),
+
+\documentclass[box,accept]{acmconf}
+
+\CopyrightText{\copyright ACM 2000, ....., used with the \texttt{box} option.}
+\IfFileExists{graphicx.sty}{\usepackage{graphicx}}{}
+\ConferenceName{1.
Conference on Designing a \LaTeX2e Class for
+  Typesetting ACM Papers, Hawaii 2000}
+\ConferenceShortName{CONF-2000}
+
+\def\XX{More text should follow, but keep in mind that a limit of 6
+  pages has been set, including figures and references. More text
+  should follow, but keep in mind that a limit of 6 pages has been
+  set, including figures and references. More text should follow, but
+  keep in mind that a limit of 6 pages has been set, including figures
+  and references. More text should follow, but keep in mind that a
+  limit of 6 pages has been set, including figures and references.
+  \par
+}
+
+\begin{document}
+
+%+Title
+\date{31. December 1999}
+\title{A Universal Proto-Runtime for Domain Specific Parallel Languages}
+\author{\Author{Sean Halle}\\
+  \Address{Open Source Research Institute}\\
+  \Email{Sean.Halle@OpenSourceResearchInstitute.org}
+  }
+\maketitle
+%-Title
+
+%+Abstract
+\begin{abstract}
+Software has not been keeping up with parallel hardware, which slows the economy and retards adoption of new hardware. The gap is due in part to the disruption caused by moving to parallel languages, and in part to the prohibitive effort of porting application code across platforms. A leading idea for solving this is domain-specific parallel languages, where custom constructs are made to match features of the problem. However, such languages have a small number of users, too few to support the large effort currently needed to create such languages and port them across hardware platforms.
+
+To simplify creation of domain-specific languages, we propose a ``proto'' runtime, which provides the cross-language portion of runtime behavior. This includes handling the concurrency issues within the runtime itself. A given language only provides sequential implementations of its constructs.
+
+
+
+We explain the practical usage and theory, and show measurements of implementation time of three simple languages and one domain-specific language for hardware simulation. We also give runtime overhead measurements, which are orders of magnitude lower than those of pthreads and OpenMP.
+\end{abstract}
+%-Abstract
+
+
+\section{Introduction}
+Current parallel languages, such as pthreads, TBB, OpenMP, and MPI, require programmers to learn new, complex mental models. Sequential programmers have to be retrained to start using them, and a new set of programming practices must be adopted for them. Such retraining causes considerable disruption at application-software companies that attempt to start using these parallel languages.
+
+Further, after learning the new language and adopting the new programming practices, the companies still have to hand-tune each application for each target hardware platform. Hand-tuning also means that customers have to get a new binary when they upgrade hardware. Both effects are costly, and tend to retard adoption of new hardware, despite potential performance gains.
+
+The net effect is that software lags behind hardware, and the potential advantages across the economy of new hardware designs are lost. What is needed is a way to ease the transition from sequential programming to parallel programming, and to reduce the need for hand-tuning to get efficient performance on new hardware.
+
+
+Domain-specific languages promise to deliver both easier transition and efficiency across hardware[]. They do this by providing custom constructs that match patterns in the application. Thus, they are natural for the sequential programmer to use. The constructs ``hide'' the parallelism inside themselves, letting the runtime or toolchain handle it, freeing the programmer.
+
+To illustrate the concept, we briefly cover one such language, called HWSim, which is used for describing the behavior of hardware. It provides a straightforward means for describing hardware, and extracts parallelism from the description. The extracted parallelism is then efficiently exploited on a variety of target platforms.
+
+One obstacle to widespread adoption of such domain-specific languages is the cost of developing them. Currently, a toolchain is typically created for each language, with an optimizer and back end for each target platform, which is expensive.
+
+An alternative approach is that of so-called embedded languages, in which the new constructs are viewed as library calls made from a base language, such as C. This relies on the runtime system to provide efficient execution on a given target platform. It avoids the expense of creating a toolchain for each target, by instead creating a runtime for each target that is tuned to the hardware.
+
+This is a step forward, but such runtimes are still expensive to create. A technology to simplify the runtime creation would be helpful.
+
+In this paper, we present a means to reduce the effort of runtime creation, not only by reducing the complexity, but also by making reuse across languages more practical. It essentially breaks a runtime into two pieces: a part that implements the semantic behavior of the custom constructs, and a part that is the same for every language.
+
+ A nice happenstance is that the complicated multi-threaded issues that come up inside current runtime implementations can be collected inside the part that stays the same across languages. This means the language need only supply a sequential implementation of its constructs' behavior.
+
+Another nice happenstance is that the interface between the two parts of the runtime modularizes the runtime code. This, by itself, speeds development of new runtimes. In addition, it makes sharing between languages practical, especially for the portion that chooses where to perform work, and in which order.
+
+ This portion of the runtime handles data affinity and the shape of the dependency graph, which are responsible for the resulting performance. However, it contains few, if any, language-specific portions, so it is practical to share between languages, for a given target platform. This saves a non-trivial amount of development work.
+
+While runtimes built without our contribution are still free to share such code, they have no equivalent interface between the runtime pieces. This makes isolating this portion of code more time consuming, and forces more effort to fit code from one runtime into that of a different language.
+
+We call our contribution Universal Proto-Runtime (UPR) in order to capture the idea that we supply a partial runtime that must be completed by the language. Unlike a thread package, our contribution cannot be used directly by application code. Rather, a \emph{plugin} that contains the language-specific portions must be supplied.
+The application then uses the combination.
+
+
+Organization of paper
+
+\section{Background and Related Work}
+For performance, the proto-runtime supports multiple levels of runtime hierarchy. In higher levels, work-units are large, leaving time for the decision about where to execute them to use advanced algorithms, which track data affinity and analyze dependency patterns. For lower levels, the work-units are smaller, leaving less time to search for the best location, so simpler algorithms are used.
+
+UPR differs from pthreads, TBB, and other thread packages in that it provides more services to simplify runtime creation, and, more importantly, UPR has a mental model that is specific to runtime creation. pthreads and TBB are programming languages in their own right -- but UPR has no semantics usable in application code, because it's only a \emph{part} of a runtime.
+
+?
+
+
+
+\section{The Story Begins\ldots}
+A real article is supposed to have some deep results and good
+explanations. That, however, is your job and not mine so you should
+replace this text with something more appropriate\footnote{Another
+  footnote}.
+
+\section{Some often used \LaTeX\ commands}
+
+\subsection{\texttt{emph}, etc.}
+Text may be set as \emph{emph}.\\
+Text may be set as \texttt{texttt}.\\
+Text may be set as \underline{underline}.\\
+Text may be set as \textbf{textbf}.\\
+Text may be set as \textrm{textrm}.\\
+Text may be set as {\tiny tiny}.\\
+Text may be set as {\scriptsize scriptsize}.\\
+Text may be set as {\footnotesize footnotesize}.\\
+Text may be set as {\normalsize normalsize}.\\
+Text may be set as {\large large}.\\
+Text may be set as {\Large Large}.\\
+Text may be set as {\LARGE LARGE}.\\
+Text may be set as {\huge huge}.\\
+Text may be set as {\Huge Huge}.\\
+Text may have$^{\textrm{super}}$ and$_{\textrm{sub}}$scripts.
+
+\subsection{\texttt{itemize}}
+\begin{itemize}
+\item More text should follow, but keep in mind that a limit of 6
+  pages has been set, including figures and references. More text
+  should follow, but keep in mind that a limit of 6 pages has been
+  set, including figures and references.
+\item More text should follow, but keep in mind that a limit of 6
+  pages has been set, including figures and references. More text
+  should follow, but keep in mind that a limit of 6 pages has been
+  set, including figures and references.
+\end{itemize}
+
+\subsection{\texttt{enumerate}}
+\begin{enumerate}
+\item More text should follow, but keep in mind that a limit of 6
+  pages has been set, including figures and references. More text
+  should follow, but keep in mind that a limit of 6 pages has been
+  set, including figures and references.
+\item More text should follow, but keep in mind that a limit of 6
+  pages has been set, including figures and references. More text
+  should follow, but keep in mind that a limit of 6 pages has been
+  set, including figures and references.
+\end{enumerate}
+
+\subsection{\texttt{description}}
+\begin{description}
+\item[Foo] More text should follow, but keep in mind that a limit of 6
+  pages has been set, including figures and references. More text
+  should follow, but keep in mind that a limit of 6 pages has been
+  set, including figures and references.
+\item[Bar] More text should follow, but keep in mind that a limit of 6
+  pages has been set, including figures and references. More text
+  should follow, but keep in mind that a limit of 6 pages has been
+  set, including figures and references.
+\end{description}
+
+\subsection{\texttt{center} and \texttt{tabular}}
+\begin{center}
+\begin{tabular}{|l|c|r|}\hline
+left & center & right \\\hline\hline
+AAAAAAAA & BBBBBBBB & CCCCCCCC \\
+AAAAAAAA & BBBBBBBB & CCCCCCCC \\\cline{3-3}
+AAAAAAAA & BBBBBBBB & CCCCCCCC \\\cline{2-2}
+AAAAAAAA & BBBBBBBB & CCCCCCCC \\\cline{1-2}
+AAAAAAAA & BBBBBBBB & CCCCCCCC \\\hline
+AAAAAAAA & BBBBBBBB & CCCCCCCC \\\hline
+1 & \multicolumn{2}{|c|}{2} \\\hline
+\end{tabular}
+\end{center}
+
+\subsection{\texttt{figure} and Postscript pictures}
+Have a look at figure~\ref{fig-1} and~\ref{fig-2}.
+
+\begin{figure}
+\hrule
+Nice Postscript, isn't it?
+\begin{center}
+\IfFileExists{graphicx.sty}{
+  \includegraphics{body.eps}
+}{
+  Sorry, package \texttt{graphicx} not present.
+}
+\end{center}
+
+Same, a little bit smaller:
+\begin{center}
+\IfFileExists{graphicx.sty}{
+  \includegraphics[scale=.5]{body.eps}
+  }{
+  Sorry, package \texttt{graphicx} not present.
+}
+\end{center}
+\caption{\label{fig-1}This is a nice floating figure}
+\hrule
+\end{figure}
+
+\begin{figure*}
+\hrule
+This figure uses both columns, using \texttt{figure*}
+\begin{center}
+\IfFileExists{graphicx.sty}{
+  \includegraphics[scale=.5]{body.eps}
+  \hspace{1cm}
+  \includegraphics[scale=.5]{body.eps}
+}{
+  Sorry, package \texttt{graphicx} not present.
+}
+\end{center}
+\caption{\label{fig-2}This is a nice floating figure}
+\hrule
+\end{figure*}
+
+\section{The Story Continues 1}
+
+This is a \verb+\section+.
+
+\XX\XX
+
+\subsection{The Story Continues 2}
+
+This is a \verb+\subsection+.
+
+\XX\XX
+
+\subsubsection{The Story Continues 3}
+
+This is a \verb+\subsubsection+.
+
+\XX\XX
+
+\subsubsubsection{The Story Continues 4}
+
+This is a \verb+\subsubsubsection+.
+
+\XX\XX
+
+\subsubsubsubsection{The Story Continues 5}
+
+This is a \verb+\subsubsubsubsection+.
+
+\XX\XX
+
+\paragraph{The Story Continues 6}
+
+This is a \verb+\paragraph+.
+\XX\XX
+
+\subparagraph{The Story Continues 7}
+This is a \verb+\subparagraph+.
+\XX\XX\XX
+
+\section{Conclusion}
+The end, at last! In this example there really are no results or
+points to summarize but I trust your article has more food for thought
+and thus will need a conclusion.
+
+\appendix
+\section{Appendices}
+If you have any, appendices might go here. Note that appendices
+should not be used to circumvent the word count limit.
+
+This is "doing it by hand" --- you might be better off using BibTeX.
+
+%+Bibliography
+\begin{thebibliography}{X}
+\bibitem[1]{Lam94} Leslie Lamport: {\em \LaTeX, A Document
+  Preparation System,} Addison Wesley~1994.
+\end{thebibliography}
+%-Bibliography
+
+\IfPrepare{
+  \tableofcontents
+  \listoffigures
+  \listoftables
+}{}
+
+\end{document}
+
+
diff -r d66564c88e9a -r 95c7bc4d8cc9 0__Papers/VMS/VMS__Performance_on_Multicore/Universal/figures/control_flow.pdf
Binary file 0__Papers/VMS/VMS__Performance_on_Multicore/Universal/figures/control_flow.pdf has changed
diff -r d66564c88e9a -r 95c7bc4d8cc9 0__Papers/VMS/VMS__Performance_on_Multicore/Universal/figures/control_flow.svg
--- a/0__Papers/VMS/VMS__Performance_on_Multicore/Universal/figures/control_flow.svg Fri Apr 27 18:47:48 2012 +0200
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,2966 +0,0 @@
-[SVG markup and text labels omitted; the text of this removed file is identical to that of the control_flow.svg added under 0__Papers/VMS/Universal/figures/ earlier in this changeset]