

\documentclass[conference]{IEEEtran}
%
\usepackage{makeidx,geometry,amssymb,graphicx,calc,ifthen}
%

% *** CITATION PACKAGES ***
%
%\usepackage{cite}
% cite.sty was written by Donald Arseneau
% V1.6 and later of IEEEtran pre-defines the format of the cite.sty package
% \cite{} output to follow that of IEEE. Loading the cite package will
% result in citation numbers being automatically sorted and properly
% "compressed/ranged". e.g., [1], [9], [2], [7], [5], [6] without using
% cite.sty will become [1], [2], [5]--[7], [9] using cite.sty. cite.sty's
% \cite will automatically add leading space, if needed. Use cite.sty's
% noadjust option (cite.sty V3.8 and later) if you want to turn this off.
% cite.sty is already installed on most LaTeX systems. Be sure and use
% version 4.0 (2003-05-27) and later if using hyperref.sty. cite.sty does
% not currently provide for hyperlinked citations.
% The latest version can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/cite/
% The documentation is contained in the cite.sty file itself.






% *** GRAPHICS RELATED PACKAGES ***
%
\ifCLASSINFOpdf
  % \usepackage[pdftex]{graphicx}
  % declare the path(s) where your graphic files are
  % \graphicspath{{../pdf/}{../jpeg/}}
  % and their extensions so you won't have to specify these with
  % every instance of \includegraphics
  % \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
\else
  % or other class option (dvipsone, dvipdf, if not using dvips). graphicx
  % will default to the driver specified in the system graphics.cfg if no
  % driver is specified.
  % \usepackage[dvips]{graphicx}
  % declare the path(s) where your graphic files are
  % \graphicspath{{../eps/}}
  % and their extensions so you won't have to specify these with
  % every instance of \includegraphics
  % \DeclareGraphicsExtensions{.eps}
\fi
% graphicx was written by David Carlisle and Sebastian Rahtz. It is
% required if you want graphics, photos, etc. graphicx.sty is already
% installed on most LaTeX systems. The latest version and documentation can
% be obtained at: 
% http://www.ctan.org/tex-archive/macros/latex/required/graphics/
% Another good source of documentation is "Using Imported Graphics in
% LaTeX2e" by Keith Reckdahl which can be found as epslatex.ps or
% epslatex.pdf at: http://www.ctan.org/tex-archive/info/
%
% latex, and pdflatex in dvi mode, support graphics in encapsulated
% postscript (.eps) format. pdflatex in pdf mode supports graphics
% in .pdf, .jpeg, .png and .mps (metapost) formats. Users should ensure
% that all non-photo figures use a vector format (.eps, .pdf, .mps) and
% not a bitmapped formats (.jpeg, .png). IEEE frowns on bitmapped formats
% which can result in "jaggedy"/blurry rendering of lines and letters as
% well as large increases in file sizes.
%
% You can find documentation about the pdfTeX application at:
% http://www.tug.org/applications/pdftex





% *** MATH PACKAGES ***
%
%\usepackage[cmex10]{amsmath}
% A popular package from the American Mathematical Society that provides
% many useful and powerful commands for dealing with mathematics. If using
% it, be sure to load this package with the cmex10 option to ensure that
% only type 1 fonts will utilized at all point sizes. Without this option,
% it is possible that some math symbols, particularly those within
% footnotes, will be rendered in bitmap form which will result in a
% document that can not be IEEE Xplore compliant!
%
% Also, note that the amsmath package sets \interdisplaylinepenalty to 10000
% thus preventing page breaks from occurring within multiline equations. Use:
%\interdisplaylinepenalty=2500
% after loading amsmath to restore such page breaks as IEEEtran.cls normally
% does. amsmath.sty is already installed on most LaTeX systems. The latest
% version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/required/amslatex/math/





% *** SPECIALIZED LIST PACKAGES ***
%
%\usepackage{algorithmic}
% algorithmic.sty was written by Peter Williams and Rogerio Brito.
% This package provides an algorithmic environment fo describing algorithms.
% You can use the algorithmic environment in-text or within a figure
% environment to provide for a floating algorithm. Do NOT use the algorithm
% floating environment provided by algorithm.sty (by the same authors) or
% algorithm2e.sty (by Christophe Fiorio) as IEEE does not use dedicated
% algorithm float types and packages that provide these will not provide
% correct IEEE style captions. The latest version and documentation of
% algorithmic.sty can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/algorithms/
% There is also a support site at:
% http://algorithms.berlios.de/index.html
% Also of interest may be the (relatively newer and more customizable)
% algorithmicx.sty package by Szasz Janos:
% http://www.ctan.org/tex-archive/macros/latex/contrib/algorithmicx/




% *** ALIGNMENT PACKAGES ***
%
%\usepackage{array}
% Frank Mittelbach's and David Carlisle's array.sty patches and improves
% the standard LaTeX2e array and tabular environments to provide better
% appearance and additional user controls. As the default LaTeX2e table
% generation code is lacking to the point of almost being broken with
% respect to the quality of the end results, all users are strongly
% advised to use an enhanced (at the very least that provided by array.sty)
% set of table tools. array.sty is already installed on most systems. The
% latest version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/required/tools/


%\usepackage{mdwmath}
%\usepackage{mdwtab}
% Also highly recommended is Mark Wooding's extremely powerful MDW tools,
% especially mdwmath.sty and mdwtab.sty which are used to format equations
% and tables, respectively. The MDWtools set is already installed on most
% LaTeX systems. The lastest version and documentation is available at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/mdwtools/


% IEEEtran contains the IEEEeqnarray family of commands that can be used to
% generate multiline equations as well as matrices, tables, etc., of high
% quality.


%\usepackage{eqparbox}
% Also of notable interest is Scott Pakin's eqparbox package for creating
% (automatically sized) equal width boxes - aka "natural width parboxes".
% Available at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/eqparbox/





% *** SUBFIGURE PACKAGES ***
%\usepackage[tight,footnotesize]{subfigure}
% subfigure.sty was written by Steven Douglas Cochran. This package makes it
% easy to put subfigures in your figures. e.g., "Figure 1a and 1b". For IEEE
% work, it is a good idea to load it with the tight package option to reduce
% the amount of white space around the subfigures. subfigure.sty is already
% installed on most LaTeX systems. The latest version and documentation can
% be obtained at:
% http://www.ctan.org/tex-archive/obsolete/macros/latex/contrib/subfigure/
% subfigure.sty has been superceeded by subfig.sty.



%\usepackage[caption=false]{caption}
%\usepackage[font=footnotesize]{subfig}
% subfig.sty, also written by Steven Douglas Cochran, is the modern
% replacement for subfigure.sty. However, subfig.sty requires and
% automatically loads Axel Sommerfeldt's caption.sty which will override
% IEEEtran.cls handling of captions and this will result in nonIEEE style
% figure/table captions. To prevent this problem, be sure and preload
% caption.sty with its "caption=false" package option. This is will preserve
% IEEEtran.cls handing of captions. Version 1.3 (2005/06/28) and later 
% (recommended due to many improvements over 1.2) of subfig.sty supports
% the caption=false option directly:
%\usepackage[caption=false,font=footnotesize]{subfig}
%
% The latest version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/subfig/
% The latest version and documentation of caption.sty can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/caption/




% *** FLOAT PACKAGES ***
%
%\usepackage{fixltx2e}
% fixltx2e, the successor to the earlier fix2col.sty, was written by
% Frank Mittelbach and David Carlisle. This package corrects a few problems
% in the LaTeX2e kernel, the most notable of which is that in current
% LaTeX2e releases, the ordering of single and double column floats is not
% guaranteed to be preserved. Thus, an unpatched LaTeX2e can allow a
% single column figure to be placed prior to an earlier double column
% figure. The latest version and documentation can be found at:
% http://www.ctan.org/tex-archive/macros/latex/base/



%\usepackage{stfloats}
% stfloats.sty was written by Sigitas Tolusis. This package gives LaTeX2e
% the ability to do double column floats at the bottom of the page as well
% as the top. (e.g., "\begin{figure*}[!b]" is not normally possible in
% LaTeX2e). It also provides a command:
%\fnbelowfloat
% to enable the placement of footnotes below bottom floats (the standard
% LaTeX2e kernel puts them above bottom floats). This is an invasive package
% which rewrites many portions of the LaTeX2e float routines. It may not work
% with other packages that modify the LaTeX2e float routines. The latest
% version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/sttools/
% Documentation is contained in the stfloats.sty comments as well as in the
% presfull.pdf file. Do not use the stfloats baselinefloat ability as IEEE
% does not allow \baselineskip to stretch. Authors submitting work to the
% IEEE should note that IEEE rarely uses double column equations and
% that authors should try to avoid such use. Do not be tempted to use the
% cuted.sty or midfloat.sty packages (also by Sigitas Tolusis) as IEEE does
% not format its papers in such ways.





% *** PDF, URL AND HYPERLINK PACKAGES ***
%
%\usepackage{url}
% url.sty was written by Donald Arseneau. It provides better support for
% handling and breaking URLs. url.sty is already installed on most LaTeX
% systems. The latest version can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/misc/
% Read the url.sty source comments for usage information. Basically,
% \url{my_url_here}.





% *** Do not adjust lengths that control margins, column widths, etc. ***
% *** Do not use packages that alter fonts (such as pslatex).         ***
% There should be no need to do such things with IEEEtran.cls V1.6 and later.
% (Unless specifically asked to do so by the journal or conference you plan
% to submit to, of course. )


% correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor}


\begin{document}
\bibliographystyle{plain}
%

\title{The Shape of Future Architectures}

\author
{
 \IEEEauthorblockN{Sean Halle}
 \IEEEauthorblockA
 {
   University of California at Santa Cruz\\
   and INRIA Paris\\
   Email: sean.halle@inria.fr
 }
 \and
 \IEEEauthorblockN{Albert Cohen}
 \IEEEauthorblockA
 { INRIA Paris\\
   Email: albert.cohen@infria.fr
 }
}


\maketitle             
%

\begin{abstract}
  Process technology is rapidly approaching multiple fundamental physical limits that combine to increase the cost of producing chips faster than the increase in density.  Whether or not process technology continues to shrink, the economics will slow the demand for chips from future processes.  This will, for the first time, shift the design decisions from being shortest time-to-market towards being the greatest efficiency per transistor -- either in computation delivered towards the goal end-user measurement such as throughput, or in energy per operation.
  
The implication for processor architecture is to accelerate the trend away from "easy to program" structures such as out-of-order pipelines and coherent shared memory, and towards hierarchical designs that force application information to be exploited in explicitly managing the placement and movement of data, in order to keep it as local for as much of the computation as possible.  The cost of communication will be the dominant factor driving the shape of future architectures, both from a computation-bottleneck cost and an energy per operation cost.

Integration will be pushed towards the third dimension in order to mitigate the communication costs -- indeed main memories have long been on that path.  This approach will hasten the development of 3 dimensional integration processes.  The architecture implication is that 
\end{abstract}

\section{Motivation}

Threads were invented to be virtual sequential processors, but are 
problematic when one has multiple physical cores. Not only
are they difficult to use for parallel code, but their parallel performance is poor
because they block the language from controlling
task placement. The language often knows which cores are likely to already
have data in the cache which a new task requires as input. If it had
control, the language could place tasks where the data is likely to already
reside, reducing communication and increasing performance and power
efficiency. A thread replacement is needed that is both easy to use in
applications and gives the language control over task placement.

The first question is, what is the easiest to use parallelism construct?  There
exist many to choose from: Threads (for legacy reasons); Actors
{\cite{Hewitt10}}{\cite{Actors97}}; Components {\cite{ComponentModel00}};
process calculi {\cite{hoare78}} {\cite{milner99}};
coordination languages {\cite{Gelernter85Linda}}; and new ones
continually invented.

We demonstrate in this paper  a thread replacement that lets any such parallelism
construct be plugged in.  It is the first extensible hardware abstraction,
allowing to plug-in both preferred parallelism constructs and preferred 
runtime scheduler. We focus in this introductory paper on
the parallelism construct support, illustrating how to implement synchronous
\texttt{send}-\texttt{receive} constructs motivated by process calculi,
the \texttt{spawn} and \texttt{sync} constructs from Cilk, and even
\texttt{mutex} and \texttt{condition variable} constructs from pthreads.
However, the schedulers we implemented have no
interesting performance optimizations.

The next question is, how much effort is required to implement a plugin? Our
mechanism protects the plugin from exposure to concurrency issues, allowing
the parallelism constructs to be implemented with sequential algorithms.  It
accomplishes this by introducing a ``virtual'' time that sequentializes events
that appear, to the program, to be simultaneous.

We call our abstraction Virtualized Master-Slave, or VMS. We demonstrate a user-level
implementation in this paper, although it is ideally implemented as the base
hardware abstraction of the OS.

It involves subtle concepts related to time in the program vs time in the
plugin vs physical time.  We explain it in four levels, starting with an abstract definition
 and moving down to implementation.

The first level is the theoretical definition, given in Section \ref{secAbsModel}. Second is the elements
of our implementation and how they relate to the theoretical definition, in Section \ref{secInternal}.
Third is the application code point of view, in Section \ref{secApp}, which is tied back to both
the abstract model and the internal elements. The fourth view,
also in Section \ref{secApp}, shows the implementation of the plugin for a parallel
construct. Additionally,
measurements of effectiveness appear
in Section \ref{secResults} and conclusion is in \ref{secConclusion}.

\section{Background and Related Work}

{\bf Side-stepping OS Threads:\ }User-level thread packages and most parallel
language runtimes have to side-step OS threads, by pinning one to each core,
which effectively gives the user-level package control over the core. Our VMS
implementation also does this. We are not claiming in this paper to have the
OS level implementation of VMS that the model is capable of -- just the
user-space version.

Related work either provides a variation on the Thread model, or is a full
language with included parallelism constructs.  For example,
 TBB {\cite{TBBHome}} is a user-space thread package with added features.
It allows
tasks to create other tasks and control when to start their
execution. However, one cannot plug in alternate parallelism-control constructs, such as
rendez-vous style send and receive, which would have to be made using locks.
 Likewise, Cilk {\cite{CILKHome}}
provides a simpler and easier to use version of TBB's task scheduler, but is
limited to only tree-shaped task graphs. OpenMP {\cite{OpenMPHome}} allows
 creating tasks and controlling their execution order,
but is a language, not a hardware abstraction.  All of these may
alternatively be implemented via plugins to VMS, with similar performance.

In contrast, VMS is the first hardware abstraction that doesn't impose its own concurrency
semantics, but rather takes preferred ones as plugins. This makes it not a
language itself, but a {\em{support}} mechanism for language level
parallelism constructs. The parallelism constructs in Actors, Components,
pthreads, and so on may all be implemented as VMS plugins.

{\bf Virtual Processor (VP) Definition:\ }We want to avoid the confusion
associated with the terms ``thread'' and ``task'' so will use the term
{\em{virtual processor}} (VP), which we define as state in combination with
the ability to animate code or {\em{an additional level of virtual
processors}}. The state consists of a stack with its contents, a program
counter, a pointer to top of stack, and a pointer to the current stack frame.

\section{Abstract Definition of VMS}
\label{secAbsModel}

We give an intuitive overview, then add details in the following sub-sections.

{\bf Intuitive Overview:\ }VMS is concerned primarily with time and
guarantees about it. This is because parallelism constructs control how the
time-lines of different virtual processors intersect. They also guarantee
relations of time lines to hardware events.

As an example, consider a program that writes into a data structure in one
time-line, then calls a \texttt{send} construct, meanwhile in a different
time-line it calls the \texttt{receive} construct then reads the data
structure. The constructs should guarantee that all data written before the
\texttt{send} is readable in the other time-line after the
\texttt{receive}.  VMS provides primitive guarantees, which plugged-in code
builds upon to provide such higher-level guarantees.

To support parallelism constructs, VMS provides: primitive operations to create and suspend
VPs; a way for plugged-in code to control when each VP is (re)started; and
time-related guarantees. These are enforced on all hardware, be it shared
memory or distributed, with strong memory consistency or weak.

{\bf Definition in Three Parts:\ }We give the abstract definition in three
parts: a definition of the elements of a VMS computation system; a definition
of time and the key VMS guarantee; and a definition of virtual processor
scheduling states and transitions between them.

The definition we give is for VMS {\em{with plugins present}}. \ Hence, it
covers the behavior of all possible parallelism constructs implementable with
VMS. The Master mentioned in the definition is an abstract
entity, with a plugin present. In practice, this Master entity is implemented
as part of a core VMS, and plugins later added. This VMS-core is the
hardware abstraction. It hides the physical hardware behind an interface that
creates virtual processors and enforces well-defined time-behavior.


\subsection{The Elements of a VMS Computation System}

\texttt{- }A VMS program has multiple VPs, which are Slaves,
each with an independent time-line.

\texttt{- }A schedule of Slaves is generated by a Master
entity, from within separate time-line(s).

\texttt{- }A schedule is defined as the set of points at which
VPs are (re)animated.

\texttt{- }All semantic parallelism behavior is invoked via
communication with the Master.

\texttt{- }Communication with the Master happens by using a
model-provided primitive, which causes {\em{voluntary}} suspension of the
program's VP.


What is important here is: that the choice of which VP is animated, at which
point, happens in a separate time-line; and that the VPs suspend voluntarily
for each parallelism construct. This means that {\em{scheduling is separated
from the application code}}, the key point.

The Master entity appears to be a single entity to the slaves, but may be
implemented by multiple Master VPs hidden inside the VMS implementation.

VPs use the Master as an intermediary to: semantically communicate with each
other; cause creation of new program VPs; and to influence re-animation of
VPs. As a subtlety, notice that hardware mechanisms, such as coherent shared
memory, allow communication to take place that is not visible to the
parallelism constructs. Parallelism constructs must be separately called in
order to make use of shared variable communication safe.

{\bf Definitions:\ }VMS is intended only for hardware systems that consist of processing elements
connected by communication.  We define a memory-space to be a processing
element, albeit without the ability to transform data. We define a
{\em{physical core}} to be a processing element that {\em{does}} transform
data, and require that it execute a sequential stream of instructions. We
define a program-time as the sequence of instructions animated by a Slave VP
(which is eventually animated by a physical core). A Slave VP has associated
{\em{scheduling state}} that, among other things, relates to how its
program-time progresses relative to physical time on the cores.

\subsection{Time in VMS}

\texttt{- }VMS has three levels of time: {\em{Program time}},
{\em{ Master time}}, and {\em{Virtual time}}.

\texttt{- }Program time is local to a Slave VP, measured in
instruction executions.

\texttt{- }Master time is hidden from the program and is
independent from all Program times.

\texttt{- }Virtual time is the ordered set of changes in
scheduling state of Slave VPs.



What is most important here is that Virtual time defines a global sequential
ordering. This ordering is consistent with the key VMS guarantee (given
below), and each point in it is computed within Master time.

Also, the independence between program times and master time has subtle
advantages. It enables elegant enforcement of the VMS guarantee, and
implementation simplifications that become clear after gaining deep
implementation knowledge.

In VMS, each event relevant to parallel semantics is tied to a transition of
the state of a Slave VP. This means that implementing the behavior of parallel
semantics is equivalent to controlling the order of transitions of state of
virtual processors.

{\bf Definitions:\ }The stream of instructions in a given program-time is
broken into a number of {\em{trace-segments}}, separated by suspension
points. Each trace-segment is animated by a single physical core, but not
necessarily the same core as animated the other trace segments. A suspend
point is created by a Slave VP executing the ``suspend'' primitive provided by
VMS. A suspend point has no duration in program time, but has distinct start
and end points in virtual time. The end-suspension points of two different
program times can be tied together within virtual time, which is called a
{\em{tie point}} and has special significance to parallel constructs. The
physical-time of a core has no relationship to any program time, except for
the various time-guarantees in this definition of VMS.

\begin{figure}[ht]
 \includegraphics[width=2.9in]{../figures/Time_in_VMS_1.png}
 \caption
 {Mapping program time onto Virtual time. \ The
  Master controls creation of new program time lines, and ending suspend
  points. Here, it has ended two suspend points at a common tie-point.
  }
\label{figTimeMapping}
\end{figure}

{\bf Relating time-lines to each other:\ }Figure \ref{figTimeMapping}
illustrates how trace-segments relate to suspend points, and map onto
virtual time. A trace segment starts in virtual time where suspend is ended, as seen.
 In fact, the two trace segments shown have a common
start-point within virtual time. This is because the parallelism semantics
chose to start them at the same point -- this is what a tie
point is.  A key note is that the lengths in virtual time have no relation to
the lengths in program-time. The only defined feature is that those two
trace-segments have a common start-point in virtual time. This means that the
two suspend points are considered to be tied together.

{\bf The Key VMS guarantee:\ }\label{VMSguarantee}Being tied together
means that all physical events that can be observed by both program-times are
covered by the key VMS guarantee: any events triggered before the common
suspend point in one program time are guaranteed visible in the other program
time after the common suspend point. They {\em{might}} be visible before,
but it's not guaranteed. In addition, events triggered after the common
suspend point in one are guaranteed not visible before the common suspend
point in the other. {\em{This two-part guarantee can be considered the heart
of VMS.}}

Intuitively, a tie-point separates before it from after such that tied program
times agree (illustrated with code in Section \ref{secApp} Figure \ref{figAnimVP}). But the subtlety is
that events triggered before the tie-point, {\em{might}} be visible to the
other before, and ones triggered after {\em{might not}} be visible to the
other after -- physical events triggered before are only guaranteed visible
{\em{after}} the tie point, and events after are only guaranteed
{\em{not}} visible {\em{before}} the tie point.

This is a form of bounded non-determinism. The pattern of suspension
end-points determines which trace-segments overlap in Virtual time, and events
triggered in one might be visible in overlapped ones. But no guarantees cover
these. If one segment tries to observe, it will see events triggered by
overlapped segments in non-deterministic order.

The VMS implementation defines which physical events are covered by the key
VMS guarantee (reads/writes, network communication, DMA, I/O).

{\bf Globally consistent sequential order:\ }VMS maps suspend-start,
suspend-end, and hence tie-points, to a globally-consistent sequential order
in Virtual time. This enables  one of VMS's key benefits: sequential
algorithms for parallel constructs.

Tie points define parallel behavior, so an implementation of how to choose tie
points equals an implementation of parallel constructs. The Master chooses
tie-points, so plugging code to choose tie-points into the Master equals
plugging in parallel constructs.

\subsection{Scheduling State}

Scheduling state is used in VMS to organize internal activity, for enforcing
the guarantees.

\texttt{- }VPs have three scheduling states: {\em{Animated}},
{\em{Blocked}}, {\em{Ready}} (Figure \ref{figStates}).

\texttt{- }VPs in Animated are {\em{allowed}} to advance
program-time with {\em{local}} physical time.

\texttt{- }VPs in Blocked and Ready do not advance their
program-time.

\texttt{- }Animated has two physical states:
{\em{Progressing}} and {\em{Stalled}}.

\texttt{- }VPs in Progressing advance program-time with local
physical time, those in Stalled do not (allowing non-semantic suspend).

\texttt{- }Scheduling states are defined in Virtual time only.

\texttt{- }Progressing and Stalled are defined in local
physical time only (invisible in Virtual).


\begin{figure}[h]
  \includegraphics{../figures/Scheduling_states_2.png}
  \caption{Scheduling states of a slave VP in the VMS model.}
  \label{figStates}
\end{figure}

Some important points: 1) only VPs Animated can trigger physical events that
are seen in other program time-lines; 2) the distinction between Blocked vs
Stalled is that a VP has to explicitly execute a VMS primitive operation to
enter Blocked, making it part of the semantics of parallelism constructs. In
contrast, Stalled happens invisibly, with no effect on semantic behavior. It
is due to hardware events hidden inside VMS, such as interrupts.

The Ready state is used to separate the parallelism-construct behavior from
the scheduling behavior. It acts as a ``staging area'' for scheduling. VPs
placed into this state are {\em{allowed}} to be animated, then the scheduler
decides when and where.

A subtle but illustrative point is that actions {\em{outside}} a given
program time cause the VP to transition Blocked$\rightarrow$Ready, which
contrasts to lock algorithms like spin-locks or Dijkstra's, where the
concurrency-related behavior takes place {\em{inside}} program time.

{\bf Transition Between Slave Scheduling States:\ }

\texttt{- }VPs transition states as shown in Figure \ref{figStates}.

\texttt{- }Animated$\rightarrow$Blocked is caused by a Slave VP
executing the Suspend VMS primitive.

\texttt{- }Blocked$\rightarrow$Ready is determined by the
semantics implemented in the plugin.

\texttt{- }Ready$\rightarrow$Animated is determined by the
scheduler in the plugin.

\texttt{- }Transitions in scheduling state have a globally
consistent order in Virtual time.


The parallelism primitives executed by a program do not control change in
scheduling states. They merely communicate messages to the Master, via a VMS
supplied primitive. Inside the Master, the plugin's parallelism construct
implementation processes the message. Based on that, it performs changes in
state from Blocked$\rightarrow$Ready, creates new VPs, and dissipates existing
VPs. Most communication from Slave to Master requires the VP to suspend when
it sends the message. A few messages, like creating new Slave may be sent
without suspending.

The suspend primitive decouples local physical time from Virtual time.
Execution causes immediate transition to Stalled in physical time, then the
Master performs Animated$\rightarrow$Blocked, fixing that transition in
Virtual time. The only relationship is causality. This weak relation is what
allows suspension-points to be serialized in Virtual time, which in turn is
what allows using sequential algorithms to implement parallelism constructs.


\subsection{Plugins}

\begin{figure}[ht]
  \includegraphics{../figures/VMS-core__plugins.png}
  \caption
 {
  The Master has been split into a generic core and a language-specific plug-in.
  The core encapsulates the hardware and remains the same across applications.
  The plug-in is part of the parallelism-construct implementation. It is
  loaded separately onto the hardware and linked to the application when run.
 }
 \label{figMasterSplit}
\end{figure}


The Master entity has two parts, a generic core part and a plugin (Figure \ref{figMasterSplit}).
 The core part of the Master is implemented as part of
VMS-core. The plug-in supplies two functions: the communication-handler and
the scheduler, both having a standard prototype. The communication-handler
implements the parallelism constructs, while scheduler assigns VPs to cores.

An {\em{instance}} of a plugin is created as part of initializing an
application, and the instance holds the semantic and scheduling state for that
run of the application. This state, combined with the virtual processor states
of the slaves created during that application run, represents progress of the
work of the application. \ For example, multi-tasking is performed simply by
the Master switching among plug-in instances when it has a resource to offer
to a scheduler. The parallelism-semantic state holds all information needed to
resume (hardware state, such as TLB and cache-tags is inside VMS-core).



\section{Internal Workings of Our Implementation}
\label{secInternal}

We name the elements of our example implementation and describe their logical
function, then relate them to the abstract model. We then step through the
operation of the elements.

{\bf Elements and Their Logical Function:\ }As illustrated in Figure \ref{figInternals},
our VMS implementation is organized around physical cores.
 Each core has its own {\em{master
virtual-processor}}, \texttt{masterVP}, and a {\em{physical-core controller}}, which communicate via a set of
scheduling slots, \texttt{schedSlot}. The Master in the abstract definition
is implemented by the multiple \texttt{masterVP}s plus a particular plugin
instance with its shared parallelism-semantic state (seen at the top).

On a given core, only one of: the core-controller, \texttt{masterVP}, or a
slave VP, is animated at any point in local physical time. Each
\texttt{masterVP} animates the same function, called
\texttt{master\_loop}, and each slave VP animates a function from the
application, starting with the top-level function the slave is created with,
and following its call sequence. The core controller is implemented here as a
Linux pthread that runs the \texttt{core\_loop} function.

Switching between VPs is done by executing a VMS primitive that suspends the
VP. This switches the physical core over to the controller, by jumping to the
start of the \texttt{core\_loop} function, which chooses the next VP and
switches to that (switching is detailed in Section \ref{secApp} Figure \ref{figAssembly}).

{\bf Relation to Abstract Model:\ }We chose to implement the Master entity
of the model by a set of \texttt{masterVP}s, plus plug-in functions and
shared parallelism-semantic state. What we call VMS-core consists of this
implementation of the Master, plus the core-controllers, plus the VMS
primitive libraries, for creating new VPs and dissipating existing VPs, suspending VPs,
and communicating from slave VP to Master. In Figure \ref{figInternals},
everything in green is part of VMS-core, while the plugin is in red, and
application code appears as blue, inside the slave VP.

Virtual time in the model is implemented via a combination of four things: a
\texttt{masterLock} (not shown) that guarantees non-overlap of
\texttt{masterVP} trace-segments; the \texttt{master\_loop} which performs
transition Animated$\rightarrow$Blocked; the \texttt{comm\_handler\_fn}
which performs Blocked$\rightarrow$Ready and the \texttt{scheduler\_fn}
which performs Ready$\rightarrow$Animated. \ Each state transition is one step
of Virtual time; is guaranteed sequential by the non-overlap of
\texttt{masterVP} trace segments; and is global due to being in
parallelism-semantic state that is shared (top of Figure \ref{figInternals}).

Transitions Progressing$\rightleftarrows$Stalled within the Animated state are
invisible to the parallelism semantics, the Master, and Virtual time, and so
have no effect on the elements seen.


\begin{figure*}[!t]
  \includegraphics[width=5in]{../figures/VMS-core__internal_workings.png}
  \caption
  { Internal elements of our example VMS implementation
  }
  \label{figInternals}
\end{figure*}

{\bf Steps of Operation:\ }The steps of operation are numbered, in Figure \ref{figInternals}.
 Taking them in order, 1) \texttt{master\_loop} scans
the scheduling slots to see which ones' slaves have suspended since the
previous scan. \ 2) It hands these to the \texttt{comm\_handler\_fn} plugged
in (which equals transition Animated$\rightarrow$Blocked). \ 3) The VP has a
request attached, and data in it causes \ the \texttt{comm\_handler\_fn}
to manipulate data structures in the shared parallelism-semantic state. \
These structures hold all the slaves in the blocked state (code-level detail
in Figure \ref{figReqHdlr}, Section \ref{secApp}). \ 4) Some requests cause slaves to be moved to a
\texttt{readyQ} on one of the cores (Blocked$\rightarrow$Ready). Which
core's \texttt{readyQ} receives the slave is under plugin control,
determined by a combination of request contents, semantic state and physical
machine state. 5) During the scan, the \texttt{master\_loop} also looks for
empty slots, and for each calls the \texttt{scheduler\_fn} plugged in. It
chooses a slave from the \texttt{readyQ} on the core animating
\texttt{master\_loop}. \ 6) The \texttt{master\_loop} then places the
slave VP's pointer into the scheduling slot (Ready$\rightarrow$Animated),
making it available to the \texttt{core\_loop}. 7) When done with the scan,
\texttt{masterVP} suspends, switching animation back to the
\texttt{core\_loop}. \ 8) \texttt{core\_loop} takes slave VPs out of the
slots, then 9) switches animation to them. \ 10) When a slave self-suspends,
animation returns to the \texttt{core\_loop} (detail in code in Figure 9),
which picks another, until 11) all slots are empty and the
\texttt{core\_loop} switches animation to the \texttt{masterVP}.

{\bf Enabling sequential implementation of parallelism semantics:\ }All of
that happens on each core separately, but in this particular implementation we
use a central \texttt{masterLock} to ensure that only one core's
\texttt{masterVP} can be active at any time.  This guarantees non-overlap
of trace-segments from different \texttt{masterVP}s, allowing the plugins to
use sequential algorithms, without a performance penalty, as verified in
Section \ref{secResults}.

Relating this to the abstract model: the parallelism-semantic behavior of the
Master is implemented by the communication handler, in the plugin. It thus
runs in the Master time referred to, in the model, in Section \ref{secAbsModel}. Requests are
sent to the Master by self-suspension of the slaves, but sit idle until the
other slaves in the scheduling slots have also run. This is the passive
behavior of requests that was noted in Section \ref{secAbsModel}, which allows the
\texttt{masterVP}s to remain suspended until needed.  This in turn enables
the \texttt{masterVP}s from different cores to be non-overlapped. It is the
non-overlap that enables the algorithms for the parallelism semantics to be
sequential.




\section{Code Level View}
\label{secApp}

To relate the abstract model and the internal elements to application code and
parallelism-library code, we give code snippets that illustrate key features.
 We start with the application then work down through the sequence of calls,
to the plugin, using our SSR {\cite{VMSHome}} parallelism-library as an
example.

In general, applications are either written in terms of a full custom language
that has its own syntax, or a base language with a parallelism library, which
is often called an {\em{embedded language}}.  Our demonstrators, VCilk
{\cite{VMSHome}}, Vthread, and SSR, are all parallelism libraries. A full
custom language would follow the standard practice of performing
source-to-source transform, from custom syntax into C plus parallelism-library
calls.

{\bf SSR:\ }SSR stands for Synchronous Send-Receive, and details of its
calls and internal implementation will be given throughout this section. It
has two types of construct. The first, called {\em{from-to}} has two calls:
\texttt{SSR\_send\_from\_to} and \texttt{SSR\_receive\_from\_to}, both of
which specify the sending VP as well as the receiving VP. \ The other, called
{\em{of-type}} also has two calls: \texttt{SSR\_\_send\_of\_type\_to} and
\texttt{SSR\_\_receive\_of\_type}, which allow a receiver to accept from
anonymous senders, but select according to type of message.


% An example of a double column floating figure using two subfigures.
% (The subfig.sty package must be loaded for this to work.)
% The subfigure \label commands are set within each subfloat command, the
% \label for the overall figure must come after \caption.
% \hfil must be used as a separator to get equal spacing.
% The subfigure.sty package works much the same way, except \subfigure is
% used instead of \subfloat.
%
%\begin{figure*}[!t]
%\centerline{\subfloat[Case I]\includegraphics[width=2.5in]{subfigcase1}%
%\label{fig_first_case}}
%\hfil
%\subfloat[Case II]{\includegraphics[width=2.5in]{subfigcase2}%
%\label{fig_second_case}}}
%\caption{Simulation results}
%\label{fig_sim}
%\end{figure*}
%
% Note that often IEEE papers with subfigures do not employ subfigure
% captions (using the optional argument to \subfloat), but instead will
% reference/describe all of them (a), (b), etc., within the main caption.


{\bf Application View:\ } Figure \ref{figAnimVP} shows snippets of application code, which use the SSR parallelism
library. The most important feature is that all calls take a pointer to the
VP that is animating the call. This is seen at the top of the figure where slave VP creation takes a pointer
to the VP asking for creation. Below that is the standard prototype for top level functions,
showing that the function receives a pointer to the VP it is the top level function for.

The pointer is placed on the stack by VMS when it creates the VP, and is the means by
which the application comes into possession of the pointer. This animating VP is
passed to all library calls made from there. For example, the bottom shows a pointer to the
animating VP placed in the position of sender in
the \texttt{send} construct call.  Correspondingly, for the \texttt{receive} construct,
the position of receiving VP is filled by the VP animating the call.

\begin{figure}[ht]
{\noindent
{\scriptsize
{\small Creating a new processor:}
 \begin{verbatim}
newProcessor = SSR__create_procr( &top_VP_fn,
 paramsPtr, animatingVP ); \end{verbatim}

{\small prototype for the top level function:}
 \begin{verbatim}
top_VP_fn( void *parameterStrucPtr, VirtProcr 
 *animatingVP ); \end{verbatim}

{\small handing animating VP to parallelism constructs:}
 \begin{verbatim}
SSR__send_from_to( messagePtr, animatingVP, 
 receivingVP );
messagePtr = SSR__receive_from_to( sendingVP,
 animatingVP ); \end{verbatim}
}
}
\caption
{
Application code snippets showing that all calls to the parallelism library
take the VP animating that call as a parameter.
}
\label{figAnimVP}
\end{figure}

Relating these to the internal elements of our implementation, the
 \texttt{animatingVP} suspends inside each of these
calls, passing a request (generated in the library) to one of the \texttt{masterVP}s.
 The \texttt{masterVP} then calls the \texttt{comm-handler}
 plugin, and so on, as described in Section  \ref{secInternal}.

For the \texttt{SSR\_\_create\_processor} call, the comm-handler
 in turn calls a VMS primitive to perform the creation.
  The primitive places a pointer to the newly created VP onto its stack, so that when
\texttt{top\_VP\_fn} is later animated, it sees the VP-pointer as a
parameter passed to it. \ All
application code is either such a top-level function, or has one at the root
of the call-stack.

The send and receive calls both suspend their animating VP. When both have
been called, the communication handler pairs them up and resumes both. This
ties time-lines together, invoking the VMS guarantee. Both
application-functions know, because of the VMS guarantee (Section \ref{secAbsModel}), that
writes to shared variables made before the send call by the sender are visible
to the receiver after the receive call. This is the programmer's view of tying
together the local time-lines of two different VPs, as defined in Section \ref{secAbsModel}.


{\bf Concurrency-Library View:\ }A parallelism library function, in
general, only creates a request, sends it, and returns, as seen below.  To
send a request, it uses the combined request-and-suspend VMS primitive that
attaches the request then suspends the VP.  The primitive requires the
pointer to the VP, to attach the request and to suspend it.


\begin{figure}[ht]
{\noindent
{\scriptsize
 \begin{verbatim}
void * SSR__receive_from_to( VirtProcr *sendVP,
 VirtProcr *receiveVP )
 { SSRSemReq  reqData;
   reqData.receiveVP = receiveVP;
   reqData.sendVP    = sendVP;
   reqData.reqType   = receive_from_to;
   VMS__send_sem_request( &reqData, receiveVP );
   return receiveVP->dataReturnedFromRequest;
 } \end{verbatim}
}
}
\caption{Implementation of SSR's receive\_from\_to library function.}
\label{figImplLib}
\end{figure}


In Figure \ref{figImplLib}, notice that the request's data is on the stack of the virtual
processor that's animating the call, which is the \texttt{receiveVP}.  The
\texttt{VMS\_\_send\_sem\_request} suspends this VP, which changes the
physical core's stack pointer to a different stack. So the request data is
guaranteed to remain undisturbed while the VP is suspended.

Figure \ref{figAssembly} shows the implementation of the VMS suspend primitive. As seen in
Figure \ref{figInternals}, suspending the \texttt{receiveVP} involves
switching to the \texttt{core\_loop}. In our implementation, this is done by
switching to the stack of the pthread pinned to the physical core and then
jumping to the start-point of \texttt{core\_loop}.

This code uses standard techniques commonly employed in co-routine
implementations. Tuning effort spent in \texttt{core\_loop} is inherited by
all applications.


\begin{figure}[ht]
{\noindent
{\scriptsize
 \begin{verbatim}
VMS__suspend_procr( VirtProcr *animatingVP )
 { animatingVP->resumeInstrAddr = &&ResumePt;
    //GCC takes addr of label
   animatingVP->schedSlotAssignedTo->
    isNewlySuspended = TRUE;
    //for master_loop to see
   <assembly code stores current physical core's
     stack reg into animatingVP struct>
   <assembly code loads stack reg with core_loop
     stackPtr, which was saved into animatingVP>
   <assembly code jmps to core_loop start instr
     addr, which was also saved into animatingVP>
 ResumePt:
   return;
 } \end{verbatim}
}
}
  \caption
{Implementation of VMS suspend processor.
Re-animating the virtual processor reverses this sequence. \ It saves the
\texttt{core\_loop}'s resume instr-addr and stack ptr into the VP structure,
then loads the VP's stack ptr and jmps to its \texttt{resumeInstrAddr}.
}
\label{figAssembly}
\end{figure}


{\bf Plugin View:\ }SSR's communication handler dispatches on the
\texttt{reqType} field of the request data, as set by the
\texttt{SSR\_\_receive\_from\_to} code. It calls the handler code in
Figure \ref{figReqHdlr}. This constructs a hash-key, by concatenating the from-VP's pointer
with the to-VP's pointer. Then it looks-up that key in the hash-table that SSR uses
to match sends with receives, which is in the shared semantic state seen at
the top of Figure \ref{figInternals} in Section \ref{secInternal}.

The most important feature in Figure \ref{figReqHdlr} is that both send and receive
will construct the same key, so will find the same
hash entry.  Whichever request is handled first in Virtual time will see the
hash entry empty, and save itself in that entry.  The second to arrive
sees the waiting request and then resumes both VPs, by putting them into their
\texttt{readyQ}s.

Access to the shared hash
table can be considered private, as in a sequential algorithm.  This is because
 our VMS-core implementation ensures that only
one handler on one core is executing at a time. 


\begin{figure}[ht]
{\noindent
{\scriptsize
 \begin{verbatim}
handle_receive_from_to( VirtProcr *requestingVP,
 SSRSemReq *reqData, SSRSemEnv *semEnv )
 { commHashTbl = semEnv->communicatingVPHashTable;
   key[0] = reqData->receiveVP;   key[1] = 
    reqData->sendVP; //send uses same key
   waitingReqData = lookup_and_remove( key,
    commHashTbl );  //get waiting request
   if( waitingReqData != NULL )
    { resume_virt_procr( waitingReqData->sendVP );
      resume_virt_procr( waitingReqData->
       receiveVP );
    }
   else
      insert( key, reqData, commHashTbl );
       //receive is first to arrive, make it wait
 } \end{verbatim}
}
}
  \caption
{Pseudo-code of communication-handler for
\texttt{receive\_from\_to} request type. The \texttt{semEnv} is a pointer
to the shared parallelism-semantic state seen at the top of Figure
\ref{figInternals}.
}
\label{figReqHdlr}
\end{figure}




\section{Results}
\label{secResults}

{\bf Setup:\ }We implemented blocked dense matrix multiply with right
sub-matrices copied to transposed form. We ran on a 1 socket by 4 core
Core2Quad 2.4Ghz chip.

{\bf Implementation-Time:\ }As shown in Table 1, time to implement the
three parallel libraries averages 2 days each. As an example of productivity,
adding nested transactions, parallel singleton, and atomic function-execution
to SSR required a single afternoon, totaling less than 100 lines of C code.


\begin{table}[ht]
\caption
{Person-days to design, code, and
  test each parallelism library. L.O.C. is lines of (original) C code, excluding libraries and
  comments.
}
\label{tabPersonDaysLang}
\begin{center}
\begin{tabular}{llll}
    & SSR & Vthread & VCilk\\
    Design & 4 & 1 & 0.5\\
    Code & 2 & 0.5 & 0.5\\
    Test & 1 & 0.5 & 0.5\\
    L.O.C. & 470 & 290 & 310
  \end{tabular}
\end{center}
\end{table}


{\bf Execution Performance:\ }Performance of VMS is seen in Table
\ref{tabOverheadCycles}. The code is not optimized, but rather written
to be easy to understand and modify. The majority of the plugin time
is lost to cache misses because the shared parallelism-semantic state moves
between cores on a majority of accesses. Acquisition of the master lock is slow due to the hardware
implementing the CAS instruction.

Existing techniques will likely improve performance, such
as localizing semantic data to cores, splitting malloc across the cores,
pre-allocating slabs that are recycled, and pre-fetching. However, in many cases, several hundred
nano-seconds per task is as optimal as the applications can benefit from.

\begin{table}[ht]
\caption
{Cycles of overhead, per scheduled
    slave. ``comp only'' is perfect memory, ``comp + mem'' is actual cycles.
    ``Plugin-concur'' only concurrency requests, ``plugin-all'' includes
    create and malloc requests. Two significant digits due to variability.
}
\label{tabOverheadCycles}
\begin{center}
\begin{tabular}{|l|ll|r|r|}
\hline
    &  & comp & comp\\
    &  & only & +mem\\
    VMS Only & \texttt{master\_loop} &  91 &  \ 110\\
    & switch VPs  & 77 &  \ 130\\
    & (malloc) & 160 & 2300\\
    & (create VP)  & 540 & 3800\\
\hline
 Language: &  &  & \\
 SSR & plugin -- concur & 190 & 540\\
 & plugin -- all & 530 & 2200\\
 & lock &  &  250\\
 Vthread & plugin -- concur & 66 & 710\\
 & plugin -- all & 180 & 1500\\
 & lock &  &  250\\
 VCilk & plugin -- concur & 65 & 260\\
 & plugin -- all & 330 & 1800\\
 & lock &  &  250\\
\hline
\end{tabular}
\end{center}
\end{table}



{\bf Head to Head:\ }We
compare our implementation of the \texttt{spawn} and \texttt{sync}
constructs against Cilk 5.4, on the top in Table
\ref{tabHeadToHead}, which shows that the same application code has similar
performance. For large matrices, Cilk 5.4's better use of the memory hierarchy
achieves 23\% better performance. However, for small matrices, VCilk is better,
 with a factor 2 lower overhead. Cilk 5.4 does not allow controlling the number of spawn events it
actually executes, and chooses to run smaller matrices sequentially, limiting
our comparison.

When comparing to pthreads, our VMS based implementation has more than an
order of magnitude better overhead per invocation of mutex or condition
variable functionality, as seen on the bottom of Table \ref{tabHeadToHead}.
Applications that inherently have short trace segments will synchronize often
and benefit the most from Vthread.



\begin{table}[ht]
\caption
{On top, exe time in seconds for MM. Below, overhead for pthread vs Vthread. First column is cycles
  for perfect memory and second is total measured cycles. pthread cycles are
  deduced from round-trip experiments.
}
\label{tabHeadToHead}
\begin{center}
\begin{tabular}{|l@{\ }|@{\ }lr|}
\hline
\rule{0pt}{12pt}
    Matrix size&Lang.&sec.\\
[2pt]\hline
    81x81 & Cilk & 0.017\\
    & VCilk & 0.008\\
\hline
    324x324 & Cilk & 0.13\\
    & VCilk & 0.13\\
\hline
    648x648 & Cilk & 0.71\\
    & VCilk & 0.85\\
\hline
    1296x1296 & Cilk & 4.8\\
    & VCilk & 6.2 \\
[2pt]\hline
\end{tabular}
\begin{tabular}{c}
\begin{tabular}{|l|rr|r|r|}
\hline
    operation & \multicolumn{2}{c|}{Vthread} & pthread & ratio\\
\hline
    & comp & total &  & \\
    & only &  &  & \\
[2pt]\hline
    mutex\_lock & 85 & 1050 & 50,000 & 48:1\\
    mutex\_unlock & 85 & 610 & 45,000 & 74:1\\
    cond\_wait  & 85 & 850 & 60,000 & 71:1\\
    cond\_signal & 90 & 650 & 60,000 & 92:1\\
\hline
\end{tabular}\\
\end{tabular}
\end{center}
\end{table}

\section{Conclusion}\label{secConclusion}

We have shown an alternative to the Thread model that enables easier-to-use
parallelism constructs by splitting
the scheduler open, to accept new parallelism constructs in the form of
plugins. This gives the language control over assigning virtual
processors to physical cores, for performance, debugging, and flexibility
benefits. Parallelism constructs of languages are implementable using
sequential algorithms, within a matter of days, while maintaining low run-time
overhead, on the order of a few hundred nano-seconds per concurrency
operation.

\bibliography{Bib_for_papers}

\end{document}
