VMS/0__Writings/kshalle: 0__Papers/Future_Architecture/Future

annotate 0__Papers/Future_Architecture/Future_Architecture.tex @ 4:01aefc2be990

HotPar 2012 with figs -- for Merten to fix up visuals of figs

author	Me@portablequad
date	Sat, 21 Jan 2012 20:51:58 -0800
parents
children

rev	line source
Me@0	1
Me@0	2
Me@0	3 \documentclass[conference]{IEEEtran}
Me@0	4 %
Me@0	5 \usepackage{makeidx,geometry,amssymb,graphicx,calc,ifthen}
Me@0	6 %
Me@0	7
Me@0	8 % * CITATION PACKAGES *
Me@0	9 %
Me@0	10 %\usepackage{cite}
Me@0	11 % cite.sty was written by Donald Arseneau
Me@0	12 % V1.6 and later of IEEEtran pre-defines the format of the cite.sty package
Me@0	13 % \cite{} output to follow that of IEEE. Loading the cite package will
Me@0	14 % result in citation numbers being automatically sorted and properly
Me@0	15 % "compressed/ranged". e.g., [1], [9], [2], [7], [5], [6] without using
Me@0	16 % cite.sty will become [1], [2], [5]--[7], [9] using cite.sty. cite.sty's
Me@0	17 % \cite will automatically add leading space, if needed. Use cite.sty's
Me@0	18 % noadjust option (cite.sty V3.8 and later) if you want to turn this off.
Me@0	19 % cite.sty is already installed on most LaTeX systems. Be sure and use
Me@0	20 % version 4.0 (2003-05-27) and later if using hyperref.sty. cite.sty does
Me@0	21 % not currently provide for hyperlinked citations.
Me@0	22 % The latest version can be obtained at:
Me@0	23 % http://www.ctan.org/tex-archive/macros/latex/contrib/cite/
Me@0	24 % The documentation is contained in the cite.sty file itself.
Me@0	25
Me@0	26
Me@0	27
Me@0	28
Me@0	29
Me@0	30
Me@0	31 % * GRAPHICS RELATED PACKAGES *
Me@0	32 %
Me@0	33 \ifCLASSINFOpdf
Me@0	34 % \usepackage[pdftex]{graphicx}
Me@0	35 % declare the path(s) where your graphic files are
Me@0	36 % \graphicspath{{../pdf/}{../jpeg/}}
Me@0	37 % and their extensions so you won't have to specify these with
Me@0	38 % every instance of \includegraphics
Me@0	39 % \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
Me@0	40 \else
Me@0	41 % or other class option (dvipsone, dvipdf, if not using dvips). graphicx
Me@0	42 % will default to the driver specified in the system graphics.cfg if no
Me@0	43 % driver is specified.
Me@0	44 % \usepackage[dvips]{graphicx}
Me@0	45 % declare the path(s) where your graphic files are
Me@0	46 % \graphicspath{{../eps/}}
Me@0	47 % and their extensions so you won't have to specify these with
Me@0	48 % every instance of \includegraphics
Me@0	49 % \DeclareGraphicsExtensions{.eps}
Me@0	50 \fi
Me@0	51 % graphicx was written by David Carlisle and Sebastian Rahtz. It is
Me@0	52 % required if you want graphics, photos, etc. graphicx.sty is already
Me@0	53 % installed on most LaTeX systems. The latest version and documentation can
Me@0	54 % be obtained at:
Me@0	55 % http://www.ctan.org/tex-archive/macros/latex/required/graphics/
Me@0	56 % Another good source of documentation is "Using Imported Graphics in
Me@0	57 % LaTeX2e" by Keith Reckdahl which can be found as epslatex.ps or
Me@0	58 % epslatex.pdf at: http://www.ctan.org/tex-archive/info/
Me@0	59 %
Me@0	60 % latex, and pdflatex in dvi mode, support graphics in encapsulated
Me@0	61 % postscript (.eps) format. pdflatex in pdf mode supports graphics
Me@0	62 % in .pdf, .jpeg, .png and .mps (metapost) formats. Users should ensure
Me@0	63 % that all non-photo figures use a vector format (.eps, .pdf, .mps) and
Me@0	64 % not a bitmapped formats (.jpeg, .png). IEEE frowns on bitmapped formats
Me@0	65 % which can result in "jaggedy"/blurry rendering of lines and letters as
Me@0	66 % well as large increases in file sizes.
Me@0	67 %
Me@0	68 % You can find documentation about the pdfTeX application at:
Me@0	69 % http://www.tug.org/applications/pdftex
Me@0	70
Me@0	71
Me@0	72
Me@0	73
Me@0	74
Me@0	75 % * MATH PACKAGES *
Me@0	76 %
Me@0	77 %\usepackage[cmex10]{amsmath}
Me@0	78 % A popular package from the American Mathematical Society that provides
Me@0	79 % many useful and powerful commands for dealing with mathematics. If using
Me@0	80 % it, be sure to load this package with the cmex10 option to ensure that
Me@0	81 % only type 1 fonts will utilized at all point sizes. Without this option,
Me@0	82 % it is possible that some math symbols, particularly those within
Me@0	83 % footnotes, will be rendered in bitmap form which will result in a
Me@0	84 % document that can not be IEEE Xplore compliant!
Me@0	85 %
Me@0	86 % Also, note that the amsmath package sets \interdisplaylinepenalty to 10000
Me@0	87 % thus preventing page breaks from occurring within multiline equations. Use:
Me@0	88 %\interdisplaylinepenalty=2500
Me@0	89 % after loading amsmath to restore such page breaks as IEEEtran.cls normally
Me@0	90 % does. amsmath.sty is already installed on most LaTeX systems. The latest
Me@0	91 % version and documentation can be obtained at:
Me@0	92 % http://www.ctan.org/tex-archive/macros/latex/required/amslatex/math/
Me@0	93
Me@0	94
Me@0	95
Me@0	96
Me@0	97
Me@0	98 % * SPECIALIZED LIST PACKAGES *
Me@0	99 %
Me@0	100 %\usepackage{algorithmic}
Me@0	101 % algorithmic.sty was written by Peter Williams and Rogerio Brito.
Me@0	102 % This package provides an algorithmic environment fo describing algorithms.
Me@0	103 % You can use the algorithmic environment in-text or within a figure
Me@0	104 % environment to provide for a floating algorithm. Do NOT use the algorithm
Me@0	105 % floating environment provided by algorithm.sty (by the same authors) or
Me@0	106 % algorithm2e.sty (by Christophe Fiorio) as IEEE does not use dedicated
Me@0	107 % algorithm float types and packages that provide these will not provide
Me@0	108 % correct IEEE style captions. The latest version and documentation of
Me@0	109 % algorithmic.sty can be obtained at:
Me@0	110 % http://www.ctan.org/tex-archive/macros/latex/contrib/algorithms/
Me@0	111 % There is also a support site at:
Me@0	112 % http://algorithms.berlios.de/index.html
Me@0	113 % Also of interest may be the (relatively newer and more customizable)
Me@0	114 % algorithmicx.sty package by Szasz Janos:
Me@0	115 % http://www.ctan.org/tex-archive/macros/latex/contrib/algorithmicx/
Me@0	116
Me@0	117
Me@0	118
Me@0	119
Me@0	120 % * ALIGNMENT PACKAGES *
Me@0	121 %
Me@0	122 %\usepackage{array}
Me@0	123 % Frank Mittelbach's and David Carlisle's array.sty patches and improves
Me@0	124 % the standard LaTeX2e array and tabular environments to provide better
Me@0	125 % appearance and additional user controls. As the default LaTeX2e table
Me@0	126 % generation code is lacking to the point of almost being broken with
Me@0	127 % respect to the quality of the end results, all users are strongly
Me@0	128 % advised to use an enhanced (at the very least that provided by array.sty)
Me@0	129 % set of table tools. array.sty is already installed on most systems. The
Me@0	130 % latest version and documentation can be obtained at:
Me@0	131 % http://www.ctan.org/tex-archive/macros/latex/required/tools/
Me@0	132
Me@0	133
Me@0	134 %\usepackage{mdwmath}
Me@0	135 %\usepackage{mdwtab}
Me@0	136 % Also highly recommended is Mark Wooding's extremely powerful MDW tools,
Me@0	137 % especially mdwmath.sty and mdwtab.sty which are used to format equations
Me@0	138 % and tables, respectively. The MDWtools set is already installed on most
Me@0	139 % LaTeX systems. The lastest version and documentation is available at:
Me@0	140 % http://www.ctan.org/tex-archive/macros/latex/contrib/mdwtools/
Me@0	141
Me@0	142
Me@0	143 % IEEEtran contains the IEEEeqnarray family of commands that can be used to
Me@0	144 % generate multiline equations as well as matrices, tables, etc., of high
Me@0	145 % quality.
Me@0	146
Me@0	147
Me@0	148 %\usepackage{eqparbox}
Me@0	149 % Also of notable interest is Scott Pakin's eqparbox package for creating
Me@0	150 % (automatically sized) equal width boxes - aka "natural width parboxes".
Me@0	151 % Available at:
Me@0	152 % http://www.ctan.org/tex-archive/macros/latex/contrib/eqparbox/
Me@0	153
Me@0	154
Me@0	155
Me@0	156
Me@0	157
Me@0	158 % * SUBFIGURE PACKAGES *
Me@0	159 %\usepackage[tight,footnotesize]{subfigure}
Me@0	160 % subfigure.sty was written by Steven Douglas Cochran. This package makes it
Me@0	161 % easy to put subfigures in your figures. e.g., "Figure 1a and 1b". For IEEE
Me@0	162 % work, it is a good idea to load it with the tight package option to reduce
Me@0	163 % the amount of white space around the subfigures. subfigure.sty is already
Me@0	164 % installed on most LaTeX systems. The latest version and documentation can
Me@0	165 % be obtained at:
Me@0	166 % http://www.ctan.org/tex-archive/obsolete/macros/latex/contrib/subfigure/
Me@0	167 % subfigure.sty has been superceeded by subfig.sty.
Me@0	168
Me@0	169
Me@0	170
Me@0	171 %\usepackage[caption=false]{caption}
Me@0	172 %\usepackage[font=footnotesize]{subfig}
Me@0	173 % subfig.sty, also written by Steven Douglas Cochran, is the modern
Me@0	174 % replacement for subfigure.sty. However, subfig.sty requires and
Me@0	175 % automatically loads Axel Sommerfeldt's caption.sty which will override
Me@0	176 % IEEEtran.cls handling of captions and this will result in nonIEEE style
Me@0	177 % figure/table captions. To prevent this problem, be sure and preload
Me@0	178 % caption.sty with its "caption=false" package option. This is will preserve
Me@0	179 % IEEEtran.cls handing of captions. Version 1.3 (2005/06/28) and later
Me@0	180 % (recommended due to many improvements over 1.2) of subfig.sty supports
Me@0	181 % the caption=false option directly:
Me@0	182 %\usepackage[caption=false,font=footnotesize]{subfig}
Me@0	183 %
Me@0	184 % The latest version and documentation can be obtained at:
Me@0	185 % http://www.ctan.org/tex-archive/macros/latex/contrib/subfig/
Me@0	186 % The latest version and documentation of caption.sty can be obtained at:
Me@0	187 % http://www.ctan.org/tex-archive/macros/latex/contrib/caption/
Me@0	188
Me@0	189
Me@0	190
Me@0	191
Me@0	192 % * FLOAT PACKAGES *
Me@0	193 %
Me@0	194 %\usepackage{fixltx2e}
Me@0	195 % fixltx2e, the successor to the earlier fix2col.sty, was written by
Me@0	196 % Frank Mittelbach and David Carlisle. This package corrects a few problems
Me@0	197 % in the LaTeX2e kernel, the most notable of which is that in current
Me@0	198 % LaTeX2e releases, the ordering of single and double column floats is not
Me@0	199 % guaranteed to be preserved. Thus, an unpatched LaTeX2e can allow a
Me@0	200 % single column figure to be placed prior to an earlier double column
Me@0	201 % figure. The latest version and documentation can be found at:
Me@0	202 % http://www.ctan.org/tex-archive/macros/latex/base/
Me@0	203
Me@0	204
Me@0	205
Me@0	206 %\usepackage{stfloats}
Me@0	207 % stfloats.sty was written by Sigitas Tolusis. This package gives LaTeX2e
Me@0	208 % the ability to do double column floats at the bottom of the page as well
Me@0	209 % as the top. (e.g., "\begin{figure*}[!b]" is not normally possible in
Me@0	210 % LaTeX2e). It also provides a command:
Me@0	211 %\fnbelowfloat
Me@0	212 % to enable the placement of footnotes below bottom floats (the standard
Me@0	213 % LaTeX2e kernel puts them above bottom floats). This is an invasive package
Me@0	214 % which rewrites many portions of the LaTeX2e float routines. It may not work
Me@0	215 % with other packages that modify the LaTeX2e float routines. The latest
Me@0	216 % version and documentation can be obtained at:
Me@0	217 % http://www.ctan.org/tex-archive/macros/latex/contrib/sttools/
Me@0	218 % Documentation is contained in the stfloats.sty comments as well as in the
Me@0	219 % presfull.pdf file. Do not use the stfloats baselinefloat ability as IEEE
Me@0	220 % does not allow \baselineskip to stretch. Authors submitting work to the
Me@0	221 % IEEE should note that IEEE rarely uses double column equations and
Me@0	222 % that authors should try to avoid such use. Do not be tempted to use the
Me@0	223 % cuted.sty or midfloat.sty packages (also by Sigitas Tolusis) as IEEE does
Me@0	224 % not format its papers in such ways.
Me@0	225
Me@0	226
Me@0	227
Me@0	228
Me@0	229
Me@0	230 % * PDF, URL AND HYPERLINK PACKAGES *
Me@0	231 %
Me@0	232 %\usepackage{url}
Me@0	233 % url.sty was written by Donald Arseneau. It provides better support for
Me@0	234 % handling and breaking URLs. url.sty is already installed on most LaTeX
Me@0	235 % systems. The latest version can be obtained at:
Me@0	236 % http://www.ctan.org/tex-archive/macros/latex/contrib/misc/
Me@0	237 % Read the url.sty source comments for usage information. Basically,
Me@0	238 % \url{my_url_here}.
Me@0	239
Me@0	240
Me@0	241
Me@0	242
Me@0	243
Me@0	244 % * Do not adjust lengths that control margins, column widths, etc. *
Me@0	245 % * Do not use packages that alter fonts (such as pslatex). *
Me@0	246 % There should be no need to do such things with IEEEtran.cls V1.6 and later.
Me@0	247 % (Unless specifically asked to do so by the journal or conference you plan
Me@0	248 % to submit to, of course. )
Me@0	249
Me@0	250
Me@0	251 % correct bad hyphenation here
Me@0	252 \hyphenation{op-tical net-works semi-conduc-tor}
Me@0	253
Me@0	254
Me@0	255 \begin{document}
Me@0	256 \bibliographystyle{plain}
Me@0	257 %
Me@0	258
Me@0	259 \title{The Shape of Future Architectures}
Me@0	260
Me@0	261 \author
Me@0	262 {
Me@0	263 \IEEEauthorblockN{Sean Halle}
Me@0	264 \IEEEauthorblockA
Me@0	265 {
Me@0	266 University of California at Santa Cruz\\
Me@0	267 and INRIA Paris\\
Me@0	268 Email: sean.halle@inria.fr
Me@0	269 }
Me@0	270 \and
Me@0	271 \IEEEauthorblockN{Albert Cohen}
Me@0	272 \IEEEauthorblockA
Me@0	273 { INRIA Paris\\
Me@0	274 Email: albert.cohen@infria.fr
Me@0	275 }
Me@0	276 }
Me@0	277
Me@0	278
Me@0	279 \maketitle
Me@0	280 %
Me@0	281
Me@0	282 \begin{abstract}
Me@0	283 Process technology is rapidly approaching multiple fundamental physical limits that combine to increase the cost of producing chips faster than the increase in density. Whether or not process technology continues to shrink, the economics will slow the demand for chips from future processes. This will, for the first time, shift the design decisions from being shortest time-to-market towards being the greatest efficiency per transistor -- either in computation delivered towards the goal end-user measurement such as throughput, or in energy per operation.
Me@0	284
Me@0	285 The implication for processor architecture is to accelerate the trend away from "easy to program" structures such as out-of-order pipelines and coherent shared memory, and towards hierarchical designs that force application information to be exploited in explicitly managing the placement and movement of data, in order to keep it as local for as much of the computation as possible. The cost of communication will be the dominant factor driving the shape of future architectures, both from a computation-bottleneck cost and an energy per operation cost.
Me@0	286
Me@0	287 Integration will be pushed towards the third dimension in order to mitigate the communication costs -- indeed main memories have long been on that path. This approach will hasten the development of 3 dimensional integration processes. The architecture implication is that
Me@0	288 \end{abstract}
Me@0	289
Me@0	290 \section{Motivation}
Me@0	291
Me@0	292 Threads were invented to be virtual sequential processors, but are
Me@0	293 problematic when one has multiple physical cores. Not only
Me@0	294 are they difficult to use for parallel code, but their parallel performance is poor
Me@0	295 because they block the language from controlling
Me@0	296 task placement. The language often knows which cores are likely to already
Me@0	297 have data in the cache which a new task requires as input. If it had
Me@0	298 control, the language could place tasks where the data is likely to already
Me@0	299 reside, reducing communication and increasing performance and power
Me@0	300 efficiency. A thread replacement is needed that is both easy to use in
Me@0	301 applications and gives the language control over task placement.
Me@0	302
Me@0	303 The first question is, what is the easiest to use parallelism construct? There
Me@0	304 exist many to choose from: Threads (for legacy reasons); Actors
Me@0	305 {\cite{Hewitt10}}{\cite{Actors97}}; Components {\cite{ComponentModel00}};
Me@0	306 process calculi {\cite{hoare78}} {\cite{milner99}};
Me@0	307 coordination languages {\cite{Gelernter85Linda}}; and new ones
Me@0	308 continually invented.
Me@0	309
Me@0	310 We demonstrate in this paper a thread replacement that lets any such parallelism
Me@0	311 construct be plugged in. It is the first extensible hardware abstraction,
Me@0	312 allowing to plug-in both preferred parallelism constructs and preferred
Me@0	313 runtime scheduler. We focus in this introductory paper on
Me@0	314 the parallelism construct support, illustrating how to implement synchronous
Me@0	315 \texttt{send}-\texttt{receive} constructs motivated by process calculi,
Me@0	316 the \texttt{spawn} and \texttt{sync} constructs from Cilk, and even
Me@0	317 \texttt{mutex} and \texttt{condition variable} constructs from pthreads.
Me@0	318 However, the schedulers we implemented have no
Me@0	319 interesting performance optimizations.
Me@0	320
Me@0	321 The next question is, how much effort is required to implement a plugin? Our
Me@0	322 mechanism protects the plugin from exposure to concurrency issues, allowing
Me@0	323 the parallelism constructs to be implemented with sequential algorithms. It
Me@0	324 accomplishes this by introducing a ``virtual'' time that sequentializes events
Me@0	325 that appear, to the program, to be simultaneous.
Me@0	326
Me@0	327 We call our abstraction Virtualized Master-Slave, or VMS. We demonstrate a user-level
Me@0	328 implementation in this paper, although it is ideally implemented as the base
Me@0	329 hardware abstraction of the OS.
Me@0	330
Me@0	331 It involves subtle concepts related to time in the program vs time in the
Me@0	332 plugin vs physical time. We explain it in four levels, starting with an abstract definition
Me@0	333 and moving down to implementation.
Me@0	334
Me@0	335 The first level is the theoretical definition, given in Section \ref{secAbsModel}. Second is the elements
Me@0	336 of our implementation and how they relate to the theoretical definition, in Section \ref{secInternal}.
Me@0	337 Third is the application code point of view, in Section \ref{secApp}, which is tied back to both
Me@0	338 the abstract model and the internal elements. The fourth view,
Me@0	339 also in Section \ref{secApp}, shows the implementation of the plugin for a parallel
Me@0	340 construct. Additionally,
Me@0	341 measurements of effectiveness appear
Me@0	342 in Section \ref{secResults} and conclusion is in \ref{secConclusion}.
Me@0	343
Me@0	344 \section{Background and Related Work}
Me@0	345
Me@0	346 {\bf Side-stepping OS Threads:\ }User-level thread packages and most parallel
Me@0	347 language runtimes have to side-step OS threads, by pinning one to each core,
Me@0	348 which effectively gives the user-level package control over the core. Our VMS
Me@0	349 implementation also does this. We are not claiming in this paper to have the
Me@0	350 OS level implementation of VMS that the model is capable of -- just the
Me@0	351 user-space version.
Me@0	352
Me@0	353 Related work either provides a variation on the Thread model, or is a full
Me@0	354 language with included parallelism constructs. For example,
Me@0	355 TBB {\cite{TBBHome}} is a user-space thread package with added features.
Me@0	356 It allows
Me@0	357 tasks to create other tasks and control when to start their
Me@0	358 execution. However, one cannot plug in alternate parallelism-control constructs, such as
Me@0	359 rendez-vous style send and receive, which would have to be made using locks.
Me@0	360 Likewise, Cilk {\cite{CILKHome}}
Me@0	361 provides a simpler and easier to use version of TBB's task scheduler, but is
Me@0	362 limited to only tree-shaped task graphs. OpenMP {\cite{OpenMPHome}} allows
Me@0	363 creating tasks and controlling their execution order,
Me@0	364 but is a language, not a hardware abstraction. All of these may
Me@0	365 alternatively be implemented via plugins to VMS, with similar performance.
Me@0	366
Me@0	367 In contrast, VMS is the first hardware abstraction that doesn't impose its own concurrency
Me@0	368 semantics, but rather takes preferred ones as plugins. This makes it not a
Me@0	369 language itself, but a {\em{support}} mechanism for language level
Me@0	370 parallelism constructs. The parallelism constructs in Actors, Components,
Me@0	371 pthreads, and so on may all be implemented as VMS plugins.
Me@0	372
Me@0	373 {\bf Virtual Processor (VP) Definition:\ }We want to avoid the confusion
Me@0	374 associated with the terms ``thread'' and ``task'' so will use the term
Me@0	375 {\em{virtual processor}} (VP), which we define as state in combination with
Me@0	376 the ability to animate code or {\em{an additional level of virtual
Me@0	377 processors}}. The state consists of a stack with its contents, a program
Me@0	378 counter, a pointer to top of stack, and a pointer to the current stack frame.
Me@0	379
Me@0	380 \section{Abstract Definition of VMS}
Me@0	381 \label{secAbsModel}
Me@0	382
Me@0	383 We give an intuitive overview, then add details in the following sub-sections.
Me@0	384
Me@0	385 {\bf Intuitive Overview:\ }VMS is concerned primarily with time and
Me@0	386 guarantees about it. This is because parallelism constructs control how the
Me@0	387 time-lines of different virtual processors intersect. They also guarantee
Me@0	388 relations of time lines to hardware events.
Me@0	389
Me@0	390 As an example, consider a program that writes into a data structure in one
Me@0	391 time-line, then calls a \texttt{send} construct, meanwhile in a different
Me@0	392 time-line it calls the \texttt{receive} construct then reads the data
Me@0	393 structure. The constructs should guarantee that all data written before the
Me@0	394 \texttt{send} is readable in the other time-line after the
Me@0	395 \texttt{receive}. VMS provides primitive guarantees, which plugged-in code
Me@0	396 builds upon to provide such higher-level guarantees.
Me@0	397
Me@0	398 To support parallelism constructs, VMS provides: primitive operations to create and suspend
Me@0	399 VPs; a way for plugged-in code to control when each VP is (re)started; and
Me@0	400 time-related guarantees. These are enforced on all hardware, be it shared
Me@0	401 memory or distributed, with strong memory consistency or weak.
Me@0	402
Me@0	403 {\bf Definition in Three Parts:\ }We give the abstract definition in three
Me@0	404 parts: a definition of the elements of a VMS computation system; a definition
Me@0	405 of time and the key VMS guarantee; and a definition of virtual processor
Me@0	406 scheduling states and transitions between them.
Me@0	407
Me@0	408 The definition we give is for VMS {\em{with plugins present}}. \ Hence, it
Me@0	409 covers the behavior of all possible parallelism constructs implementable with
Me@0	410 VMS. The Master mentioned in the definition is an abstract
Me@0	411 entity, with a plugin present. In practice, this Master entity is implemented
Me@0	412 as part of a core VMS, and plugins later added. This VMS-core is the
Me@0	413 hardware abstraction. It hides the physical hardware behind an interface that
Me@0	414 creates virtual processors and enforces well-defined time-behavior.
Me@0	415
Me@0	416
Me@0	417 \subsection{The Elements of a VMS Computation System}
Me@0	418
Me@0	419 \texttt{- }A VMS program has multiple VPs, which are Slaves,
Me@0	420 each with an independent time-line.
Me@0	421
Me@0	422 \texttt{- }A schedule of Slaves is generated by a Master
Me@0	423 entity, from within separate time-line(s).
Me@0	424
Me@0	425 \texttt{- }A schedule is defined as the set of points at which
Me@0	426 VPs are (re)animated.
Me@0	427
Me@0	428 \texttt{- }All semantic parallelism behavior is invoked via
Me@0	429 communication with the Master.
Me@0	430
Me@0	431 \texttt{- }Communication with the Master happens by using a
Me@0	432 model-provided primitive, which causes {\em{voluntary}} suspension of the
Me@0	433 program's VP.
Me@0	434
Me@0	435
Me@0	436 What is important here is: that the choice of which VP is animated, at which
Me@0	437 point, happens in a separate time-line; and that the VPs suspend voluntarily
Me@0	438 for each parallelism construct. This means that {\em{scheduling is separated
Me@0	439 from the application code}}, the key point.
Me@0	440
Me@0	441 The Master entity appears to be a single entity to the slaves, but may be
Me@0	442 implemented by multiple Master VPs hidden inside the VMS implementation.
Me@0	443
Me@0	444 VPs use the Master as an intermediary to: semantically communicate with each
Me@0	445 other; cause creation of new program VPs; and to influence re-animation of
Me@0	446 VPs. As a subtlety, notice that hardware mechanisms, such as coherent shared
Me@0	447 memory, allow communication to take place that is not visible to the
Me@0	448 parallelism constructs. Parallelism constructs must be separately called in
Me@0	449 order to make use of shared variable communication safe.
Me@0	450
Me@0	451 {\bf Definitions:\ }VMS is intended only for hardware systems that consist of processing elements
Me@0	452 connected by communication. We define a memory-space to be a processing
Me@0	453 element, albeit without the ability to transform data. We define a
Me@0	454 {\em{physical core}} to be a processing element that {\em{does}} transform
Me@0	455 data, and require that it execute a sequential stream of instructions. We
Me@0	456 define a program-time as the sequence of instructions animated by a Slave VP
Me@0	457 (which is eventually animated by a physical core). A Slave VP has associated
Me@0	458 {\em{scheduling state}} that, among other things, relates to how its
Me@0	459 program-time progresses relative to physical time on the cores.
Me@0	460
Me@0	461 \subsection{Time in VMS}
Me@0	462
Me@0	463 \texttt{- }VMS has three levels of time: {\em{Program time}},
Me@0	464 {\em{ Master time}}, and {\em{Virtual time}}.
Me@0	465
Me@0	466 \texttt{- }Program time is local to a Slave VP, measured in
Me@0	467 instruction executions.
Me@0	468
Me@0	469 \texttt{- }Master time is hidden from the program and is
Me@0	470 independent from all Program times.
Me@0	471
Me@0	472 \texttt{- }Virtual time is the ordered set of changes in
Me@0	473 scheduling state of Slave VPs.
Me@0	474
Me@0	475
Me@0	476
Me@0	477 What is most important here is that Virtual time defines a global sequential
Me@0	478 ordering. This ordering is consistent with the key VMS guarantee (given
Me@0	479 below), and each point in it is computed within Master time.
Me@0	480
Me@0	481 Also, the independence between program times and master time has subtle
Me@0	482 advantages. It enables elegant enforcement of the VMS guarantee, and
Me@0	483 implementation simplifications that become clear after gaining deep
Me@0	484 implementation knowledge.
Me@0	485
Me@0	486 In VMS, each event relevant to parallel semantics is tied to a transition of
Me@0	487 the state of a Slave VP. This means that implementing the behavior of parallel
Me@0	488 semantics is equivalent to controlling the order of transitions of state of
Me@0	489 virtual processors.
Me@0	490
Me@0	491 {\bf Definitions:\ }The stream of instructions in a given program-time is
Me@0	492 broken into a number of {\em{trace-segments}}, separated by suspension
Me@0	493 points. Each trace-segment is animated by a single physical core, but not
Me@0	494 necessarily the same core as animated the other trace segments. A suspend
Me@0	495 point is created by a Slave VP executing the ``suspend'' primitive provided by
Me@0	496 VMS. A suspend point has no duration in program time, but has distinct start
Me@0	497 and end points in virtual time. The end-suspension points of two different
Me@0	498 program times can be tied together within virtual time, which is called a
Me@0	499 {\em{tie point}} and has special significance to parallel constructs. The
Me@0	500 physical-time of a core has no relationship to any program time, except for
Me@0	501 the various time-guarantees in this definition of VMS.
Me@0	502
Me@0	503 \begin{figure}[ht]
Me@0	504 \includegraphics[width=2.9in]{../figures/Time_in_VMS_1.png}
Me@0	505 \caption
Me@0	506 {Mapping program time onto Virtual time. \ The
Me@0	507 Master controls creation of new program time lines, and ending suspend
Me@0	508 points. Here, it has ended two suspend points at a common tie-point.
Me@0	509 }
Me@0	510 \label{figTimeMapping}
Me@0	511 \end{figure}
Me@0	512
Me@0	513 {\bf Relating time-lines to each other:\ }Figure \ref{figTimeMapping}
Me@0	514 illustrates how trace-segments relate to suspend points, and map onto
Me@0	515 virtual time. A trace segment starts in virtual time where suspend is ended, as seen.
Me@0	516 In fact, the two trace segments shown have a common
Me@0	517 start-point within virtual time. This is because the parallelism semantics
Me@0	518 chose to start them at the same point -- this is what a tie
Me@0	519 point is. A key note is that the lengths in virtual time have no relation to
Me@0	520 the lengths in program-time. The only defined feature is that those two
Me@0	521 trace-segments have a common start-point in virtual time. This means that the
Me@0	522 two suspend points are considered to be tied together.
Me@0	523
Me@0	524 {\bf The Key VMS guarantee:\ }\label{VMSguarantee}Being tied together
Me@0	525 means that all physical events that can be observed by both program-times are
Me@0	526 covered by the key VMS guarantee: any events triggered before the common
Me@0	527 suspend point in one program time are guaranteed visible in the other program
Me@0	528 time after the common suspend point. They {\em{might}} be visible before,
Me@0	529 but it's not guaranteed. In addition, events triggered after the common
Me@0	530 suspend point in one are guaranteed not visible before the common suspend
Me@0	531 point in the other. {\em{This two-part guarantee can be considered the heart
Me@0	532 of VMS.}}
Me@0	533
Me@0	534 Intuitively, a tie-point separates before it from after such that tied program
Me@0	535 times agree (illustrated with code in Section \ref{secApp} Figure \ref{figAnimVP}). But the subtlety is
Me@0	536 that events triggered before the tie-point, {\em{might}} be visible to the
Me@0	537 other before, and ones triggered after {\em{might not}} be visible to the
Me@0	538 other after -- physical events triggered before are only guaranteed visible
Me@0	539 {\em{after}} the tie point, and events after are only guaranteed
Me@0	540 {\em{not}} visible {\em{before}} the tie point.
Me@0	541
Me@0	542 This is a form of bounded non-determinism. The pattern of suspension
Me@0	543 end-points determines which trace-segments overlap in Virtual time, and events
Me@0	544 triggered in one might be visible in overlapped ones. But no guarantees cover
Me@0	545 these. If one segment tries to observe, it will see events triggered by
Me@0	546 overlapped segments in non-deterministic order.
Me@0	547
Me@0	548 The VMS implementation defines which physical events are covered by the key
Me@0	549 VMS guarantee (reads/writes, network communication, DMA, I/O).
Me@0	550
Me@0	551 {\bf Globally consistent sequential order:\ }VMS maps suspend-start,
Me@0	552 suspend-end, and hence tie-points, to a globally-consistent sequential order
Me@0	553 in Virtual time. This enables one of VMS's key benefits: sequential
Me@0	554 algorithms for parallel constructs.
Me@0	555
Me@0	556 Tie points define parallel behavior, so an implementation of how to choose tie
Me@0	557 points equals an implementation of parallel constructs. The Master chooses
Me@0	558 tie-points, so plugging code to choose tie-points into the Master equals
Me@0	559 plugging in parallel constructs.
Me@0	560
Me@0	561 \subsection{Scheduling State}
Me@0	562
Me@0	563 Scheduling state is used in VMS to organize internal activity, for enforcing
Me@0	564 the guarantees.
Me@0	565
Me@0	566 \texttt{- }VPs have three scheduling states: {\em{Animated}},
Me@0	567 {\em{Blocked}}, {\em{Ready}} (Figure \ref{figStates}).
Me@0	568
Me@0	569 \texttt{- }VPs in Animated are {\em{allowed}} to advance
Me@0	570 program-time with {\em{local}} physical time.
Me@0	571
Me@0	572 \texttt{- }VPs in Blocked and Ready do not advance their
Me@0	573 program-time.
Me@0	574
Me@0	575 \texttt{- }Animated has two physical states:
Me@0	576 {\em{Progressing}} and {\em{Stalled}}.
Me@0	577
Me@0	578 \texttt{- }VPs in Progressing advance program-time with local
Me@0	579 physical time, those in Stalled do not (allowing non-semantic suspend).
Me@0	580
Me@0	581 \texttt{- }Scheduling states are defined in Virtual time only.
Me@0	582
Me@0	583 \texttt{- }Progressing and Stalled are defined in local
Me@0	584 physical time only (invisible in Virtual).
Me@0	585
Me@0	586
Me@0	587 \begin{figure}[h]
Me@0	588 \includegraphics{../figures/Scheduling_states_2.png}
Me@0	589 \caption{Scheduling states of a slave VP in the VMS model.}
Me@0	590 \label{figStates}
Me@0	591 \end{figure}
Me@0	592
Me@0	593 Some important points: 1) only VPs Animated can trigger physical events that
Me@0	594 are seen in other program time-lines; 2) the distinction between Blocked vs
Me@0	595 Stalled is that a VP has to explicitly execute a VMS primitive operation to
Me@0	596 enter Blocked, making it part of the semantics of parallelism constructs. In
Me@0	597 contrast, Stalled happens invisibly, with no effect on semantic behavior. It
Me@0	598 is due to hardware events hidden inside VMS, such as interrupts.
Me@0	599
Me@0	600 The Ready state is used to separate the parallelism-construct behavior from
Me@0	601 the scheduling behavior. It acts as a ``staging area'' for scheduling. VPs
Me@0	602 placed into this state are {\em{allowed}} to be animated, then the scheduler
Me@0	603 decides when and where.
Me@0	604
Me@0	605 A subtle but illustrative point is that actions {\em{outside}} a given
Me@0	606 program time cause the VP to transition Blocked$\rightarrow$Ready, which
Me@0	607 contrasts to lock algorithms like spin-locks or Dijkstra's, where the
Me@0	608 concurrency-related behavior takes place {\em{inside}} program time.
Me@0	609
Me@0	610 {\bf Transition Between Slave Scheduling States:\ }
Me@0	611
Me@0	612 \texttt{- }VPs transition states as shown in Figure \ref{figStates}.
Me@0	613
Me@0	614 \texttt{- }Animated$\rightarrow$Blocked is caused by a Slave VP
Me@0	615 executing the Suspend VMS primitive.
Me@0	616
Me@0	617 \texttt{- }Blocked$\rightarrow$Ready is determined by the
Me@0	618 semantics implemented in the plugin.
Me@0	619
Me@0	620 \texttt{- }Ready$\rightarrow$Animated is determined by the
Me@0	621 scheduler in the plugin.
Me@0	622
Me@0	623 \texttt{- }Transitions in scheduling state have a globally
Me@0	624 consistent order in Virtual time.
Me@0	625
Me@0	626
Me@0	627 The parallelism primitives executed by a program do not control change in
Me@0	628 scheduling states. They merely communicate messages to the Master, via a VMS
Me@0	629 supplied primitive. Inside the Master, the plugin's parallelism construct
Me@0	630 implementation processes the message. Based on that, it performs changes in
Me@0	631 state from Blocked$\rightarrow$Ready, creates new VPs, and dissipates existing
Me@0	632 VPs. Most communication from Slave to Master requires the VP to suspend when
Me@0	633 it sends the message. A few messages, like creating new Slave may be sent
Me@0	634 without suspending.
Me@0	635
Me@0	636 The suspend primitive decouples local physical time from Virtual time.
Me@0	637 Execution causes immediate transition to Stalled in physical time, then the
Me@0	638 Master performs Animated$\rightarrow$Blocked, fixing that transition in
Me@0	639 Virtual time. The only relationship is causality. This weak relation is what
Me@0	640 allows suspension-points to be serialized in Virtual time, which in turn is
Me@0	641 what allows using sequential algorithms to implement parallelism constructs.
Me@0	642
Me@0	643
Me@0	644 \subsection{Plugins}
Me@0	645
Me@0	646 \begin{figure}[ht]
Me@0	647 \includegraphics{../figures/VMS-core__plugins.png}
Me@0	648 \caption
Me@0	649 {
Me@0	650 The Master has been split into a generic core and a language-specific plug-in.
Me@0	651 The core encapsulates the hardware and remains the same across applications.
Me@0	652 The plug-in is part of the parallelism-construct implementation. It is
Me@0	653 loaded separately onto the hardware and linked to the application when run.
Me@0	654 }
Me@0	655 \label{figMasterSplit}
Me@0	656 \end{figure}
Me@0	657
Me@0	658
Me@0	659 The Master entity has two parts, a generic core part and a plugin (Figure \ref{figMasterSplit}).
Me@0	660 The core part of the Master is implemented as part of
Me@0	661 VMS-core. The plug-in supplies two functions: the communication-handler and
Me@0	662 the scheduler, both having a standard prototype. The communication-handler
Me@0	663 implements the parallelism constructs, while scheduler assigns VPs to cores.
Me@0	664
Me@0	665 An {\em{instance}} of a plugin is created as part of initializing an
Me@0	666 application, and the instance holds the semantic and scheduling state for that
Me@0	667 run of the application. This state, combined with the virtual processor states
Me@0	668 of the slaves created during that application run, represents progress of the
Me@0	669 work of the application. \ For example, multi-tasking is performed simply by
Me@0	670 the Master switching among plug-in instances when it has a resource to offer
Me@0	671 to a scheduler. The parallelism-semantic state holds all information needed to
Me@0	672 resume (hardware state, such as TLB and cache-tags is inside VMS-core).
Me@0	673
Me@0	674
Me@0	675
Me@0	676 \section{Internal Workings of Our Implementation}
Me@0	677 \label{secInternal}
Me@0	678
Me@0	679 We name the elements of our example implementation and describe their logical
Me@0	680 function, then relate them to the abstract model. We then step through the
Me@0	681 operation of the elements.
Me@0	682
Me@0	683 {\bf Elements and Their Logical Function:\ }As illustrated in Figure \ref{figInternals},
Me@0	684 our VMS implementation is organized around physical cores.
Me@0	685 Each core has its own {\em{master
Me@0	686 virtual-processor}}, \texttt{masterVP}, and a {\em{physical-core controller}}, which communicate via a set of
Me@0	687 scheduling slots, \texttt{schedSlot}. The Master in the abstract definition
Me@0	688 is implemented by the multiple \texttt{masterVP}s plus a particular plugin
Me@0	689 instance with its shared parallelism-semantic state (seen at the top).
Me@0	690
Me@0	691 On a given core, only one of: the core-controller, \texttt{masterVP}, or a
Me@0	692 slave VP, is animated at any point in local physical time. Each
Me@0	693 \texttt{masterVP} animates the same function, called
Me@0	694 \texttt{master\_loop}, and each slave VP animates a function from the
Me@0	695 application, starting with the top-level function the slave is created with,
Me@0	696 and following its call sequence. The core controller is implemented here as a
Me@0	697 Linux pthread that runs the \texttt{core\_loop} function.
Me@0	698
Me@0	699 Switching between VPs is done by executing a VMS primitive that suspends the
Me@0	700 VP. This switches the physical core over to the controller, by jumping to the
Me@0	701 start of the \texttt{core\_loop} function, which chooses the next VP and
Me@0	702 switches to that (switching is detailed in Section \ref{secApp} Figure \ref{figAssembly}).
Me@0	703
Me@0	704 {\bf Relation to Abstract Model:\ }We chose to implement the Master entity
Me@0	705 of the model by a set of \texttt{masterVP}s, plus plug-in functions and
Me@0	706 shared parallelism-semantic state. What we call VMS-core consists of this
Me@0	707 implementation of the Master, plus the core-controllers, plus the VMS
Me@0	708 primitive libraries, for creating new VPs and dissipating existing VPs, suspending VPs,
Me@0	709 and communicating from slave VP to Master. In Figure \ref{figInternals},
Me@0	710 everything in green is part of VMS-core, while the plugin is in red, and
Me@0	711 application code appears as blue, inside the slave VP.
Me@0	712
Me@0	713 Virtual time in the model is implemented via a combination of four things: a
Me@0	714 \texttt{masterLock} (not shown) that guarantees non-overlap of
Me@0	715 \texttt{masterVP} trace-segments; the \texttt{master\_loop} which performs
Me@0	716 transition Animated$\rightarrow$Blocked; the \texttt{comm\_handler\_fn}
Me@0	717 which performs Blocked$\rightarrow$Ready and the \texttt{scheduler\_fn}
Me@0	718 which performs Ready$\rightarrow$Animated. \ Each state transition is one step
Me@0	719 of Virtual time; is guaranteed sequential by the non-overlap of
Me@0	720 \texttt{masterVP} trace segments; and is global due to being in
Me@0	721 parallelism-semantic state that is shared (top of Figure \ref{figInternals}).
Me@0	722
Me@0	723 Transitions Progressing$\rightleftarrows$Stalled within the Animated state are
Me@0	724 invisible to the parallelism semantics, the Master, and Virtual time, and so
Me@0	725 have no effect on the elements seen.
Me@0	726
Me@0	727
Me@0	728 \begin{figure*}[!t]
Me@0	729 \includegraphics[width=5in]{../figures/VMS-core__internal_workings.png}
Me@0	730 \caption
Me@0	731 { Internal elements of our example VMS implementation
Me@0	732 }
Me@0	733 \label{figInternals}
Me@0	734 \end{figure*}
Me@0	735
Me@0	736 {\bf Steps of Operation:\ }The steps of operation are numbered, in Figure \ref{figInternals}.
Me@0	737 Taking them in order, 1) \texttt{master\_loop} scans
Me@0	738 the scheduling slots to see which ones' slaves have suspended since the
Me@0	739 previous scan. \ 2) It hands these to the \texttt{comm\_handler\_fn} plugged
Me@0	740 in (which equals transition Animated$\rightarrow$Blocked). \ 3) The VP has a
Me@0	741 request attached, and data in it causes \ the \texttt{comm\_handler\_fn}
Me@0	742 to manipulate data structures in the shared parallelism-semantic state. \
Me@0	743 These structures hold all the slaves in the blocked state (code-level detail
Me@0	744 in Figure \ref{figReqHdlr}, Section \ref{secApp}). \ 4) Some requests cause slaves to be moved to a
Me@0	745 \texttt{readyQ} on one of the cores (Blocked$\rightarrow$Ready). Which
Me@0	746 core's \texttt{readyQ} receives the slave is under plugin control,
Me@0	747 determined by a combination of request contents, semantic state and physical
Me@0	748 machine state. 5) During the scan, the \texttt{master\_loop} also looks for
Me@0	749 empty slots, and for each calls the \texttt{scheduler\_fn} plugged in. It
Me@0	750 chooses a slave from the \texttt{readyQ} on the core animating
Me@0	751 \texttt{master\_loop}. \ 6) The \texttt{master\_loop} then places the
Me@0	752 slave VP's pointer into the scheduling slot (Ready$\rightarrow$Animated),
Me@0	753 making it available to the \texttt{core\_loop}. 7) When done with the scan,
Me@0	754 \texttt{masterVP} suspends, switching animation back to the
Me@0	755 \texttt{core\_loop}. \ 8) \texttt{core\_loop} takes slave VPs out of the
Me@0	756 slots, then 9) switches animation to them. \ 10) When a slave self-suspends,
Me@0	757 animation returns to the \texttt{core\_loop} (detail in code in Figure 9),
Me@0	758 which picks another, until 11) all slots are empty and the
Me@0	759 \texttt{core\_loop} switches animation to the \texttt{masterVP}.
Me@0	760
Me@0	761 {\bf Enabling sequential implementation of parallelism semantics:\ }All of
Me@0	762 that happens on each core separately, but in this particular implementation we
Me@0	763 use a central \texttt{masterLock} to ensure that only one core's
Me@0	764 \texttt{masterVP} can be active at any time. This guarantees non-overlap
Me@0	765 of trace-segments from different \texttt{masterVP}s, allowing the plugins to
Me@0	766 use sequential algorithms, without a performance penalty, as verified in
Me@0	767 Section \ref{secResults}.
Me@0	768
Me@0	769 Relating this to the abstract model: the parallelism-semantic behavior of the
Me@0	770 Master is implemented by the communication handler, in the plugin. It thus
Me@0	771 runs in the Master time referred to, in the model, in Section \ref{secAbsModel}. Requests are
Me@0	772 sent to the Master by self-suspension of the slaves, but sit idle until the
Me@0	773 other slaves in the scheduling slots have also run. This is the passive
Me@0	774 behavior of requests that was noted in Section \ref{secAbsModel}, which allows the
Me@0	775 \texttt{masterVP}s to remain suspended until needed. This in turn enables
Me@0	776 the \texttt{masterVP}s from different cores to be non-overlapped. It is the
Me@0	777 non-overlap that enables the algorithms for the parallelism semantics to be
Me@0	778 sequential.
Me@0	779
Me@0	780
Me@0	781
Me@0	782
Me@0	783 \section{Code Level View}
Me@0	784 \label{secApp}
Me@0	785
Me@0	786 To relate the abstract model and the internal elements to application code and
Me@0	787 parallelism-library code, we give code snippets that illustrate key features.
Me@0	788 We start with the application then work down through the sequence of calls,
Me@0	789 to the plugin, using our SSR {\cite{VMSHome}} parallelism-library as an
Me@0	790 example.
Me@0	791
Me@0	792 In general, applications are either written in terms of a full custom language
Me@0	793 that has its own syntax, or a base language with a parallelism library, which
Me@0	794 is often called an {\em{embedded language}}. Our demonstrators, VCilk
Me@0	795 {\cite{VMSHome}}, Vthread, and SSR, are all parallelism libraries. A full
Me@0	796 custom language would follow the standard practice of performing
Me@0	797 source-to-source transform, from custom syntax into C plus parallelism-library
Me@0	798 calls.
Me@0	799
Me@0	800 {\bf SSR:\ }SSR stands for Synchronous Send-Receive, and details of its
Me@0	801 calls and internal implementation will be given throughout this section. It
Me@0	802 has two types of construct. The first, called {\em{from-to}} has two calls:
Me@0	803 \texttt{SSR\_send\_from\_to} and \texttt{SSR\_receive\_from\_to}, both of
Me@0	804 which specify the sending VP as well as the receiving VP. \ The other, called
Me@0	805 {\em{of-type}} also has two calls: \texttt{SSR\_\_send\_of\_type\_to} and
Me@0	806 \texttt{SSR\_\_receive\_of\_type}, which allow a receiver to accept from
Me@0	807 anonymous senders, but select according to type of message.
Me@0	808
Me@0	809
Me@0	810 % An example of a double column floating figure using two subfigures.
Me@0	811 % (The subfig.sty package must be loaded for this to work.)
Me@0	812 % The subfigure \label commands are set within each subfloat command, the
Me@0	813 % \label for the overall figure must come after \caption.
Me@0	814 % \hfil must be used as a separator to get equal spacing.
Me@0	815 % The subfigure.sty package works much the same way, except \subfigure is
Me@0	816 % used instead of \subfloat.
Me@0	817 %
Me@0	818 %\begin{figure*}[!t]
Me@0	819 %\centerline{\subfloat[Case I]\includegraphics[width=2.5in]{subfigcase1}%
Me@0	820 %\label{fig_first_case}}
Me@0	821 %\hfil
Me@0	822 %\subfloat[Case II]{\includegraphics[width=2.5in]{subfigcase2}%
Me@0	823 %\label{fig_second_case}}}
Me@0	824 %\caption{Simulation results}
Me@0	825 %\label{fig_sim}
Me@0	826 %\end{figure*}
Me@0	827 %
Me@0	828 % Note that often IEEE papers with subfigures do not employ subfigure
Me@0	829 % captions (using the optional argument to \subfloat), but instead will
Me@0	830 % reference/describe all of them (a), (b), etc., within the main caption.
Me@0	831
Me@0	832
Me@0	833 {\bf Application View:\ } Figure \ref{figAnimVP} shows snippets of application code, which use the SSR parallelism
Me@0	834 library. The most important feature is that all calls take a pointer to the
Me@0	835 VP that is animating the call. This is seen at the top of the figure where slave VP creation takes a pointer
Me@0	836 to the VP asking for creation. Below that is the standard prototype for top level functions,
Me@0	837 showing that the function receives a pointer to the VP it is the top level function for.
Me@0	838
Me@0	839 The pointer is placed on the stack by VMS when it creates the VP, and is the means by
Me@0	840 which the application comes into possession of the pointer. This animating VP is
Me@0	841 passed to all library calls made from there. For example, the bottom shows a pointer to the
Me@0	842 animating VP placed in the position of sender in
Me@0	843 the \texttt{send} construct call. Correspondingly, for the \texttt{receive} construct,
Me@0	844 the position of receiving VP is filled by the VP animating the call.
Me@0	845
Me@0	846 \begin{figure}[ht]
Me@0	847 {\noindent
Me@0	848 {\scriptsize
Me@0	849 {\small Creating a new processor:}
Me@0	850 \begin{verbatim}
Me@0	851 newProcessor = SSR__create_procr( &top_VP_fn,
Me@0	852 paramsPtr, animatingVP ); \end{verbatim}
Me@0	853
Me@0	854 {\small prototype for the top level function:}
Me@0	855 \begin{verbatim}
Me@0	856 top_VP_fn( void *parameterStrucPtr, VirtProcr
Me@0	857 *animatingVP ); \end{verbatim}
Me@0	858
Me@0	859 {\small handing animating VP to parallelism constructs:}
Me@0	860 \begin{verbatim}
Me@0	861 SSR__send_from_to( messagePtr, animatingVP,
Me@0	862 receivingVP );
Me@0	863 messagePtr = SSR__receive_from_to( sendingVP,
Me@0	864 animatingVP ); \end{verbatim}
Me@0	865 }
Me@0	866 }
Me@0	867 \caption
Me@0	868 {
Me@0	869 Application code snippets showing that all calls to the parallelism library
Me@0	870 take the VP animating that call as a parameter.
Me@0	871 }
Me@0	872 \label{figAnimVP}
Me@0	873 \end{figure}
Me@0	874
Me@0	875 Relating these to the internal elements of our implementation, the
Me@0	876 \texttt{animatingVP} suspends inside each of these
Me@0	877 calls, passing a request (generated in the library) to one of the \texttt{masterVP}s.
Me@0	878 The \texttt{masterVP} then calls the \texttt{comm-handler}
Me@0	879 plugin, and so on, as described in Section \ref{secInternal}.
Me@0	880
Me@0	881 For the \texttt{SSR\_\_create\_processor} call, the comm-handler
Me@0	882 in turn calls a VMS primitive to perform the creation.
Me@0	883 The primitive places a pointer to the newly created VP onto its stack, so that when
Me@0	884 \texttt{top\_VP\_fn} is later animated, it sees the VP-pointer as a
Me@0	885 parameter passed to it. \ All
Me@0	886 application code is either such a top-level function, or has one at the root
Me@0	887 of the call-stack.
Me@0	888
Me@0	889 The send and receive calls both suspend their animating VP. When both have
Me@0	890 been called, the communication handler pairs them up and resumes both. This
Me@0	891 ties time-lines together, invoking the VMS guarantee. Both
Me@0	892 application-functions know, because of the VMS guarantee (Section \ref{secAbsModel}), that
Me@0	893 writes to shared variables made before the send call by the sender are visible
Me@0	894 to the receiver after the receive call. This is the programmer's view of tying
Me@0	895 together the local time-lines of two different VPs, as defined in Section \ref{secAbsModel}.
Me@0	896
Me@0	897
Me@0	898 {\bf Concurrency-Library View:\ }A parallelism library function, in
Me@0	899 general, only creates a request, sends it, and returns, as seen below. To
Me@0	900 send a request, it uses the combined request-and-suspend VMS primitive that
Me@0	901 attaches the request then suspends the VP. The primitive requires the
Me@0	902 pointer to the VP, to attach the request and to suspend it.
Me@0	903
Me@0	904
Me@0	905 \begin{figure}[ht]
Me@0	906 {\noindent
Me@0	907 {\scriptsize
Me@0	908 \begin{verbatim}
Me@0	909 void * SSR__receive_from_to( VirtProcr *sendVP,
Me@0	910 VirtProcr *receiveVP )
Me@0	911 { SSRSemReq reqData;
Me@0	912 reqData.receiveVP = receiveVP;
Me@0	913 reqData.sendVP = sendVP;
Me@0	914 reqData.reqType = receive_from_to;
Me@0	915 VMS__send_sem_request( &reqData, receiveVP );
Me@0	916 return receiveVP->dataReturnedFromRequest;
Me@0	917 } \end{verbatim}
Me@0	918 }
Me@0	919 }
Me@0	920 \caption{Implementation of SSR's receive\_from\_to library function.}
Me@0	921 \label{figImplLib}
Me@0	922 \end{figure}
Me@0	923
Me@0	924
Me@0	925 In Figure \ref{figImplLib}, notice that the request's data is on the stack of the virtual
Me@0	926 processor that's animating the call, which is the \texttt{receiveVP}. The
Me@0	927 \texttt{VMS\_\_send\_sem\_request} suspends this VP, which changes the
Me@0	928 physical core's stack pointer to a different stack. So the request data is
Me@0	929 guaranteed to remain undisturbed while the VP is suspended.
Me@0	930
Me@0	931 Figure \ref{figAssembly} shows the implementation of the VMS suspend primitive. As seen in
Me@0	932 Figure \ref{figInternals}, suspending the \texttt{receiveVP} involves
Me@0	933 switching to the \texttt{core\_loop}. In our implementation, this is done by
Me@0	934 switching to the stack of the pthread pinned to the physical core and then
Me@0	935 jumping to the start-point of \texttt{core\_loop}.
Me@0	936
Me@0	937 This code uses standard techniques commonly employed in co-routine
Me@0	938 implementations. Tuning effort spent in \texttt{core\_loop} is inherited by
Me@0	939 all applications.
Me@0	940
Me@0	941
Me@0	942 \begin{figure}[ht]
Me@0	943 {\noindent
Me@0	944 {\scriptsize
Me@0	945 \begin{verbatim}
Me@0	946 VMS__suspend_procr( VirtProcr *animatingVP )
Me@0	947 { animatingVP->resumeInstrAddr = &&ResumePt;
Me@0	948 //GCC takes addr of label
Me@0	949 animatingVP->schedSlotAssignedTo->
Me@0	950 isNewlySuspended = TRUE;
Me@0	951 //for master_loop to see
Me@0	952 <assembly code stores current physical core's
Me@0	953 stack reg into animatingVP struct>
Me@0	954 <assembly code loads stack reg with core_loop
Me@0	955 stackPtr, which was saved into animatingVP>
Me@0	956 <assembly code jmps to core_loop start instr
Me@0	957 addr, which was also saved into animatingVP>
Me@0	958 ResumePt:
Me@0	959 return;
Me@0	960 } \end{verbatim}
Me@0	961 }
Me@0	962 }
Me@0	963 \caption
Me@0	964 {Implementation of VMS suspend processor.
Me@0	965 Re-animating the virtual processor reverses this sequence. \ It saves the
Me@0	966 \texttt{core\_loop}'s resume instr-addr and stack ptr into the VP structure,
Me@0	967 then loads the VP's stack ptr and jmps to its \texttt{resumeInstrAddr}.
Me@0	968 }
Me@0	969 \label{figAssembly}
Me@0	970 \end{figure}
Me@0	971
Me@0	972
Me@0	973 {\bf Plugin View:\ }SSR's communication handler dispatches on the
Me@0	974 \texttt{reqType} field of the request data, as set by the
Me@0	975 \texttt{SSR\_\_receive\_from\_to} code. It calls the handler code in
Me@0	976 Figure \ref{figReqHdlr}. This constructs a hash-key, by concatenating the from-VP's pointer
Me@0	977 with the to-VP's pointer. Then it looks-up that key in the hash-table that SSR uses
Me@0	978 to match sends with receives, which is in the shared semantic state seen at
Me@0	979 the top of Figure \ref{figInternals} in Section \ref{secInternal}.
Me@0	980
Me@0	981 The most important feature in Figure \ref{figReqHdlr} is that both send and receive
Me@0	982 will construct the same key, so will find the same
Me@0	983 hash entry. Whichever request is handled first in Virtual time will see the
Me@0	984 hash entry empty, and save itself in that entry. The second to arrive
Me@0	985 sees the waiting request and then resumes both VPs, by putting them into their
Me@0	986 \texttt{readyQ}s.
Me@0	987
Me@0	988 Access to the shared hash
Me@0	989 table can be considered private, as in a sequential algorithm. This is because
Me@0	990 our VMS-core implementation ensures that only
Me@0	991 one handler on one core is executing at a time.
Me@0	992
Me@0	993
Me@0	994 \begin{figure}[ht]
Me@0	995 {\noindent
Me@0	996 {\scriptsize
Me@0	997 \begin{verbatim}
Me@0	998 handle_receive_from_to( VirtProcr *requestingVP,
Me@0	999 SSRSemReq reqData, SSRSemEnv semEnv )
Me@0	1000 { commHashTbl = semEnv->communicatingVPHashTable;
Me@0	1001 key[0] = reqData->receiveVP; key[1] =
Me@0	1002 reqData->sendVP; //send uses same key
Me@0	1003 waitingReqData = lookup_and_remove( key,
Me@0	1004 commHashTbl ); //get waiting request
Me@0	1005 if( waitingReqData != NULL )
Me@0	1006 { resume_virt_procr( waitingReqData->sendVP );
Me@0	1007 resume_virt_procr( waitingReqData->
Me@0	1008 receiveVP );
Me@0	1009 }
Me@0	1010 else
Me@0	1011 insert( key, reqData, commHashTbl );
Me@0	1012 //receive is first to arrive, make it wait
Me@0	1013 } \end{verbatim}
Me@0	1014 }
Me@0	1015 }
Me@0	1016 \caption
Me@0	1017 {Pseudo-code of communication-handler for
Me@0	1018 \texttt{receive\_from\_to} request type. The \texttt{semEnv} is a pointer
Me@0	1019 to the shared parallelism-semantic state seen at the top of Figure
Me@0	1020 \ref{figInternals}.
Me@0	1021 }
Me@0	1022 \label{figReqHdlr}
Me@0	1023 \end{figure}
Me@0	1024
Me@0	1025
Me@0	1026
Me@0	1027
Me@0	1028 \section{Results}
Me@0	1029 \label{secResults}
Me@0	1030
Me@0	1031 {\bf Setup:\ }We implemented blocked dense matrix multiply with right
Me@0	1032 sub-matrices copied to transposed form. We ran on a 1 socket by 4 core
Me@0	1033 Core2Quad 2.4Ghz chip.
Me@0	1034
Me@0	1035 {\bf Implementation-Time:\ }As shown in Table 1, time to implement the
Me@0	1036 three parallel libraries averages 2 days each. As an example of productivity,
Me@0	1037 adding nested transactions, parallel singleton, and atomic function-execution
Me@0	1038 to SSR required a single afternoon, totaling less than 100 lines of C code.
Me@0	1039
Me@0	1040
Me@0	1041 \begin{table}[ht]
Me@0	1042 \caption
Me@0	1043 {Person-days to design, code, and
Me@0	1044 test each parallelism library. L.O.C. is lines of (original) C code, excluding libraries and
Me@0	1045 comments.
Me@0	1046 }
Me@0	1047 \label{tabPersonDaysLang}
Me@0	1048 \begin{center}
Me@0	1049 \begin{tabular}{llll}
Me@0	1050 & SSR & Vthread & VCilk\\
Me@0	1051 Design & 4 & 1 & 0.5\\
Me@0	1052 Code & 2 & 0.5 & 0.5\\
Me@0	1053 Test & 1 & 0.5 & 0.5\\
Me@0	1054 L.O.C. & 470 & 290 & 310
Me@0	1055 \end{tabular}
Me@0	1056 \end{center}
Me@0	1057 \end{table}
Me@0	1058
Me@0	1059
Me@0	1060 {\bf Execution Performance:\ }Performance of VMS is seen in Table
Me@0	1061 \ref{tabOverheadCycles}. The code is not optimized, but rather written
Me@0	1062 to be easy to understand and modify. The majority of the plugin time
Me@0	1063 is lost to cache misses because the shared parallelism-semantic state moves
Me@0	1064 between cores on a majority of accesses. Acquisition of the master lock is slow due to the hardware
Me@0	1065 implementing the CAS instruction.
Me@0	1066
Me@0	1067 Existing techniques will likely improve performance, such
Me@0	1068 as localizing semantic data to cores, splitting malloc across the cores,
Me@0	1069 pre-allocating slabs that are recycled, and pre-fetching. However, in many cases, several hundred
Me@0	1070 nano-seconds per task is as optimal as the applications can benefit from.
Me@0	1071
Me@0	1072 \begin{table}[ht]
Me@0	1073 \caption
Me@0	1074 {Cycles of overhead, per scheduled
Me@0	1075 slave. ``comp only'' is perfect memory, ``comp + mem'' is actual cycles.
Me@0	1076 ``Plugin-concur'' only concurrency requests, ``plugin-all'' includes
Me@0	1077 create and malloc requests. Two significant digits due to variability.
Me@0	1078 }
Me@0	1079 \label{tabOverheadCycles}
Me@0	1080 \begin{center}
Me@0	1081 \begin{tabular}{\|l\|ll\|r\|r\|}
Me@0	1082 \hline
Me@0	1083 & & comp & comp\\
Me@0	1084 & & only & +mem\\
Me@0	1085 VMS Only & \texttt{master\_loop} & 91 & \ 110\\
Me@0	1086 & switch VPs & 77 & \ 130\\
Me@0	1087 & (malloc) & 160 & 2300\\
Me@0	1088 & (create VP) & 540 & 3800\\
Me@0	1089 \hline
Me@0	1090 Language: & & & \\
Me@0	1091 SSR & plugin -- concur & 190 & 540\\
Me@0	1092 & plugin -- all & 530 & 2200\\
Me@0	1093 & lock & & 250\\
Me@0	1094 Vthread & plugin -- concur & 66 & 710\\
Me@0	1095 & plugin -- all & 180 & 1500\\
Me@0	1096 & lock & & 250\\
Me@0	1097 VCilk & plugin -- concur & 65 & 260\\
Me@0	1098 & plugin -- all & 330 & 1800\\
Me@0	1099 & lock & & 250\\
Me@0	1100 \hline
Me@0	1101 \end{tabular}
Me@0	1102 \end{center}
Me@0	1103 \end{table}
Me@0	1104
Me@0	1105
Me@0	1106
Me@0	1107 {\bf Head to Head:\ }We
Me@0	1108 compare our implementation of the \texttt{spawn} and \texttt{sync}
Me@0	1109 constructs against Cilk 5.4, on the top in Table
Me@0	1110 \ref{tabHeadToHead}, which shows that the same application code has similar
Me@0	1111 performance. For large matrices, Cilk 5.4's better use of the memory hierarchy
Me@0	1112 achieves 23\% better performance. However, for small matrices, VCilk is better,
Me@0	1113 with a factor 2 lower overhead. Cilk 5.4 does not allow controlling the number of spawn events it
Me@0	1114 actually executes, and chooses to run smaller matrices sequentially, limiting
Me@0	1115 our comparison.
Me@0	1116
Me@0	1117 When comparing to pthreads, our VMS based implementation has more than an
Me@0	1118 order of magnitude better overhead per invocation of mutex or condition
Me@0	1119 variable functionality, as seen on the bottom of Table \ref{tabHeadToHead}.
Me@0	1120 Applications that inherently have short trace segments will synchronize often
Me@0	1121 and benefit the most from Vthread.
Me@0	1122
Me@0	1123
Me@0	1124
Me@0	1125 \begin{table}[ht]
Me@0	1126 \caption
Me@0	1127 {On top, exe time in seconds for MM. Below, overhead for pthread vs Vthread. First column is cycles
Me@0	1128 for perfect memory and second is total measured cycles. pthread cycles are
Me@0	1129 deduced from round-trip experiments.
Me@0	1130 }
Me@0	1131 \label{tabHeadToHead}
Me@0	1132 \begin{center}
Me@0	1133 \begin{tabular}{\|l@{\ }\|@{\ }lr\|}
Me@0	1134 \hline
Me@0	1135 \rule{0pt}{12pt}
Me@0	1136 Matrix size&Lang.&sec.\\
Me@0	1137 [2pt]\hline
Me@0	1138 81x81 & Cilk & 0.017\\
Me@0	1139 & VCilk & 0.008\\
Me@0	1140 \hline
Me@0	1141 324x324 & Cilk & 0.13\\
Me@0	1142 & VCilk & 0.13\\
Me@0	1143 \hline
Me@0	1144 648x648 & Cilk & 0.71\\
Me@0	1145 & VCilk & 0.85\\
Me@0	1146 \hline
Me@0	1147 1296x1296 & Cilk & 4.8\\
Me@0	1148 & VCilk & 6.2 \\
Me@0	1149 [2pt]\hline
Me@0	1150 \end{tabular}
Me@0	1151 \begin{tabular}{c}
Me@0	1152 \begin{tabular}{\|l\|rr\|r\|r\|}
Me@0	1153 \hline
Me@0	1154 operation & \multicolumn{2}{c\|}{Vthread} & pthread & ratio\\
Me@0	1155 \hline
Me@0	1156 & comp & total & & \\
Me@0	1157 & only & & & \\
Me@0	1158 [2pt]\hline
Me@0	1159 mutex\_lock & 85 & 1050 & 50,000 & 48:1\\
Me@0	1160 mutex\_unlock & 85 & 610 & 45,000 & 74:1\\
Me@0	1161 cond\_wait & 85 & 850 & 60,000 & 71:1\\
Me@0	1162 cond\_signal & 90 & 650 & 60,000 & 92:1\\
Me@0	1163 \hline
Me@0	1164 \end{tabular}\\
Me@0	1165 \end{tabular}
Me@0	1166 \end{center}
Me@0	1167 \end{table}
Me@0	1168
Me@0	1169 \section{Conclusion}\label{secConclusion}
Me@0	1170
Me@0	1171 We have shown an alternative to the Thread model that enables easier-to-use
Me@0	1172 parallelism constructs by splitting
Me@0	1173 the scheduler open, to accept new parallelism constructs in the form of
Me@0	1174 plugins. This gives the language control over assigning virtual
Me@0	1175 processors to physical cores, for performance, debugging, and flexibility
Me@0	1176 benefits. Parallelism constructs of languages are implementable using
Me@0	1177 sequential algorithms, within a matter of days, while maintaining low run-time
Me@0	1178 overhead, on the order of a few hundred nano-seconds per concurrency
Me@0	1179 operation.
Me@0	1180
Me@0	1181 \bibliography{Bib_for_papers}
Me@0	1182
Me@0	1183 \end{document}

Mercurial > cgi-bin > hgwebdir.cgi > VMS > 0__Writings > kshalle

annotate 0__Papers/Future_Architecture/Future_Architecture.tex @ 4:01aefc2be990