Mercurial > cgi-bin > hgwebdir.cgi > VMS > 0__Writings > kshalle
view 0__Papers/Future_Architecture/Future_Architecture.tex @ 0:7b45d1764e2b
Initial add
| author | Me@portablequad |
|---|---|
| date | Tue, 03 Jan 2012 16:29:44 -0800 |
| parents | |
| children |
line source
3 \documentclass[conference]{IEEEtran}
4 %
5 \usepackage{makeidx,geometry,amssymb,graphicx,calc,ifthen}
6 %
8 % *** CITATION PACKAGES ***
9 %
10 %\usepackage{cite}
11 % cite.sty was written by Donald Arseneau
12 % V1.6 and later of IEEEtran pre-defines the format of the cite.sty package
13 % \cite{} output to follow that of IEEE. Loading the cite package will
14 % result in citation numbers being automatically sorted and properly
15 % "compressed/ranged". e.g., [1], [9], [2], [7], [5], [6] without using
16 % cite.sty will become [1], [2], [5]--[7], [9] using cite.sty. cite.sty's
17 % \cite will automatically add leading space, if needed. Use cite.sty's
18 % noadjust option (cite.sty V3.8 and later) if you want to turn this off.
19 % cite.sty is already installed on most LaTeX systems. Be sure and use
20 % version 4.0 (2003-05-27) and later if using hyperref.sty. cite.sty does
21 % not currently provide for hyperlinked citations.
22 % The latest version can be obtained at:
23 % http://www.ctan.org/tex-archive/macros/latex/contrib/cite/
24 % The documentation is contained in the cite.sty file itself.
31 % *** GRAPHICS RELATED PACKAGES ***
32 %
33 \ifCLASSINFOpdf
34 % \usepackage[pdftex]{graphicx}
35 % declare the path(s) where your graphic files are
36 % \graphicspath{{../pdf/}{../jpeg/}}
37 % and their extensions so you won't have to specify these with
38 % every instance of \includegraphics
39 % \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
40 \else
41 % or other class option (dvipsone, dvipdf, if not using dvips). graphicx
42 % will default to the driver specified in the system graphics.cfg if no
43 % driver is specified.
44 % \usepackage[dvips]{graphicx}
45 % declare the path(s) where your graphic files are
46 % \graphicspath{{../eps/}}
47 % and their extensions so you won't have to specify these with
48 % every instance of \includegraphics
49 % \DeclareGraphicsExtensions{.eps}
50 \fi
51 % graphicx was written by David Carlisle and Sebastian Rahtz. It is
52 % required if you want graphics, photos, etc. graphicx.sty is already
53 % installed on most LaTeX systems. The latest version and documentation can
54 % be obtained at:
55 % http://www.ctan.org/tex-archive/macros/latex/required/graphics/
56 % Another good source of documentation is "Using Imported Graphics in
57 % LaTeX2e" by Keith Reckdahl which can be found as epslatex.ps or
58 % epslatex.pdf at: http://www.ctan.org/tex-archive/info/
59 %
60 % latex, and pdflatex in dvi mode, support graphics in encapsulated
61 % postscript (.eps) format. pdflatex in pdf mode supports graphics
62 % in .pdf, .jpeg, .png and .mps (metapost) formats. Users should ensure
63 % that all non-photo figures use a vector format (.eps, .pdf, .mps) and
64 % not a bitmapped formats (.jpeg, .png). IEEE frowns on bitmapped formats
65 % which can result in "jaggedy"/blurry rendering of lines and letters as
66 % well as large increases in file sizes.
67 %
68 % You can find documentation about the pdfTeX application at:
69 % http://www.tug.org/applications/pdftex
75 % *** MATH PACKAGES ***
76 %
77 %\usepackage[cmex10]{amsmath}
78 % A popular package from the American Mathematical Society that provides
79 % many useful and powerful commands for dealing with mathematics. If using
80 % it, be sure to load this package with the cmex10 option to ensure that
81 % only type 1 fonts will utilized at all point sizes. Without this option,
82 % it is possible that some math symbols, particularly those within
83 % footnotes, will be rendered in bitmap form which will result in a
84 % document that can not be IEEE Xplore compliant!
85 %
86 % Also, note that the amsmath package sets \interdisplaylinepenalty to 10000
87 % thus preventing page breaks from occurring within multiline equations. Use:
88 %\interdisplaylinepenalty=2500
89 % after loading amsmath to restore such page breaks as IEEEtran.cls normally
90 % does. amsmath.sty is already installed on most LaTeX systems. The latest
91 % version and documentation can be obtained at:
92 % http://www.ctan.org/tex-archive/macros/latex/required/amslatex/math/
98 % *** SPECIALIZED LIST PACKAGES ***
99 %
100 %\usepackage{algorithmic}
101 % algorithmic.sty was written by Peter Williams and Rogerio Brito.
102 % This package provides an algorithmic environment fo describing algorithms.
103 % You can use the algorithmic environment in-text or within a figure
104 % environment to provide for a floating algorithm. Do NOT use the algorithm
105 % floating environment provided by algorithm.sty (by the same authors) or
106 % algorithm2e.sty (by Christophe Fiorio) as IEEE does not use dedicated
107 % algorithm float types and packages that provide these will not provide
108 % correct IEEE style captions. The latest version and documentation of
109 % algorithmic.sty can be obtained at:
110 % http://www.ctan.org/tex-archive/macros/latex/contrib/algorithms/
111 % There is also a support site at:
112 % http://algorithms.berlios.de/index.html
113 % Also of interest may be the (relatively newer and more customizable)
114 % algorithmicx.sty package by Szasz Janos:
115 % http://www.ctan.org/tex-archive/macros/latex/contrib/algorithmicx/
120 % *** ALIGNMENT PACKAGES ***
121 %
122 %\usepackage{array}
123 % Frank Mittelbach's and David Carlisle's array.sty patches and improves
124 % the standard LaTeX2e array and tabular environments to provide better
125 % appearance and additional user controls. As the default LaTeX2e table
126 % generation code is lacking to the point of almost being broken with
127 % respect to the quality of the end results, all users are strongly
128 % advised to use an enhanced (at the very least that provided by array.sty)
129 % set of table tools. array.sty is already installed on most systems. The
130 % latest version and documentation can be obtained at:
131 % http://www.ctan.org/tex-archive/macros/latex/required/tools/
134 %\usepackage{mdwmath}
135 %\usepackage{mdwtab}
136 % Also highly recommended is Mark Wooding's extremely powerful MDW tools,
137 % especially mdwmath.sty and mdwtab.sty which are used to format equations
138 % and tables, respectively. The MDWtools set is already installed on most
139 % LaTeX systems. The lastest version and documentation is available at:
140 % http://www.ctan.org/tex-archive/macros/latex/contrib/mdwtools/
143 % IEEEtran contains the IEEEeqnarray family of commands that can be used to
144 % generate multiline equations as well as matrices, tables, etc., of high
145 % quality.
148 %\usepackage{eqparbox}
149 % Also of notable interest is Scott Pakin's eqparbox package for creating
150 % (automatically sized) equal width boxes - aka "natural width parboxes".
151 % Available at:
152 % http://www.ctan.org/tex-archive/macros/latex/contrib/eqparbox/
158 % *** SUBFIGURE PACKAGES ***
159 %\usepackage[tight,footnotesize]{subfigure}
160 % subfigure.sty was written by Steven Douglas Cochran. This package makes it
161 % easy to put subfigures in your figures. e.g., "Figure 1a and 1b". For IEEE
162 % work, it is a good idea to load it with the tight package option to reduce
163 % the amount of white space around the subfigures. subfigure.sty is already
164 % installed on most LaTeX systems. The latest version and documentation can
165 % be obtained at:
166 % http://www.ctan.org/tex-archive/obsolete/macros/latex/contrib/subfigure/
167 % subfigure.sty has been superceeded by subfig.sty.
171 %\usepackage[caption=false]{caption}
172 %\usepackage[font=footnotesize]{subfig}
173 % subfig.sty, also written by Steven Douglas Cochran, is the modern
174 % replacement for subfigure.sty. However, subfig.sty requires and
175 % automatically loads Axel Sommerfeldt's caption.sty which will override
176 % IEEEtran.cls handling of captions and this will result in nonIEEE style
177 % figure/table captions. To prevent this problem, be sure and preload
178 % caption.sty with its "caption=false" package option. This is will preserve
179 % IEEEtran.cls handing of captions. Version 1.3 (2005/06/28) and later
180 % (recommended due to many improvements over 1.2) of subfig.sty supports
181 % the caption=false option directly:
182 %\usepackage[caption=false,font=footnotesize]{subfig}
183 %
184 % The latest version and documentation can be obtained at:
185 % http://www.ctan.org/tex-archive/macros/latex/contrib/subfig/
186 % The latest version and documentation of caption.sty can be obtained at:
187 % http://www.ctan.org/tex-archive/macros/latex/contrib/caption/
192 % *** FLOAT PACKAGES ***
193 %
194 %\usepackage{fixltx2e}
195 % fixltx2e, the successor to the earlier fix2col.sty, was written by
196 % Frank Mittelbach and David Carlisle. This package corrects a few problems
197 % in the LaTeX2e kernel, the most notable of which is that in current
198 % LaTeX2e releases, the ordering of single and double column floats is not
199 % guaranteed to be preserved. Thus, an unpatched LaTeX2e can allow a
200 % single column figure to be placed prior to an earlier double column
201 % figure. The latest version and documentation can be found at:
202 % http://www.ctan.org/tex-archive/macros/latex/base/
206 %\usepackage{stfloats}
207 % stfloats.sty was written by Sigitas Tolusis. This package gives LaTeX2e
208 % the ability to do double column floats at the bottom of the page as well
209 % as the top. (e.g., "\begin{figure*}[!b]" is not normally possible in
210 % LaTeX2e). It also provides a command:
211 %\fnbelowfloat
212 % to enable the placement of footnotes below bottom floats (the standard
213 % LaTeX2e kernel puts them above bottom floats). This is an invasive package
214 % which rewrites many portions of the LaTeX2e float routines. It may not work
215 % with other packages that modify the LaTeX2e float routines. The latest
216 % version and documentation can be obtained at:
217 % http://www.ctan.org/tex-archive/macros/latex/contrib/sttools/
218 % Documentation is contained in the stfloats.sty comments as well as in the
219 % presfull.pdf file. Do not use the stfloats baselinefloat ability as IEEE
220 % does not allow \baselineskip to stretch. Authors submitting work to the
221 % IEEE should note that IEEE rarely uses double column equations and
222 % that authors should try to avoid such use. Do not be tempted to use the
223 % cuted.sty or midfloat.sty packages (also by Sigitas Tolusis) as IEEE does
224 % not format its papers in such ways.
230 % *** PDF, URL AND HYPERLINK PACKAGES ***
231 %
232 %\usepackage{url}
233 % url.sty was written by Donald Arseneau. It provides better support for
234 % handling and breaking URLs. url.sty is already installed on most LaTeX
235 % systems. The latest version can be obtained at:
236 % http://www.ctan.org/tex-archive/macros/latex/contrib/misc/
237 % Read the url.sty source comments for usage information. Basically,
238 % \url{my_url_here}.
244 % *** Do not adjust lengths that control margins, column widths, etc. ***
245 % *** Do not use packages that alter fonts (such as pslatex). ***
246 % There should be no need to do such things with IEEEtran.cls V1.6 and later.
247 % (Unless specifically asked to do so by the journal or conference you plan
248 % to submit to, of course. )
251 % correct bad hyphenation here
252 \hyphenation{op-tical net-works semi-conduc-tor}
255 \begin{document}
256 \bibliographystyle{plain}
257 %
259 \title{The Shape of Future Architectures}
261 \author
262 {
263 \IEEEauthorblockN{Sean Halle}
264 \IEEEauthorblockA
265 {
266 University of California at Santa Cruz\\
267 and INRIA Paris\\
268 Email: sean.halle@inria.fr
269 }
270 \and
271 \IEEEauthorblockN{Albert Cohen}
272 \IEEEauthorblockA
273 { INRIA Paris\\
274 Email: albert.cohen@infria.fr
275 }
276 }
279 \maketitle
280 %
282 \begin{abstract}
283 Process technology is rapidly approaching multiple fundamental physical limits that combine to increase the cost of producing chips faster than the increase in density. Whether or not process technology continues to shrink, the economics will slow the demand for chips from future processes. This will, for the first time, shift the design decisions from being shortest time-to-market towards being the greatest efficiency per transistor -- either in computation delivered towards the goal end-user measurement such as throughput, or in energy per operation.
285 The implication for processor architecture is to accelerate the trend away from "easy to program" structures such as out-of-order pipelines and coherent shared memory, and towards hierarchical designs that force application information to be exploited in explicitly managing the placement and movement of data, in order to keep it as local for as much of the computation as possible. The cost of communication will be the dominant factor driving the shape of future architectures, both from a computation-bottleneck cost and an energy per operation cost.
287 Integration will be pushed towards the third dimension in order to mitigate the communication costs -- indeed main memories have long been on that path. This approach will hasten the development of 3 dimensional integration processes. The architecture implication is that
288 \end{abstract}
290 \section{Motivation}
292 Threads were invented to be virtual sequential processors, but are
293 problematic when one has multiple physical cores. Not only
294 are they difficult to use for parallel code, but their parallel performance is poor
295 because they block the language from controlling
296 task placement. The language often knows which cores are likely to already
297 have data in the cache which a new task requires as input. If it had
298 control, the language could place tasks where the data is likely to already
299 reside, reducing communication and increasing performance and power
300 efficiency. A thread replacement is needed that is both easy to use in
301 applications and gives the language control over task placement.
303 The first question is, what is the easiest to use parallelism construct? There
304 exist many to choose from: Threads (for legacy reasons); Actors
305 {\cite{Hewitt10}}{\cite{Actors97}}; Components {\cite{ComponentModel00}};
306 process calculi {\cite{hoare78}} {\cite{milner99}};
307 coordination languages {\cite{Gelernter85Linda}}; and new ones
308 continually invented.
310 We demonstrate in this paper a thread replacement that lets any such parallelism
311 construct be plugged in. It is the first extensible hardware abstraction,
312 allowing to plug-in both preferred parallelism constructs and preferred
313 runtime scheduler. We focus in this introductory paper on
314 the parallelism construct support, illustrating how to implement synchronous
315 \texttt{send}-\texttt{receive} constructs motivated by process calculi,
316 the \texttt{spawn} and \texttt{sync} constructs from Cilk, and even
317 \texttt{mutex} and \texttt{condition variable} constructs from pthreads.
318 However, the schedulers we implemented have no
319 interesting performance optimizations.
321 The next question is, how much effort is required to implement a plugin? Our
322 mechanism protects the plugin from exposure to concurrency issues, allowing
323 the parallelism constructs to be implemented with sequential algorithms. It
324 accomplishes this by introducing a ``virtual'' time that sequentializes events
325 that appear, to the program, to be simultaneous.
327 We call our abstraction Virtualized Master-Slave, or VMS. We demonstrate a user-level
328 implementation in this paper, although it is ideally implemented as the base
329 hardware abstraction of the OS.
331 It involves subtle concepts related to time in the program vs time in the
332 plugin vs physical time. We explain it in four levels, starting with an abstract definition
333 and moving down to implementation.
335 The first level is the theoretical definition, given in Section \ref{secAbsModel}. Second is the elements
336 of our implementation and how they relate to the theoretical definition, in Section \ref{secInternal}.
337 Third is the application code point of view, in Section \ref{secApp}, which is tied back to both
338 the abstract model and the internal elements. The fourth view,
339 also in Section \ref{secApp}, shows the implementation of the plugin for a parallel
340 construct. Additionally,
341 measurements of effectiveness appear
342 in Section \ref{secResults} and conclusion is in \ref{secConclusion}.
344 \section{Background and Related Work}
346 {\bf Side-stepping OS Threads:\ }User-level thread packages and most parallel
347 language runtimes have to side-step OS threads, by pinning one to each core,
348 which effectively gives the user-level package control over the core. Our VMS
349 implementation also does this. We are not claiming in this paper to have the
350 OS level implementation of VMS that the model is capable of -- just the
351 user-space version.
353 Related work either provides a variation on the Thread model, or is a full
354 language with included parallelism constructs. For example,
355 TBB {\cite{TBBHome}} is a user-space thread package with added features.
356 It allows
357 tasks to create other tasks and control when to start their
358 execution. However, one cannot plug in alternate parallelism-control constructs, such as
359 rendez-vous style send and receive, which would have to be made using locks.
360 Likewise, Cilk {\cite{CILKHome}}
361 provides a simpler and easier to use version of TBB's task scheduler, but is
362 limited to only tree-shaped task graphs. OpenMP {\cite{OpenMPHome}} allows
363 creating tasks and controlling their execution order,
364 but is a language, not a hardware abstraction. All of these may
365 alternatively be implemented via plugins to VMS, with similar performance.
367 In contrast, VMS is the first hardware abstraction that doesn't impose its own concurrency
368 semantics, but rather takes preferred ones as plugins. This makes it not a
369 language itself, but a {\em{support}} mechanism for language level
370 parallelism constructs. The parallelism constructs in Actors, Components,
371 pthreads, and so on may all be implemented as VMS plugins.
373 {\bf Virtual Processor (VP) Definition:\ }We want to avoid the confusion
374 associated with the terms ``thread'' and ``task'' so will use the term
375 {\em{virtual processor}} (VP), which we define as state in combination with
376 the ability to animate code or {\em{an additional level of virtual
377 processors}}. The state consists of a stack with its contents, a program
378 counter, a pointer to top of stack, and a pointer to the current stack frame.
380 \section{Abstract Definition of VMS}
381 \label{secAbsModel}
383 We give an intuitive overview, then add details in the following sub-sections.
385 {\bf Intuitive Overview:\ }VMS is concerned primarily with time and
386 guarantees about it. This is because parallelism constructs control how the
387 time-lines of different virtual processors intersect. They also guarantee
388 relations of time lines to hardware events.
390 As an example, consider a program that writes into a data structure in one
391 time-line, then calls a \texttt{send} construct, meanwhile in a different
392 time-line it calls the \texttt{receive} construct then reads the data
393 structure. The constructs should guarantee that all data written before the
394 \texttt{send} is readable in the other time-line after the
395 \texttt{receive}. VMS provides primitive guarantees, which plugged-in code
396 builds upon to provide such higher-level guarantees.
398 To support parallelism constructs, VMS provides: primitive operations to create and suspend
399 VPs; a way for plugged-in code to control when each VP is (re)started; and
400 time-related guarantees. These are enforced on all hardware, be it shared
401 memory or distributed, with strong memory consistency or weak.
403 {\bf Definition in Three Parts:\ }We give the abstract definition in three
404 parts: a definition of the elements of a VMS computation system; a definition
405 of time and the key VMS guarantee; and a definition of virtual processor
406 scheduling states and transitions between them.
408 The definition we give is for VMS {\em{with plugins present}}. \ Hence, it
409 covers the behavior of all possible parallelism constructs implementable with
410 VMS. The Master mentioned in the definition is an abstract
411 entity, with a plugin present. In practice, this Master entity is implemented
412 as part of a core VMS, and plugins later added. This VMS-core is the
413 hardware abstraction. It hides the physical hardware behind an interface that
414 creates virtual processors and enforces well-defined time-behavior.
417 \subsection{The Elements of a VMS Computation System}
419 \texttt{- }A VMS program has multiple VPs, which are Slaves,
420 each with an independent time-line.
422 \texttt{- }A schedule of Slaves is generated by a Master
423 entity, from within separate time-line(s).
425 \texttt{- }A schedule is defined as the set of points at which
426 VPs are (re)animated.
428 \texttt{- }All semantic parallelism behavior is invoked via
429 communication with the Master.
431 \texttt{- }Communication with the Master happens by using a
432 model-provided primitive, which causes {\em{voluntary}} suspension of the
433 program's VP.
436 What is important here is: that the choice of which VP is animated, at which
437 point, happens in a separate time-line; and that the VPs suspend voluntarily
438 for each parallelism construct. This means that {\em{scheduling is separated
439 from the application code}}, the key point.
441 The Master entity appears to be a single entity to the slaves, but may be
442 implemented by multiple Master VPs hidden inside the VMS implementation.
444 VPs use the Master as an intermediary to: semantically communicate with each
445 other; cause creation of new program VPs; and to influence re-animation of
446 VPs. As a subtlety, notice that hardware mechanisms, such as coherent shared
447 memory, allow communication to take place that is not visible to the
448 parallelism constructs. Parallelism constructs must be separately called in
449 order to make use of shared variable communication safe.
451 {\bf Definitions:\ }VMS is intended only for hardware systems that consist of processing elements
452 connected by communication. We define a memory-space to be a processing
453 element, albeit without the ability to transform data. We define a
454 {\em{physical core}} to be a processing element that {\em{does}} transform
455 data, and require that it execute a sequential stream of instructions. We
456 define a program-time as the sequence of instructions animated by a Slave VP
457 (which is eventually animated by a physical core). A Slave VP has associated
458 {\em{scheduling state}} that, among other things, relates to how its
459 program-time progresses relative to physical time on the cores.
461 \subsection{Time in VMS}
463 \texttt{- }VMS has three levels of time: {\em{Program time}},
464 {\em{ Master time}}, and {\em{Virtual time}}.
466 \texttt{- }Program time is local to a Slave VP, measured in
467 instruction executions.
469 \texttt{- }Master time is hidden from the program and is
470 independent from all Program times.
472 \texttt{- }Virtual time is the ordered set of changes in
473 scheduling state of Slave VPs.
477 What is most important here is that Virtual time defines a global sequential
478 ordering. This ordering is consistent with the key VMS guarantee (given
479 below), and each point in it is computed within Master time.
481 Also, the independence between program times and master time has subtle
482 advantages. It enables elegant enforcement of the VMS guarantee, and
483 implementation simplifications that become clear after gaining deep
484 implementation knowledge.
486 In VMS, each event relevant to parallel semantics is tied to a transition of
487 the state of a Slave VP. This means that implementing the behavior of parallel
488 semantics is equivalent to controlling the order of transitions of state of
489 virtual processors.
491 {\bf Definitions:\ }The stream of instructions in a given program-time is
492 broken into a number of {\em{trace-segments}}, separated by suspension
493 points. Each trace-segment is animated by a single physical core, but not
494 necessarily the same core as animated the other trace segments. A suspend
495 point is created by a Slave VP executing the ``suspend'' primitive provided by
496 VMS. A suspend point has no duration in program time, but has distinct start
497 and end points in virtual time. The end-suspension points of two different
498 program times can be tied together within virtual time, which is called a
499 {\em{tie point}} and has special significance to parallel constructs. The
500 physical-time of a core has no relationship to any program time, except for
501 the various time-guarantees in this definition of VMS.
503 \begin{figure}[ht]
504 \includegraphics[width=2.9in]{../figures/Time_in_VMS_1.png}
505 \caption
506 {Mapping program time onto Virtual time. \ The
507 Master controls creation of new program time lines, and ending suspend
508 points. Here, it has ended two suspend points at a common tie-point.
509 }
510 \label{figTimeMapping}
511 \end{figure}
513 {\bf Relating time-lines to each other:\ }Figure \ref{figTimeMapping}
514 illustrates how trace-segments relate to suspend points, and map onto
515 virtual time. A trace segment starts in virtual time where suspend is ended, as seen.
516 In fact, the two trace segments shown have a common
517 start-point within virtual time. This is because the parallelism semantics
518 chose to start them at the same point -- this is what a tie
519 point is. A key note is that the lengths in virtual time have no relation to
520 the lengths in program-time. The only defined feature is that those two
521 trace-segments have a common start-point in virtual time. This means that the
522 two suspend points are considered to be tied together.
524 {\bf The Key VMS guarantee:\ }\label{VMSguarantee}Being tied together
525 means that all physical events that can be observed by both program-times are
526 covered by the key VMS guarantee: any events triggered before the common
527 suspend point in one program time are guaranteed visible in the other program
528 time after the common suspend point. They {\em{might}} be visible before,
529 but it's not guaranteed. In addition, events triggered after the common
530 suspend point in one are guaranteed not visible before the common suspend
531 point in the other. {\em{This two-part guarantee can be considered the heart
532 of VMS.}}
534 Intuitively, a tie-point separates before it from after such that tied program
535 times agree (illustrated with code in Section \ref{secApp} Figure \ref{figAnimVP}). But the subtlety is
536 that events triggered before the tie-point, {\em{might}} be visible to the
537 other before, and ones triggered after {\em{might not}} be visible to the
538 other after -- physical events triggered before are only guaranteed visible
539 {\em{after}} the tie point, and events after are only guaranteed
540 {\em{not}} visible {\em{before}} the tie point.
542 This is a form of bounded non-determinism. The pattern of suspension
543 end-points determines which trace-segments overlap in Virtual time, and events
544 triggered in one might be visible in overlapped ones. But no guarantees cover
545 these. If one segment tries to observe, it will see events triggered by
546 overlapped segments in non-deterministic order.
548 The VMS implementation defines which physical events are covered by the key
549 VMS guarantee (reads/writes, network communication, DMA, I/O).
551 {\bf Globally consistent sequential order:\ }VMS maps suspend-start,
552 suspend-end, and hence tie-points, to a globally-consistent sequential order
553 in Virtual time. This enables one of VMS's key benefits: sequential
554 algorithms for parallel constructs.
556 Tie points define parallel behavior, so an implementation of how to choose tie
557 points equals an implementation of parallel constructs. The Master chooses
558 tie-points, so plugging code to choose tie-points into the Master equals
559 plugging in parallel constructs.
561 \subsection{Scheduling State}
563 Scheduling state is used in VMS to organize internal activity, for enforcing
564 the guarantees.
566 \texttt{- }VPs have three scheduling states: {\em{Animated}},
567 {\em{Blocked}}, {\em{Ready}} (Figure \ref{figStates}).
569 \texttt{- }VPs in Animated are {\em{allowed}} to advance
570 program-time with {\em{local}} physical time.
572 \texttt{- }VPs in Blocked and Ready do not advance their
573 program-time.
575 \texttt{- }Animated has two physical states:
576 {\em{Progressing}} and {\em{Stalled}}.
578 \texttt{- }VPs in Progressing advance program-time with local
579 physical time, those in Stalled do not (allowing non-semantic suspend).
581 \texttt{- }Scheduling states are defined in Virtual time only.
583 \texttt{- }Progressing and Stalled are defined in local
584 physical time only (invisible in Virtual).
587 \begin{figure}[h]
588 \includegraphics{../figures/Scheduling_states_2.png}
589 \caption{Scheduling states of a slave VP in the VMS model.}
590 \label{figStates}
591 \end{figure}
593 Some important points: 1) only VPs Animated can trigger physical events that
594 are seen in other program time-lines; 2) the distinction between Blocked vs
595 Stalled is that a VP has to explicitly execute a VMS primitive operation to
596 enter Blocked, making it part of the semantics of parallelism constructs. In
597 contrast, Stalled happens invisibly, with no effect on semantic behavior. It
598 is due to hardware events hidden inside VMS, such as interrupts.
600 The Ready state is used to separate the parallelism-construct behavior from
601 the scheduling behavior. It acts as a ``staging area'' for scheduling. VPs
602 placed into this state are {\em{allowed}} to be animated, then the scheduler
603 decides when and where.
605 A subtle but illustrative point is that actions {\em{outside}} a given
606 program time cause the VP to transition Blocked$\rightarrow$Ready, which
607 contrasts to lock algorithms like spin-locks or Dijkstra's, where the
608 concurrency-related behavior takes place {\em{inside}} program time.
610 {\bf Transition Between Slave Scheduling States:\ }
612 \texttt{- }VPs transition states as shown in Figure \ref{figStates}.
614 \texttt{- }Animated$\rightarrow$Blocked is caused by a Slave VP
615 executing the Suspend VMS primitive.
617 \texttt{- }Blocked$\rightarrow$Ready is determined by the
618 semantics implemented in the plugin.
620 \texttt{- }Ready$\rightarrow$Animated is determined by the
621 scheduler in the plugin.
623 \texttt{- }Transitions in scheduling state have a globally
624 consistent order in Virtual time.
627 The parallelism primitives executed by a program do not control change in
628 scheduling states. They merely communicate messages to the Master, via a VMS
629 supplied primitive. Inside the Master, the plugin's parallelism construct
630 implementation processes the message. Based on that, it performs changes in
631 state from Blocked$\rightarrow$Ready, creates new VPs, and dissipates existing
632 VPs. Most communication from Slave to Master requires the VP to suspend when
633 it sends the message. A few messages, like creating new Slave may be sent
634 without suspending.
636 The suspend primitive decouples local physical time from Virtual time.
637 Execution causes immediate transition to Stalled in physical time, then the
638 Master performs Animated$\rightarrow$Blocked, fixing that transition in
639 Virtual time. The only relationship is causality. This weak relation is what
640 allows suspension-points to be serialized in Virtual time, which in turn is
641 what allows using sequential algorithms to implement parallelism constructs.
644 \subsection{Plugins}
646 \begin{figure}[ht]
647 \includegraphics{../figures/VMS-core__plugins.png}
648 \caption
649 {
650 The Master has been split into a generic core and a language-specific plug-in.
651 The core encapsulates the hardware and remains the same across applications.
652 The plug-in is part of the parallelism-construct implementation. It is
653 loaded separately onto the hardware and linked to the application when run.
654 }
655 \label{figMasterSplit}
656 \end{figure}
659 The Master entity has two parts, a generic core part and a plugin (Figure \ref{figMasterSplit}).
660 The core part of the Master is implemented as part of
661 VMS-core. The plug-in supplies two functions: the communication-handler and
662 the scheduler, both having a standard prototype. The communication-handler
663 implements the parallelism constructs, while scheduler assigns VPs to cores.
665 An {\em{instance}} of a plugin is created as part of initializing an
666 application, and the instance holds the semantic and scheduling state for that
667 run of the application. This state, combined with the virtual processor states
668 of the slaves created during that application run, represents progress of the
669 work of the application. \ For example, multi-tasking is performed simply by
670 the Master switching among plug-in instances when it has a resource to offer
671 to a scheduler. The parallelism-semantic state holds all information needed to
672 resume (hardware state, such as TLB and cache-tags is inside VMS-core).
676 \section{Internal Workings of Our Implementation}
677 \label{secInternal}
679 We name the elements of our example implementation and describe their logical
680 function, then relate them to the abstract model. We then step through the
681 operation of the elements.
683 {\bf Elements and Their Logical Function:\ }As illustrated in Figure \ref{figInternals},
684 our VMS implementation is organized around physical cores.
685 Each core has its own {\em{master
686 virtual-processor}}, \texttt{masterVP}, and a {\em{physical-core controller}}, which communicate via a set of
687 scheduling slots, \texttt{schedSlot}. The Master in the abstract definition
688 is implemented by the multiple \texttt{masterVP}s plus a particular plugin
689 instance with its shared parallelism-semantic state (seen at the top).
691 On a given core, only one of: the core-controller, \texttt{masterVP}, or a
692 slave VP, is animated at any point in local physical time. Each
693 \texttt{masterVP} animates the same function, called
694 \texttt{master\_loop}, and each slave VP animates a function from the
695 application, starting with the top-level function the slave is created with,
696 and following its call sequence. The core controller is implemented here as a
697 Linux pthread that runs the \texttt{core\_loop} function.
699 Switching between VPs is done by executing a VMS primitive that suspends the
700 VP. This switches the physical core over to the controller, by jumping to the
701 start of the \texttt{core\_loop} function, which chooses the next VP and
702 switches to that (switching is detailed in Section \ref{secApp} Figure \ref{figAssembly}).
704 {\bf Relation to Abstract Model:\ }We chose to implement the Master entity
705 of the model by a set of \texttt{masterVP}s, plus plug-in functions and
706 shared parallelism-semantic state. What we call VMS-core consists of this
707 implementation of the Master, plus the core-controllers, plus the VMS
708 primitive libraries, for creating new VPs and dissipating existing VPs, suspending VPs,
709 and communicating from slave VP to Master. In Figure \ref{figInternals},
710 everything in green is part of VMS-core, while the plugin is in red, and
711 application code appears as blue, inside the slave VP.
713 Virtual time in the model is implemented via a combination of four things: a
714 \texttt{masterLock} (not shown) that guarantees non-overlap of
715 \texttt{masterVP} trace-segments; the \texttt{master\_loop} which performs
716 transition Animated$\rightarrow$Blocked; the \texttt{comm\_handler\_fn}
717 which performs Blocked$\rightarrow$Ready and the \texttt{scheduler\_fn}
718 which performs Ready$\rightarrow$Animated. \ Each state transition is one step
719 of Virtual time; is guaranteed sequential by the non-overlap of
720 \texttt{masterVP} trace segments; and is global due to being in
721 parallelism-semantic state that is shared (top of Figure \ref{figInternals}).
723 Transitions Progressing$\rightleftarrows$Stalled within the Animated state are
724 invisible to the parallelism semantics, the Master, and Virtual time, and so
725 have no effect on the elements seen.
728 \begin{figure*}[!t]
729 \includegraphics[width=5in]{../figures/VMS-core__internal_workings.png}
730 \caption
731 { Internal elements of our example VMS implementation
732 }
733 \label{figInternals}
734 \end{figure*}
736 {\bf Steps of Operation:\ }The steps of operation are numbered, in Figure \ref{figInternals}.
737 Taking them in order, 1) \texttt{master\_loop} scans
738 the scheduling slots to see which ones' slaves have suspended since the
739 previous scan. \ 2) It hands these to the \texttt{comm\_handler\_fn} plugged
740 in (which equals transition Animated$\rightarrow$Blocked). \ 3) The VP has a
741 request attached, and data in it causes \ the \texttt{comm\_handler\_fn}
742 to manipulate data structures in the shared parallelism-semantic state. \
743 These structures hold all the slaves in the blocked state (code-level detail
744 in Figure \ref{figReqHdlr}, Section \ref{secApp}). \ 4) Some requests cause slaves to be moved to a
745 \texttt{readyQ} on one of the cores (Blocked$\rightarrow$Ready). Which
746 core's \texttt{readyQ} receives the slave is under plugin control,
747 determined by a combination of request contents, semantic state and physical
748 machine state. 5) During the scan, the \texttt{master\_loop} also looks for
749 empty slots, and for each calls the \texttt{scheduler\_fn} plugged in. It
750 chooses a slave from the \texttt{readyQ} on the core animating
751 \texttt{master\_loop}. \ 6) The \texttt{master\_loop} then places the
752 slave VP's pointer into the scheduling slot (Ready$\rightarrow$Animated),
753 making it available to the \texttt{core\_loop}. 7) When done with the scan,
754 \texttt{masterVP} suspends, switching animation back to the
755 \texttt{core\_loop}. \ 8) \texttt{core\_loop} takes slave VPs out of the
756 slots, then 9) switches animation to them. \ 10) When a slave self-suspends,
757 animation returns to the \texttt{core\_loop} (detail in code in Figure 9),
758 which picks another, until 11) all slots are empty and the
759 \texttt{core\_loop} switches animation to the \texttt{masterVP}.
761 {\bf Enabling sequential implementation of parallelism semantics:\ }All of
762 that happens on each core separately, but in this particular implementation we
763 use a central \texttt{masterLock} to ensure that only one core's
764 \texttt{masterVP} can be active at any time. This guarantees non-overlap
765 of trace-segments from different \texttt{masterVP}s, allowing the plugins to
766 use sequential algorithms, without a performance penalty, as verified in
767 Section \ref{secResults}.
769 Relating this to the abstract model: the parallelism-semantic behavior of the
770 Master is implemented by the communication handler, in the plugin. It thus
771 runs in the Master time referred to, in the model, in Section \ref{secAbsModel}. Requests are
772 sent to the Master by self-suspension of the slaves, but sit idle until the
773 other slaves in the scheduling slots have also run. This is the passive
774 behavior of requests that was noted in Section \ref{secAbsModel}, which allows the
775 \texttt{masterVP}s to remain suspended until needed. This in turn enables
776 the \texttt{masterVP}s from different cores to be non-overlapped. It is the
777 non-overlap that enables the algorithms for the parallelism semantics to be
778 sequential.
783 \section{Code Level View}
784 \label{secApp}
786 To relate the abstract model and the internal elements to application code and
787 parallelism-library code, we give code snippets that illustrate key features.
788 We start with the application then work down through the sequence of calls,
789 to the plugin, using our SSR {\cite{VMSHome}} parallelism-library as an
790 example.
792 In general, applications are either written in terms of a full custom language
793 that has its own syntax, or a base language with a parallelism library, which
794 is often called an {\em{embedded language}}. Our demonstrators, VCilk
795 {\cite{VMSHome}}, Vthread, and SSR, are all parallelism libraries. A full
796 custom language would follow the standard practice of performing
797 source-to-source transform, from custom syntax into C plus parallelism-library
798 calls.
800 {\bf SSR:\ }SSR stands for Synchronous Send-Receive, and details of its
801 calls and internal implementation will be given throughout this section. It
802 has two types of construct. The first, called {\em{from-to}} has two calls:
803 \texttt{SSR\_send\_from\_to} and \texttt{SSR\_receive\_from\_to}, both of
804 which specify the sending VP as well as the receiving VP. \ The other, called
805 {\em{of-type}} also has two calls: \texttt{SSR\_\_send\_of\_type\_to} and
806 \texttt{SSR\_\_receive\_of\_type}, which allow a receiver to accept from
807 anonymous senders, but select according to type of message.
810 % An example of a double column floating figure using two subfigures.
811 % (The subfig.sty package must be loaded for this to work.)
812 % The subfigure \label commands are set within each subfloat command, the
813 % \label for the overall figure must come after \caption.
814 % \hfil must be used as a separator to get equal spacing.
815 % The subfigure.sty package works much the same way, except \subfigure is
816 % used instead of \subfloat.
817 %
818 %\begin{figure*}[!t]
819 %\centerline{\subfloat[Case I]\includegraphics[width=2.5in]{subfigcase1}%
820 %\label{fig_first_case}}
821 %\hfil
822 %\subfloat[Case II]{\includegraphics[width=2.5in]{subfigcase2}%
823 %\label{fig_second_case}}}
824 %\caption{Simulation results}
825 %\label{fig_sim}
826 %\end{figure*}
827 %
828 % Note that often IEEE papers with subfigures do not employ subfigure
829 % captions (using the optional argument to \subfloat), but instead will
830 % reference/describe all of them (a), (b), etc., within the main caption.
833 {\bf Application View:\ } Figure \ref{figAnimVP} shows snippets of application code, which use the SSR parallelism
834 library. The most important feature is that all calls take a pointer to the
835 VP that is animating the call. This is seen at the top of the figure where slave VP creation takes a pointer
836 to the VP asking for creation. Below that is the standard prototype for top level functions,
837 showing that the function receives a pointer to the VP it is the top level function for.
839 The pointer is placed on the stack by VMS when it creates the VP, and is the means by
840 which the application comes into possession of the pointer. This animating VP is
841 passed to all library calls made from there. For example, the bottom shows a pointer to the
842 animating VP placed in the position of sender in
843 the \texttt{send} construct call. Correspondingly, for the \texttt{receive} construct,
844 the position of receiving VP is filled by the VP animating the call.
846 \begin{figure}[ht]
847 {\noindent
848 {\scriptsize
849 {\small Creating a new processor:}
850 \begin{verbatim}
851 newProcessor = SSR__create_procr( &top_VP_fn,
852 paramsPtr, animatingVP ); \end{verbatim}
854 {\small prototype for the top level function:}
855 \begin{verbatim}
856 top_VP_fn( void *parameterStrucPtr, VirtProcr
857 *animatingVP ); \end{verbatim}
859 {\small handing animating VP to parallelism constructs:}
860 \begin{verbatim}
861 SSR__send_from_to( messagePtr, animatingVP,
862 receivingVP );
863 messagePtr = SSR__receive_from_to( sendingVP,
864 animatingVP ); \end{verbatim}
865 }
866 }
867 \caption
868 {
869 Application code snippets showing that all calls to the parallelism library
870 take the VP animating that call as a parameter.
871 }
872 \label{figAnimVP}
873 \end{figure}
875 Relating these to the internal elements of our implementation, the
876 \texttt{animatingVP} suspends inside each of these
877 calls, passing a request (generated in the library) to one of the \texttt{masterVP}s.
878 The \texttt{masterVP} then calls the \texttt{comm-handler}
879 plugin, and so on, as described in Section \ref{secInternal}.
881 For the \texttt{SSR\_\_create\_processor} call, the comm-handler
882 in turn calls a VMS primitive to perform the creation.
883 The primitive places a pointer to the newly created VP onto its stack, so that when
884 \texttt{top\_VP\_fn} is later animated, it sees the VP-pointer as a
885 parameter passed to it. \ All
886 application code is either such a top-level function, or has one at the root
887 of the call-stack.
889 The send and receive calls both suspend their animating VP. When both have
890 been called, the communication handler pairs them up and resumes both. This
891 ties time-lines together, invoking the VMS guarantee. Both
892 application-functions know, because of the VMS guarantee (Section \ref{secAbsModel}), that
893 writes to shared variables made before the send call by the sender are visible
894 to the receiver after the receive call. This is the programmer's view of tying
895 together the local time-lines of two different VPs, as defined in Section \ref{secAbsModel}.
898 {\bf Concurrency-Library View:\ }A parallelism library function, in
899 general, only creates a request, sends it, and returns, as seen below. To
900 send a request, it uses the combined request-and-suspend VMS primitive that
901 attaches the request then suspends the VP. The primitive requires the
902 pointer to the VP, to attach the request and to suspend it.
905 \begin{figure}[ht]
906 {\noindent
907 {\scriptsize
908 \begin{verbatim}
909 void * SSR__receive_from_to( VirtProcr *sendVP,
910 VirtProcr *receiveVP )
911 { SSRSemReq reqData;
912 reqData.receiveVP = receiveVP;
913 reqData.sendVP = sendVP;
914 reqData.reqType = receive_from_to;
915 VMS__send_sem_request( &reqData, receiveVP );
916 return receiveVP->dataReturnedFromRequest;
917 } \end{verbatim}
918 }
919 }
920 \caption{Implementation of SSR's receive\_from\_to library function.}
921 \label{figImplLib}
922 \end{figure}
925 In Figure \ref{figImplLib}, notice that the request's data is on the stack of the virtual
926 processor that's animating the call, which is the \texttt{receiveVP}. The
927 \texttt{VMS\_\_send\_sem\_request} suspends this VP, which changes the
928 physical core's stack pointer to a different stack. So the request data is
929 guaranteed to remain undisturbed while the VP is suspended.
931 Figure \ref{figAssembly} shows the implementation of the VMS suspend primitive. As seen in
932 Figure \ref{figInternals}, suspending the \texttt{receiveVP} involves
933 switching to the \texttt{core\_loop}. In our implementation, this is done by
934 switching to the stack of the pthread pinned to the physical core and then
935 jumping to the start-point of \texttt{core\_loop}.
937 This code uses standard techniques commonly employed in co-routine
938 implementations. Tuning effort spent in \texttt{core\_loop} is inherited by
939 all applications.
942 \begin{figure}[ht]
943 {\noindent
944 {\scriptsize
945 \begin{verbatim}
946 VMS__suspend_procr( VirtProcr *animatingVP )
947 { animatingVP->resumeInstrAddr = &&ResumePt;
948 //GCC takes addr of label
949 animatingVP->schedSlotAssignedTo->
950 isNewlySuspended = TRUE;
951 //for master_loop to see
952 <assembly code stores current physical core's
953 stack reg into animatingVP struct>
954 <assembly code loads stack reg with core_loop
955 stackPtr, which was saved into animatingVP>
956 <assembly code jmps to core_loop start instr
957 addr, which was also saved into animatingVP>
958 ResumePt:
959 return;
960 } \end{verbatim}
961 }
962 }
963 \caption
964 {Implementation of VMS suspend processor.
965 Re-animating the virtual processor reverses this sequence. \ It saves the
966 \texttt{core\_loop}'s resume instr-addr and stack ptr into the VP structure,
967 then loads the VP's stack ptr and jmps to its \texttt{resumeInstrAddr}.
968 }
969 \label{figAssembly}
970 \end{figure}
973 {\bf Plugin View:\ }SSR's communication handler dispatches on the
974 \texttt{reqType} field of the request data, as set by the
975 \texttt{SSR\_\_receive\_from\_to} code. It calls the handler code in
976 Figure \ref{figReqHdlr}. This constructs a hash-key, by concatenating the from-VP's pointer
977 with the to-VP's pointer. Then it looks-up that key in the hash-table that SSR uses
978 to match sends with receives, which is in the shared semantic state seen at
979 the top of Figure \ref{figInternals} in Section \ref{secInternal}.
981 The most important feature in Figure \ref{figReqHdlr} is that both send and receive
982 will construct the same key, so will find the same
983 hash entry. Whichever request is handled first in Virtual time will see the
984 hash entry empty, and save itself in that entry. The second to arrive
985 sees the waiting request and then resumes both VPs, by putting them into their
986 \texttt{readyQ}s.
988 Access to the shared hash
989 table can be considered private, as in a sequential algorithm. This is because
990 our VMS-core implementation ensures that only
991 one handler on one core is executing at a time.
994 \begin{figure}[ht]
995 {\noindent
996 {\scriptsize
997 \begin{verbatim}
998 handle_receive_from_to( VirtProcr *requestingVP,
999 SSRSemReq *reqData, SSRSemEnv *semEnv )
1000 { commHashTbl = semEnv->communicatingVPHashTable;
1001 key[0] = reqData->receiveVP; key[1] =
1002 reqData->sendVP; //send uses same key
1003 waitingReqData = lookup_and_remove( key,
1004 commHashTbl ); //get waiting request
1005 if( waitingReqData != NULL )
1006 { resume_virt_procr( waitingReqData->sendVP );
1007 resume_virt_procr( waitingReqData->
1008 receiveVP );
1009 }
1010 else
1011 insert( key, reqData, commHashTbl );
1012 //receive is first to arrive, make it wait
1013 } \end{verbatim}
1014 }
1015 }
1016 \caption
1017 {Pseudo-code of communication-handler for
1018 \texttt{receive\_from\_to} request type. The \texttt{semEnv} is a pointer
1019 to the shared parallelism-semantic state seen at the top of Figure
1020 \ref{figInternals}.
1021 }
1022 \label{figReqHdlr}
1023 \end{figure}
1028 \section{Results}
1029 \label{secResults}
1031 {\bf Setup:\ }We implemented blocked dense matrix multiply with right
1032 sub-matrices copied to transposed form. We ran on a 1 socket by 4 core
1033 Core2Quad 2.4Ghz chip.
1035 {\bf Implementation-Time:\ }As shown in Table 1, time to implement the
1036 three parallel libraries averages 2 days each. As an example of productivity,
1037 adding nested transactions, parallel singleton, and atomic function-execution
1038 to SSR required a single afternoon, totaling less than 100 lines of C code.
1041 \begin{table}[ht]
1042 \caption
1043 {Person-days to design, code, and
1044 test each parallelism library. L.O.C. is lines of (original) C code, excluding libraries and
1045 comments.
1046 }
1047 \label{tabPersonDaysLang}
1048 \begin{center}
1049 \begin{tabular}{llll}
1050 & SSR & Vthread & VCilk\\
1051 Design & 4 & 1 & 0.5\\
1052 Code & 2 & 0.5 & 0.5\\
1053 Test & 1 & 0.5 & 0.5\\
1054 L.O.C. & 470 & 290 & 310
1055 \end{tabular}
1056 \end{center}
1057 \end{table}
1060 {\bf Execution Performance:\ }Performance of VMS is seen in Table
1061 \ref{tabOverheadCycles}. The code is not optimized, but rather written
1062 to be easy to understand and modify. The majority of the plugin time
1063 is lost to cache misses because the shared parallelism-semantic state moves
1064 between cores on a majority of accesses. Acquisition of the master lock is slow due to the hardware
1065 implementing the CAS instruction.
1067 Existing techniques will likely improve performance, such
1068 as localizing semantic data to cores, splitting malloc across the cores,
1069 pre-allocating slabs that are recycled, and pre-fetching. However, in many cases, several hundred
1070 nano-seconds per task is as optimal as the applications can benefit from.
1072 \begin{table}[ht]
1073 \caption
1074 {Cycles of overhead, per scheduled
1075 slave. ``comp only'' is perfect memory, ``comp + mem'' is actual cycles.
1076 ``Plugin-concur'' only concurrency requests, ``plugin-all'' includes
1077 create and malloc requests. Two significant digits due to variability.
1078 }
1079 \label{tabOverheadCycles}
1080 \begin{center}
1081 \begin{tabular}{|l|ll|r|r|}
1082 \hline
1083 & & comp & comp\\
1084 & & only & +mem\\
1085 VMS Only & \texttt{master\_loop} & 91 & \ 110\\
1086 & switch VPs & 77 & \ 130\\
1087 & (malloc) & 160 & 2300\\
1088 & (create VP) & 540 & 3800\\
1089 \hline
1090 Language: & & & \\
1091 SSR & plugin -- concur & 190 & 540\\
1092 & plugin -- all & 530 & 2200\\
1093 & lock & & 250\\
1094 Vthread & plugin -- concur & 66 & 710\\
1095 & plugin -- all & 180 & 1500\\
1096 & lock & & 250\\
1097 VCilk & plugin -- concur & 65 & 260\\
1098 & plugin -- all & 330 & 1800\\
1099 & lock & & 250\\
1100 \hline
1101 \end{tabular}
1102 \end{center}
1103 \end{table}
1107 {\bf Head to Head:\ }We
1108 compare our implementation of the \texttt{spawn} and \texttt{sync}
1109 constructs against Cilk 5.4, on the top in Table
1110 \ref{tabHeadToHead}, which shows that the same application code has similar
1111 performance. For large matrices, Cilk 5.4's better use of the memory hierarchy
1112 achieves 23\% better performance. However, for small matrices, VCilk is better,
1113 with a factor 2 lower overhead. Cilk 5.4 does not allow controlling the number of spawn events it
1114 actually executes, and chooses to run smaller matrices sequentially, limiting
1115 our comparison.
1117 When comparing to pthreads, our VMS based implementation has more than an
1118 order of magnitude better overhead per invocation of mutex or condition
1119 variable functionality, as seen on the bottom of Table \ref{tabHeadToHead}.
1120 Applications that inherently have short trace segments will synchronize often
1121 and benefit the most from Vthread.
1125 \begin{table}[ht]
1126 \caption
1127 {On top, exe time in seconds for MM. Below, overhead for pthread vs Vthread. First column is cycles
1128 for perfect memory and second is total measured cycles. pthread cycles are
1129 deduced from round-trip experiments.
1130 }
1131 \label{tabHeadToHead}
1132 \begin{center}
1133 \begin{tabular}{|l@{\ }|@{\ }lr|}
1134 \hline
1135 \rule{0pt}{12pt}
1136 Matrix size&Lang.&sec.\\
1137 [2pt]\hline
1138 81x81 & Cilk & 0.017\\
1139 & VCilk & 0.008\\
1140 \hline
1141 324x324 & Cilk & 0.13\\
1142 & VCilk & 0.13\\
1143 \hline
1144 648x648 & Cilk & 0.71\\
1145 & VCilk & 0.85\\
1146 \hline
1147 1296x1296 & Cilk & 4.8\\
1148 & VCilk & 6.2 \\
1149 [2pt]\hline
1150 \end{tabular}
1151 \begin{tabular}{c}
1152 \begin{tabular}{|l|rr|r|r|}
1153 \hline
1154 operation & \multicolumn{2}{c|}{Vthread} & pthread & ratio\\
1155 \hline
1156 & comp & total & & \\
1157 & only & & & \\
1158 [2pt]\hline
1159 mutex\_lock & 85 & 1050 & 50,000 & 48:1\\
1160 mutex\_unlock & 85 & 610 & 45,000 & 74:1\\
1161 cond\_wait & 85 & 850 & 60,000 & 71:1\\
1162 cond\_signal & 90 & 650 & 60,000 & 92:1\\
1163 \hline
1164 \end{tabular}\\
1165 \end{tabular}
1166 \end{center}
1167 \end{table}
1169 \section{Conclusion}\label{secConclusion}
1171 We have shown an alternative to the Thread model that enables easier-to-use
1172 parallelism constructs by splitting
1173 the scheduler open, to accept new parallelism constructs in the form of
1174 plugins. This gives the language control over assigning virtual
1175 processors to physical cores, for performance, debugging, and flexibility
1176 benefits. Parallelism constructs of languages are implementable using
1177 sequential algorithms, within a matter of days, while maintaining low run-time
1178 overhead, on the order of a few hundred nano-seconds per concurrency
1179 operation.
1181 \bibliography{Bib_for_papers}
1183 \end{document}
