| rev |
line source |
|
Me@0
|
1
|
|
Me@0
|
2
|
|
Me@0
|
3 \documentclass[conference]{IEEEtran}
|
|
Me@0
|
4 %
|
|
Me@0
|
5 \usepackage{makeidx,geometry,amssymb,graphicx,calc,ifthen}
|
|
Me@0
|
6 %
|
|
Me@0
|
7
|
|
Me@0
|
8 % *** CITATION PACKAGES ***
|
|
Me@0
|
9 %
|
|
Me@0
|
10 %\usepackage{cite}
|
|
Me@0
|
11 % cite.sty was written by Donald Arseneau
|
|
Me@0
|
12 % V1.6 and later of IEEEtran pre-defines the format of the cite.sty package
|
|
Me@0
|
13 % \cite{} output to follow that of IEEE. Loading the cite package will
|
|
Me@0
|
14 % result in citation numbers being automatically sorted and properly
|
|
Me@0
|
15 % "compressed/ranged". e.g., [1], [9], [2], [7], [5], [6] without using
|
|
Me@0
|
16 % cite.sty will become [1], [2], [5]--[7], [9] using cite.sty. cite.sty's
|
|
Me@0
|
17 % \cite will automatically add leading space, if needed. Use cite.sty's
|
|
Me@0
|
18 % noadjust option (cite.sty V3.8 and later) if you want to turn this off.
|
|
Me@0
|
19 % cite.sty is already installed on most LaTeX systems. Be sure and use
|
|
Me@0
|
20 % version 4.0 (2003-05-27) and later if using hyperref.sty. cite.sty does
|
|
Me@0
|
21 % not currently provide for hyperlinked citations.
|
|
Me@0
|
22 % The latest version can be obtained at:
|
|
Me@0
|
23 % http://www.ctan.org/tex-archive/macros/latex/contrib/cite/
|
|
Me@0
|
24 % The documentation is contained in the cite.sty file itself.
|
|
Me@0
|
25
|
|
Me@0
|
26
|
|
Me@0
|
27
|
|
Me@0
|
28
|
|
Me@0
|
29
|
|
Me@0
|
30
|
|
Me@0
|
31 % *** GRAPHICS RELATED PACKAGES ***
|
|
Me@0
|
32 %
|
|
Me@0
|
33 \ifCLASSINFOpdf
|
|
Me@0
|
34 % \usepackage[pdftex]{graphicx}
|
|
Me@0
|
35 % declare the path(s) where your graphic files are
|
|
Me@0
|
36 % \graphicspath{{../pdf/}{../jpeg/}}
|
|
Me@0
|
37 % and their extensions so you won't have to specify these with
|
|
Me@0
|
38 % every instance of \includegraphics
|
|
Me@0
|
39 % \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
|
|
Me@0
|
40 \else
|
|
Me@0
|
41 % or other class option (dvipsone, dvipdf, if not using dvips). graphicx
|
|
Me@0
|
42 % will default to the driver specified in the system graphics.cfg if no
|
|
Me@0
|
43 % driver is specified.
|
|
Me@0
|
44 % \usepackage[dvips]{graphicx}
|
|
Me@0
|
45 % declare the path(s) where your graphic files are
|
|
Me@0
|
46 % \graphicspath{{../eps/}}
|
|
Me@0
|
47 % and their extensions so you won't have to specify these with
|
|
Me@0
|
48 % every instance of \includegraphics
|
|
Me@0
|
49 % \DeclareGraphicsExtensions{.eps}
|
|
Me@0
|
50 \fi
|
|
Me@0
|
51 % graphicx was written by David Carlisle and Sebastian Rahtz. It is
|
|
Me@0
|
52 % required if you want graphics, photos, etc. graphicx.sty is already
|
|
Me@0
|
53 % installed on most LaTeX systems. The latest version and documentation can
|
|
Me@0
|
54 % be obtained at:
|
|
Me@0
|
55 % http://www.ctan.org/tex-archive/macros/latex/required/graphics/
|
|
Me@0
|
56 % Another good source of documentation is "Using Imported Graphics in
|
|
Me@0
|
57 % LaTeX2e" by Keith Reckdahl which can be found as epslatex.ps or
|
|
Me@0
|
58 % epslatex.pdf at: http://www.ctan.org/tex-archive/info/
|
|
Me@0
|
59 %
|
|
Me@0
|
60 % latex, and pdflatex in dvi mode, support graphics in encapsulated
|
|
Me@0
|
61 % postscript (.eps) format. pdflatex in pdf mode supports graphics
|
|
Me@0
|
62 % in .pdf, .jpeg, .png and .mps (metapost) formats. Users should ensure
|
|
Me@0
|
63 % that all non-photo figures use a vector format (.eps, .pdf, .mps) and
|
|
Me@0
|
64 % not a bitmapped formats (.jpeg, .png). IEEE frowns on bitmapped formats
|
|
Me@0
|
65 % which can result in "jaggedy"/blurry rendering of lines and letters as
|
|
Me@0
|
66 % well as large increases in file sizes.
|
|
Me@0
|
67 %
|
|
Me@0
|
68 % You can find documentation about the pdfTeX application at:
|
|
Me@0
|
69 % http://www.tug.org/applications/pdftex
|
|
Me@0
|
70
|
|
Me@0
|
71
|
|
Me@0
|
72
|
|
Me@0
|
73
|
|
Me@0
|
74
|
|
Me@0
|
75 % *** MATH PACKAGES ***
|
|
Me@0
|
76 %
|
|
Me@0
|
77 %\usepackage[cmex10]{amsmath}
|
|
Me@0
|
78 % A popular package from the American Mathematical Society that provides
|
|
Me@0
|
79 % many useful and powerful commands for dealing with mathematics. If using
|
|
Me@0
|
80 % it, be sure to load this package with the cmex10 option to ensure that
|
|
Me@0
|
81 % only type 1 fonts will utilized at all point sizes. Without this option,
|
|
Me@0
|
82 % it is possible that some math symbols, particularly those within
|
|
Me@0
|
83 % footnotes, will be rendered in bitmap form which will result in a
|
|
Me@0
|
84 % document that can not be IEEE Xplore compliant!
|
|
Me@0
|
85 %
|
|
Me@0
|
86 % Also, note that the amsmath package sets \interdisplaylinepenalty to 10000
|
|
Me@0
|
87 % thus preventing page breaks from occurring within multiline equations. Use:
|
|
Me@0
|
88 %\interdisplaylinepenalty=2500
|
|
Me@0
|
89 % after loading amsmath to restore such page breaks as IEEEtran.cls normally
|
|
Me@0
|
90 % does. amsmath.sty is already installed on most LaTeX systems. The latest
|
|
Me@0
|
91 % version and documentation can be obtained at:
|
|
Me@0
|
92 % http://www.ctan.org/tex-archive/macros/latex/required/amslatex/math/
|
|
Me@0
|
93
|
|
Me@0
|
94
|
|
Me@0
|
95
|
|
Me@0
|
96
|
|
Me@0
|
97
|
|
Me@0
|
98 % *** SPECIALIZED LIST PACKAGES ***
|
|
Me@0
|
99 %
|
|
Me@0
|
100 %\usepackage{algorithmic}
|
|
Me@0
|
101 % algorithmic.sty was written by Peter Williams and Rogerio Brito.
|
|
Me@0
|
102 % This package provides an algorithmic environment fo describing algorithms.
|
|
Me@0
|
103 % You can use the algorithmic environment in-text or within a figure
|
|
Me@0
|
104 % environment to provide for a floating algorithm. Do NOT use the algorithm
|
|
Me@0
|
105 % floating environment provided by algorithm.sty (by the same authors) or
|
|
Me@0
|
106 % algorithm2e.sty (by Christophe Fiorio) as IEEE does not use dedicated
|
|
Me@0
|
107 % algorithm float types and packages that provide these will not provide
|
|
Me@0
|
108 % correct IEEE style captions. The latest version and documentation of
|
|
Me@0
|
109 % algorithmic.sty can be obtained at:
|
|
Me@0
|
110 % http://www.ctan.org/tex-archive/macros/latex/contrib/algorithms/
|
|
Me@0
|
111 % There is also a support site at:
|
|
Me@0
|
112 % http://algorithms.berlios.de/index.html
|
|
Me@0
|
113 % Also of interest may be the (relatively newer and more customizable)
|
|
Me@0
|
114 % algorithmicx.sty package by Szasz Janos:
|
|
Me@0
|
115 % http://www.ctan.org/tex-archive/macros/latex/contrib/algorithmicx/
|
|
Me@0
|
116
|
|
Me@0
|
117
|
|
Me@0
|
118
|
|
Me@0
|
119
|
|
Me@0
|
120 % *** ALIGNMENT PACKAGES ***
|
|
Me@0
|
121 %
|
|
Me@0
|
122 %\usepackage{array}
|
|
Me@0
|
123 % Frank Mittelbach's and David Carlisle's array.sty patches and improves
|
|
Me@0
|
124 % the standard LaTeX2e array and tabular environments to provide better
|
|
Me@0
|
125 % appearance and additional user controls. As the default LaTeX2e table
|
|
Me@0
|
126 % generation code is lacking to the point of almost being broken with
|
|
Me@0
|
127 % respect to the quality of the end results, all users are strongly
|
|
Me@0
|
128 % advised to use an enhanced (at the very least that provided by array.sty)
|
|
Me@0
|
129 % set of table tools. array.sty is already installed on most systems. The
|
|
Me@0
|
130 % latest version and documentation can be obtained at:
|
|
Me@0
|
131 % http://www.ctan.org/tex-archive/macros/latex/required/tools/
|
|
Me@0
|
132
|
|
Me@0
|
133
|
|
Me@0
|
134 %\usepackage{mdwmath}
|
|
Me@0
|
135 %\usepackage{mdwtab}
|
|
Me@0
|
136 % Also highly recommended is Mark Wooding's extremely powerful MDW tools,
|
|
Me@0
|
137 % especially mdwmath.sty and mdwtab.sty which are used to format equations
|
|
Me@0
|
138 % and tables, respectively. The MDWtools set is already installed on most
|
|
Me@0
|
139 % LaTeX systems. The lastest version and documentation is available at:
|
|
Me@0
|
140 % http://www.ctan.org/tex-archive/macros/latex/contrib/mdwtools/
|
|
Me@0
|
141
|
|
Me@0
|
142
|
|
Me@0
|
143 % IEEEtran contains the IEEEeqnarray family of commands that can be used to
|
|
Me@0
|
144 % generate multiline equations as well as matrices, tables, etc., of high
|
|
Me@0
|
145 % quality.
|
|
Me@0
|
146
|
|
Me@0
|
147
|
|
Me@0
|
148 %\usepackage{eqparbox}
|
|
Me@0
|
149 % Also of notable interest is Scott Pakin's eqparbox package for creating
|
|
Me@0
|
150 % (automatically sized) equal width boxes - aka "natural width parboxes".
|
|
Me@0
|
151 % Available at:
|
|
Me@0
|
152 % http://www.ctan.org/tex-archive/macros/latex/contrib/eqparbox/
|
|
Me@0
|
153
|
|
Me@0
|
154
|
|
Me@0
|
155
|
|
Me@0
|
156
|
|
Me@0
|
157
|
|
Me@0
|
158 % *** SUBFIGURE PACKAGES ***
|
|
Me@0
|
159 %\usepackage[tight,footnotesize]{subfigure}
|
|
Me@0
|
160 % subfigure.sty was written by Steven Douglas Cochran. This package makes it
|
|
Me@0
|
161 % easy to put subfigures in your figures. e.g., "Figure 1a and 1b". For IEEE
|
|
Me@0
|
162 % work, it is a good idea to load it with the tight package option to reduce
|
|
Me@0
|
163 % the amount of white space around the subfigures. subfigure.sty is already
|
|
Me@0
|
164 % installed on most LaTeX systems. The latest version and documentation can
|
|
Me@0
|
165 % be obtained at:
|
|
Me@0
|
166 % http://www.ctan.org/tex-archive/obsolete/macros/latex/contrib/subfigure/
|
|
Me@0
|
167 % subfigure.sty has been superceeded by subfig.sty.
|
|
Me@0
|
168
|
|
Me@0
|
169
|
|
Me@0
|
170
|
|
Me@0
|
171 %\usepackage[caption=false]{caption}
|
|
Me@0
|
172 %\usepackage[font=footnotesize]{subfig}
|
|
Me@0
|
173 % subfig.sty, also written by Steven Douglas Cochran, is the modern
|
|
Me@0
|
174 % replacement for subfigure.sty. However, subfig.sty requires and
|
|
Me@0
|
175 % automatically loads Axel Sommerfeldt's caption.sty which will override
|
|
Me@0
|
176 % IEEEtran.cls handling of captions and this will result in nonIEEE style
|
|
Me@0
|
177 % figure/table captions. To prevent this problem, be sure and preload
|
|
Me@0
|
178 % caption.sty with its "caption=false" package option. This is will preserve
|
|
Me@0
|
179 % IEEEtran.cls handing of captions. Version 1.3 (2005/06/28) and later
|
|
Me@0
|
180 % (recommended due to many improvements over 1.2) of subfig.sty supports
|
|
Me@0
|
181 % the caption=false option directly:
|
|
Me@0
|
182 %\usepackage[caption=false,font=footnotesize]{subfig}
|
|
Me@0
|
183 %
|
|
Me@0
|
184 % The latest version and documentation can be obtained at:
|
|
Me@0
|
185 % http://www.ctan.org/tex-archive/macros/latex/contrib/subfig/
|
|
Me@0
|
186 % The latest version and documentation of caption.sty can be obtained at:
|
|
Me@0
|
187 % http://www.ctan.org/tex-archive/macros/latex/contrib/caption/
|
|
Me@0
|
188
|
|
Me@0
|
189
|
|
Me@0
|
190
|
|
Me@0
|
191
|
|
Me@0
|
192 % *** FLOAT PACKAGES ***
|
|
Me@0
|
193 %
|
|
Me@0
|
194 %\usepackage{fixltx2e}
|
|
Me@0
|
195 % fixltx2e, the successor to the earlier fix2col.sty, was written by
|
|
Me@0
|
196 % Frank Mittelbach and David Carlisle. This package corrects a few problems
|
|
Me@0
|
197 % in the LaTeX2e kernel, the most notable of which is that in current
|
|
Me@0
|
198 % LaTeX2e releases, the ordering of single and double column floats is not
|
|
Me@0
|
199 % guaranteed to be preserved. Thus, an unpatched LaTeX2e can allow a
|
|
Me@0
|
200 % single column figure to be placed prior to an earlier double column
|
|
Me@0
|
201 % figure. The latest version and documentation can be found at:
|
|
Me@0
|
202 % http://www.ctan.org/tex-archive/macros/latex/base/
|
|
Me@0
|
203
|
|
Me@0
|
204
|
|
Me@0
|
205
|
|
Me@0
|
206 %\usepackage{stfloats}
|
|
Me@0
|
207 % stfloats.sty was written by Sigitas Tolusis. This package gives LaTeX2e
|
|
Me@0
|
208 % the ability to do double column floats at the bottom of the page as well
|
|
Me@0
|
209 % as the top. (e.g., "\begin{figure*}[!b]" is not normally possible in
|
|
Me@0
|
210 % LaTeX2e). It also provides a command:
|
|
Me@0
|
211 %\fnbelowfloat
|
|
Me@0
|
212 % to enable the placement of footnotes below bottom floats (the standard
|
|
Me@0
|
213 % LaTeX2e kernel puts them above bottom floats). This is an invasive package
|
|
Me@0
|
214 % which rewrites many portions of the LaTeX2e float routines. It may not work
|
|
Me@0
|
215 % with other packages that modify the LaTeX2e float routines. The latest
|
|
Me@0
|
216 % version and documentation can be obtained at:
|
|
Me@0
|
217 % http://www.ctan.org/tex-archive/macros/latex/contrib/sttools/
|
|
Me@0
|
218 % Documentation is contained in the stfloats.sty comments as well as in the
|
|
Me@0
|
219 % presfull.pdf file. Do not use the stfloats baselinefloat ability as IEEE
|
|
Me@0
|
220 % does not allow \baselineskip to stretch. Authors submitting work to the
|
|
Me@0
|
221 % IEEE should note that IEEE rarely uses double column equations and
|
|
Me@0
|
222 % that authors should try to avoid such use. Do not be tempted to use the
|
|
Me@0
|
223 % cuted.sty or midfloat.sty packages (also by Sigitas Tolusis) as IEEE does
|
|
Me@0
|
224 % not format its papers in such ways.
|
|
Me@0
|
225
|
|
Me@0
|
226
|
|
Me@0
|
227
|
|
Me@0
|
228
|
|
Me@0
|
229
|
|
Me@0
|
230 % *** PDF, URL AND HYPERLINK PACKAGES ***
|
|
Me@0
|
231 %
|
|
Me@0
|
232 %\usepackage{url}
|
|
Me@0
|
233 % url.sty was written by Donald Arseneau. It provides better support for
|
|
Me@0
|
234 % handling and breaking URLs. url.sty is already installed on most LaTeX
|
|
Me@0
|
235 % systems. The latest version can be obtained at:
|
|
Me@0
|
236 % http://www.ctan.org/tex-archive/macros/latex/contrib/misc/
|
|
Me@0
|
237 % Read the url.sty source comments for usage information. Basically,
|
|
Me@0
|
238 % \url{my_url_here}.
|
|
Me@0
|
239
|
|
Me@0
|
240
|
|
Me@0
|
241
|
|
Me@0
|
242
|
|
Me@0
|
243
|
|
Me@0
|
244 % *** Do not adjust lengths that control margins, column widths, etc. ***
|
|
Me@0
|
245 % *** Do not use packages that alter fonts (such as pslatex). ***
|
|
Me@0
|
246 % There should be no need to do such things with IEEEtran.cls V1.6 and later.
|
|
Me@0
|
247 % (Unless specifically asked to do so by the journal or conference you plan
|
|
Me@0
|
248 % to submit to, of course. )
|
|
Me@0
|
249
|
|
Me@0
|
250
|
|
Me@0
|
251 % correct bad hyphenation here
|
|
Me@0
|
252 \hyphenation{op-tical net-works semi-conduc-tor}
|
|
Me@0
|
253
|
|
Me@0
|
254
|
|
Me@0
|
255 \begin{document}
|
|
Me@0
|
256 \bibliographystyle{plain}
|
|
Me@0
|
257 %
|
|
Me@0
|
258
|
|
Me@0
|
259 \title{The Shape of Future Architectures}
|
|
Me@0
|
260
|
|
Me@0
|
261 \author
|
|
Me@0
|
262 {
|
|
Me@0
|
263 \IEEEauthorblockN{Sean Halle}
|
|
Me@0
|
264 \IEEEauthorblockA
|
|
Me@0
|
265 {
|
|
Me@0
|
266 University of California at Santa Cruz\\
|
|
Me@0
|
267 and INRIA Paris\\
|
|
Me@0
|
268 Email: sean.halle@inria.fr
|
|
Me@0
|
269 }
|
|
Me@0
|
270 \and
|
|
Me@0
|
271 \IEEEauthorblockN{Albert Cohen}
|
|
Me@0
|
272 \IEEEauthorblockA
|
|
Me@0
|
273 { INRIA Paris\\
|
|
Me@0
|
274 Email: albert.cohen@infria.fr
|
|
Me@0
|
275 }
|
|
Me@0
|
276 }
|
|
Me@0
|
277
|
|
Me@0
|
278
|
|
Me@0
|
279 \maketitle
|
|
Me@0
|
280 %
|
|
Me@0
|
281
|
|
Me@0
|
282 \begin{abstract}
|
|
Me@0
|
283 Process technology is rapidly approaching multiple fundamental physical limits that combine to increase the cost of producing chips faster than the increase in density. Whether or not process technology continues to shrink, the economics will slow the demand for chips from future processes. This will, for the first time, shift the design decisions from being shortest time-to-market towards being the greatest efficiency per transistor -- either in computation delivered towards the goal end-user measurement such as throughput, or in energy per operation.
|
|
Me@0
|
284
|
|
Me@0
|
285 The implication for processor architecture is to accelerate the trend away from "easy to program" structures such as out-of-order pipelines and coherent shared memory, and towards hierarchical designs that force application information to be exploited in explicitly managing the placement and movement of data, in order to keep it as local for as much of the computation as possible. The cost of communication will be the dominant factor driving the shape of future architectures, both from a computation-bottleneck cost and an energy per operation cost.
|
|
Me@0
|
286
|
|
Me@0
|
287 Integration will be pushed towards the third dimension in order to mitigate the communication costs -- indeed main memories have long been on that path. This approach will hasten the development of 3 dimensional integration processes. The architecture implication is that
|
|
Me@0
|
288 \end{abstract}
|
|
Me@0
|
289
|
|
Me@0
|
290 \section{Motivation}
|
|
Me@0
|
291
|
|
Me@0
|
292 Threads were invented to be virtual sequential processors, but are
|
|
Me@0
|
293 problematic when one has multiple physical cores. Not only
|
|
Me@0
|
294 are they difficult to use for parallel code, but their parallel performance is poor
|
|
Me@0
|
295 because they block the language from controlling
|
|
Me@0
|
296 task placement. The language often knows which cores are likely to already
|
|
Me@0
|
297 have data in the cache which a new task requires as input. If it had
|
|
Me@0
|
298 control, the language could place tasks where the data is likely to already
|
|
Me@0
|
299 reside, reducing communication and increasing performance and power
|
|
Me@0
|
300 efficiency. A thread replacement is needed that is both easy to use in
|
|
Me@0
|
301 applications and gives the language control over task placement.
|
|
Me@0
|
302
|
|
Me@0
|
303 The first question is, what is the easiest to use parallelism construct? There
|
|
Me@0
|
304 exist many to choose from: Threads (for legacy reasons); Actors
|
|
Me@0
|
305 {\cite{Hewitt10}}{\cite{Actors97}}; Components {\cite{ComponentModel00}};
|
|
Me@0
|
306 process calculi {\cite{hoare78}} {\cite{milner99}};
|
|
Me@0
|
307 coordination languages {\cite{Gelernter85Linda}}; and new ones
|
|
Me@0
|
308 continually invented.
|
|
Me@0
|
309
|
|
Me@0
|
310 We demonstrate in this paper a thread replacement that lets any such parallelism
|
|
Me@0
|
311 construct be plugged in. It is the first extensible hardware abstraction,
|
|
Me@0
|
312 allowing to plug-in both preferred parallelism constructs and preferred
|
|
Me@0
|
313 runtime scheduler. We focus in this introductory paper on
|
|
Me@0
|
314 the parallelism construct support, illustrating how to implement synchronous
|
|
Me@0
|
315 \texttt{send}-\texttt{receive} constructs motivated by process calculi,
|
|
Me@0
|
316 the \texttt{spawn} and \texttt{sync} constructs from Cilk, and even
|
|
Me@0
|
317 \texttt{mutex} and \texttt{condition variable} constructs from pthreads.
|
|
Me@0
|
318 However, the schedulers we implemented have no
|
|
Me@0
|
319 interesting performance optimizations.
|
|
Me@0
|
320
|
|
Me@0
|
321 The next question is, how much effort is required to implement a plugin? Our
|
|
Me@0
|
322 mechanism protects the plugin from exposure to concurrency issues, allowing
|
|
Me@0
|
323 the parallelism constructs to be implemented with sequential algorithms. It
|
|
Me@0
|
324 accomplishes this by introducing a ``virtual'' time that sequentializes events
|
|
Me@0
|
325 that appear, to the program, to be simultaneous.
|
|
Me@0
|
326
|
|
Me@0
|
327 We call our abstraction Virtualized Master-Slave, or VMS. We demonstrate a user-level
|
|
Me@0
|
328 implementation in this paper, although it is ideally implemented as the base
|
|
Me@0
|
329 hardware abstraction of the OS.
|
|
Me@0
|
330
|
|
Me@0
|
331 It involves subtle concepts related to time in the program vs time in the
|
|
Me@0
|
332 plugin vs physical time. We explain it in four levels, starting with an abstract definition
|
|
Me@0
|
333 and moving down to implementation.
|
|
Me@0
|
334
|
|
Me@0
|
335 The first level is the theoretical definition, given in Section \ref{secAbsModel}. Second is the elements
|
|
Me@0
|
336 of our implementation and how they relate to the theoretical definition, in Section \ref{secInternal}.
|
|
Me@0
|
337 Third is the application code point of view, in Section \ref{secApp}, which is tied back to both
|
|
Me@0
|
338 the abstract model and the internal elements. The fourth view,
|
|
Me@0
|
339 also in Section \ref{secApp}, shows the implementation of the plugin for a parallel
|
|
Me@0
|
340 construct. Additionally,
|
|
Me@0
|
341 measurements of effectiveness appear
|
|
Me@0
|
342 in Section \ref{secResults} and conclusion is in \ref{secConclusion}.
|
|
Me@0
|
343
|
|
Me@0
|
344 \section{Background and Related Work}
|
|
Me@0
|
345
|
|
Me@0
|
346 {\bf Side-stepping OS Threads:\ }User-level thread packages and most parallel
|
|
Me@0
|
347 language runtimes have to side-step OS threads, by pinning one to each core,
|
|
Me@0
|
348 which effectively gives the user-level package control over the core. Our VMS
|
|
Me@0
|
349 implementation also does this. We are not claiming in this paper to have the
|
|
Me@0
|
350 OS level implementation of VMS that the model is capable of -- just the
|
|
Me@0
|
351 user-space version.
|
|
Me@0
|
352
|
|
Me@0
|
353 Related work either provides a variation on the Thread model, or is a full
|
|
Me@0
|
354 language with included parallelism constructs. For example,
|
|
Me@0
|
355 TBB {\cite{TBBHome}} is a user-space thread package with added features.
|
|
Me@0
|
356 It allows
|
|
Me@0
|
357 tasks to create other tasks and control when to start their
|
|
Me@0
|
358 execution. However, one cannot plug in alternate parallelism-control constructs, such as
|
|
Me@0
|
359 rendez-vous style send and receive, which would have to be made using locks.
|
|
Me@0
|
360 Likewise, Cilk {\cite{CILKHome}}
|
|
Me@0
|
361 provides a simpler and easier to use version of TBB's task scheduler, but is
|
|
Me@0
|
362 limited to only tree-shaped task graphs. OpenMP {\cite{OpenMPHome}} allows
|
|
Me@0
|
363 creating tasks and controlling their execution order,
|
|
Me@0
|
364 but is a language, not a hardware abstraction. All of these may
|
|
Me@0
|
365 alternatively be implemented via plugins to VMS, with similar performance.
|
|
Me@0
|
366
|
|
Me@0
|
367 In contrast, VMS is the first hardware abstraction that doesn't impose its own concurrency
|
|
Me@0
|
368 semantics, but rather takes preferred ones as plugins. This makes it not a
|
|
Me@0
|
369 language itself, but a {\em{support}} mechanism for language level
|
|
Me@0
|
370 parallelism constructs. The parallelism constructs in Actors, Components,
|
|
Me@0
|
371 pthreads, and so on may all be implemented as VMS plugins.
|
|
Me@0
|
372
|
|
Me@0
|
373 {\bf Virtual Processor (VP) Definition:\ }We want to avoid the confusion
|
|
Me@0
|
374 associated with the terms ``thread'' and ``task'' so will use the term
|
|
Me@0
|
375 {\em{virtual processor}} (VP), which we define as state in combination with
|
|
Me@0
|
376 the ability to animate code or {\em{an additional level of virtual
|
|
Me@0
|
377 processors}}. The state consists of a stack with its contents, a program
|
|
Me@0
|
378 counter, a pointer to top of stack, and a pointer to the current stack frame.
|
|
Me@0
|
379
|
|
Me@0
|
380 \section{Abstract Definition of VMS}
|
|
Me@0
|
381 \label{secAbsModel}
|
|
Me@0
|
382
|
|
Me@0
|
383 We give an intuitive overview, then add details in the following sub-sections.
|
|
Me@0
|
384
|
|
Me@0
|
385 {\bf Intuitive Overview:\ }VMS is concerned primarily with time and
|
|
Me@0
|
386 guarantees about it. This is because parallelism constructs control how the
|
|
Me@0
|
387 time-lines of different virtual processors intersect. They also guarantee
|
|
Me@0
|
388 relations of time lines to hardware events.
|
|
Me@0
|
389
|
|
Me@0
|
390 As an example, consider a program that writes into a data structure in one
|
|
Me@0
|
391 time-line, then calls a \texttt{send} construct, meanwhile in a different
|
|
Me@0
|
392 time-line it calls the \texttt{receive} construct then reads the data
|
|
Me@0
|
393 structure. The constructs should guarantee that all data written before the
|
|
Me@0
|
394 \texttt{send} is readable in the other time-line after the
|
|
Me@0
|
395 \texttt{receive}. VMS provides primitive guarantees, which plugged-in code
|
|
Me@0
|
396 builds upon to provide such higher-level guarantees.
|
|
Me@0
|
397
|
|
Me@0
|
398 To support parallelism constructs, VMS provides: primitive operations to create and suspend
|
|
Me@0
|
399 VPs; a way for plugged-in code to control when each VP is (re)started; and
|
|
Me@0
|
400 time-related guarantees. These are enforced on all hardware, be it shared
|
|
Me@0
|
401 memory or distributed, with strong memory consistency or weak.
|
|
Me@0
|
402
|
|
Me@0
|
403 {\bf Definition in Three Parts:\ }We give the abstract definition in three
|
|
Me@0
|
404 parts: a definition of the elements of a VMS computation system; a definition
|
|
Me@0
|
405 of time and the key VMS guarantee; and a definition of virtual processor
|
|
Me@0
|
406 scheduling states and transitions between them.
|
|
Me@0
|
407
|
|
Me@0
|
408 The definition we give is for VMS {\em{with plugins present}}. \ Hence, it
|
|
Me@0
|
409 covers the behavior of all possible parallelism constructs implementable with
|
|
Me@0
|
410 VMS. The Master mentioned in the definition is an abstract
|
|
Me@0
|
411 entity, with a plugin present. In practice, this Master entity is implemented
|
|
Me@0
|
412 as part of a core VMS, and plugins later added. This VMS-core is the
|
|
Me@0
|
413 hardware abstraction. It hides the physical hardware behind an interface that
|
|
Me@0
|
414 creates virtual processors and enforces well-defined time-behavior.
|
|
Me@0
|
415
|
|
Me@0
|
416
|
|
Me@0
|
417 \subsection{The Elements of a VMS Computation System}
|
|
Me@0
|
418
|
|
Me@0
|
419 \texttt{- }A VMS program has multiple VPs, which are Slaves,
|
|
Me@0
|
420 each with an independent time-line.
|
|
Me@0
|
421
|
|
Me@0
|
422 \texttt{- }A schedule of Slaves is generated by a Master
|
|
Me@0
|
423 entity, from within separate time-line(s).
|
|
Me@0
|
424
|
|
Me@0
|
425 \texttt{- }A schedule is defined as the set of points at which
|
|
Me@0
|
426 VPs are (re)animated.
|
|
Me@0
|
427
|
|
Me@0
|
428 \texttt{- }All semantic parallelism behavior is invoked via
|
|
Me@0
|
429 communication with the Master.
|
|
Me@0
|
430
|
|
Me@0
|
431 \texttt{- }Communication with the Master happens by using a
|
|
Me@0
|
432 model-provided primitive, which causes {\em{voluntary}} suspension of the
|
|
Me@0
|
433 program's VP.
|
|
Me@0
|
434
|
|
Me@0
|
435
|
|
Me@0
|
436 What is important here is: that the choice of which VP is animated, at which
|
|
Me@0
|
437 point, happens in a separate time-line; and that the VPs suspend voluntarily
|
|
Me@0
|
438 for each parallelism construct. This means that {\em{scheduling is separated
|
|
Me@0
|
439 from the application code}}, the key point.
|
|
Me@0
|
440
|
|
Me@0
|
441 The Master entity appears to be a single entity to the slaves, but may be
|
|
Me@0
|
442 implemented by multiple Master VPs hidden inside the VMS implementation.
|
|
Me@0
|
443
|
|
Me@0
|
444 VPs use the Master as an intermediary to: semantically communicate with each
|
|
Me@0
|
445 other; cause creation of new program VPs; and to influence re-animation of
|
|
Me@0
|
446 VPs. As a subtlety, notice that hardware mechanisms, such as coherent shared
|
|
Me@0
|
447 memory, allow communication to take place that is not visible to the
|
|
Me@0
|
448 parallelism constructs. Parallelism constructs must be separately called in
|
|
Me@0
|
449 order to make use of shared variable communication safe.
|
|
Me@0
|
450
|
|
Me@0
|
451 {\bf Definitions:\ }VMS is intended only for hardware systems that consist of processing elements
|
|
Me@0
|
452 connected by communication. We define a memory-space to be a processing
|
|
Me@0
|
453 element, albeit without the ability to transform data. We define a
|
|
Me@0
|
454 {\em{physical core}} to be a processing element that {\em{does}} transform
|
|
Me@0
|
455 data, and require that it execute a sequential stream of instructions. We
|
|
Me@0
|
456 define a program-time as the sequence of instructions animated by a Slave VP
|
|
Me@0
|
457 (which is eventually animated by a physical core). A Slave VP has associated
|
|
Me@0
|
458 {\em{scheduling state}} that, among other things, relates to how its
|
|
Me@0
|
459 program-time progresses relative to physical time on the cores.
|
|
Me@0
|
460
|
|
Me@0
|
461 \subsection{Time in VMS}
|
|
Me@0
|
462
|
|
Me@0
|
463 \texttt{- }VMS has three levels of time: {\em{Program time}},
|
|
Me@0
|
464 {\em{ Master time}}, and {\em{Virtual time}}.
|
|
Me@0
|
465
|
|
Me@0
|
466 \texttt{- }Program time is local to a Slave VP, measured in
|
|
Me@0
|
467 instruction executions.
|
|
Me@0
|
468
|
|
Me@0
|
469 \texttt{- }Master time is hidden from the program and is
|
|
Me@0
|
470 independent from all Program times.
|
|
Me@0
|
471
|
|
Me@0
|
472 \texttt{- }Virtual time is the ordered set of changes in
|
|
Me@0
|
473 scheduling state of Slave VPs.
|
|
Me@0
|
474
|
|
Me@0
|
475
|
|
Me@0
|
476
|
|
Me@0
|
477 What is most important here is that Virtual time defines a global sequential
|
|
Me@0
|
478 ordering. This ordering is consistent with the key VMS guarantee (given
|
|
Me@0
|
479 below), and each point in it is computed within Master time.
|
|
Me@0
|
480
|
|
Me@0
|
481 Also, the independence between program times and master time has subtle
|
|
Me@0
|
482 advantages. It enables elegant enforcement of the VMS guarantee, and
|
|
Me@0
|
483 implementation simplifications that become clear after gaining deep
|
|
Me@0
|
484 implementation knowledge.
|
|
Me@0
|
485
|
|
Me@0
|
486 In VMS, each event relevant to parallel semantics is tied to a transition of
|
|
Me@0
|
487 the state of a Slave VP. This means that implementing the behavior of parallel
|
|
Me@0
|
488 semantics is equivalent to controlling the order of transitions of state of
|
|
Me@0
|
489 virtual processors.
|
|
Me@0
|
490
|
|
Me@0
|
491 {\bf Definitions:\ }The stream of instructions in a given program-time is
|
|
Me@0
|
492 broken into a number of {\em{trace-segments}}, separated by suspension
|
|
Me@0
|
493 points. Each trace-segment is animated by a single physical core, but not
|
|
Me@0
|
494 necessarily the same core as animated the other trace segments. A suspend
|
|
Me@0
|
495 point is created by a Slave VP executing the ``suspend'' primitive provided by
|
|
Me@0
|
496 VMS. A suspend point has no duration in program time, but has distinct start
|
|
Me@0
|
497 and end points in virtual time. The end-suspension points of two different
|
|
Me@0
|
498 program times can be tied together within virtual time, which is called a
|
|
Me@0
|
499 {\em{tie point}} and has special significance to parallel constructs. The
|
|
Me@0
|
500 physical-time of a core has no relationship to any program time, except for
|
|
Me@0
|
501 the various time-guarantees in this definition of VMS.
|
|
Me@0
|
502
|
|
Me@0
|
503 \begin{figure}[ht]
|
|
Me@0
|
504 \includegraphics[width=2.9in]{../figures/Time_in_VMS_1.png}
|
|
Me@0
|
505 \caption
|
|
Me@0
|
506 {Mapping program time onto Virtual time. \ The
|
|
Me@0
|
507 Master controls creation of new program time lines, and ending suspend
|
|
Me@0
|
508 points. Here, it has ended two suspend points at a common tie-point.
|
|
Me@0
|
509 }
|
|
Me@0
|
510 \label{figTimeMapping}
|
|
Me@0
|
511 \end{figure}
|
|
Me@0
|
512
|
|
Me@0
|
513 {\bf Relating time-lines to each other:\ }Figure \ref{figTimeMapping}
|
|
Me@0
|
514 illustrates how trace-segments relate to suspend points, and map onto
|
|
Me@0
|
515 virtual time. A trace segment starts in virtual time where suspend is ended, as seen.
|
|
Me@0
|
516 In fact, the two trace segments shown have a common
|
|
Me@0
|
517 start-point within virtual time. This is because the parallelism semantics
|
|
Me@0
|
518 chose to start them at the same point -- this is what a tie
|
|
Me@0
|
519 point is. A key note is that the lengths in virtual time have no relation to
|
|
Me@0
|
520 the lengths in program-time. The only defined feature is that those two
|
|
Me@0
|
521 trace-segments have a common start-point in virtual time. This means that the
|
|
Me@0
|
522 two suspend points are considered to be tied together.
|
|
Me@0
|
523
|
|
Me@0
|
524 {\bf The Key VMS guarantee:\ }\label{VMSguarantee}Being tied together
|
|
Me@0
|
525 means that all physical events that can be observed by both program-times are
|
|
Me@0
|
526 covered by the key VMS guarantee: any events triggered before the common
|
|
Me@0
|
527 suspend point in one program time are guaranteed visible in the other program
|
|
Me@0
|
528 time after the common suspend point. They {\em{might}} be visible before,
|
|
Me@0
|
529 but it's not guaranteed. In addition, events triggered after the common
|
|
Me@0
|
530 suspend point in one are guaranteed not visible before the common suspend
|
|
Me@0
|
531 point in the other. {\em{This two-part guarantee can be considered the heart
|
|
Me@0
|
532 of VMS.}}
|
|
Me@0
|
533
|
|
Me@0
|
534 Intuitively, a tie-point separates before it from after such that tied program
|
|
Me@0
|
535 times agree (illustrated with code in Section \ref{secApp} Figure \ref{figAnimVP}). But the subtlety is
|
|
Me@0
|
536 that events triggered before the tie-point, {\em{might}} be visible to the
|
|
Me@0
|
537 other before, and ones triggered after {\em{might not}} be visible to the
|
|
Me@0
|
538 other after -- physical events triggered before are only guaranteed visible
|
|
Me@0
|
539 {\em{after}} the tie point, and events after are only guaranteed
|
|
Me@0
|
540 {\em{not}} visible {\em{before}} the tie point.
|
|
Me@0
|
541
|
|
Me@0
|
542 This is a form of bounded non-determinism. The pattern of suspension
|
|
Me@0
|
543 end-points determines which trace-segments overlap in Virtual time, and events
|
|
Me@0
|
544 triggered in one might be visible in overlapped ones. But no guarantees cover
|
|
Me@0
|
545 these. If one segment tries to observe, it will see events triggered by
|
|
Me@0
|
546 overlapped segments in non-deterministic order.
|
|
Me@0
|
547
|
|
Me@0
|
548 The VMS implementation defines which physical events are covered by the key
|
|
Me@0
|
549 VMS guarantee (reads/writes, network communication, DMA, I/O).
|
|
Me@0
|
550
|
|
Me@0
|
551 {\bf Globally consistent sequential order:\ }VMS maps suspend-start,
|
|
Me@0
|
552 suspend-end, and hence tie-points, to a globally-consistent sequential order
|
|
Me@0
|
553 in Virtual time. This enables one of VMS's key benefits: sequential
|
|
Me@0
|
554 algorithms for parallel constructs.
|
|
Me@0
|
555
|
|
Me@0
|
556 Tie points define parallel behavior, so an implementation of how to choose tie
|
|
Me@0
|
557 points equals an implementation of parallel constructs. The Master chooses
|
|
Me@0
|
558 tie-points, so plugging code to choose tie-points into the Master equals
|
|
Me@0
|
559 plugging in parallel constructs.
|
|
Me@0
|
560
|
|
Me@0
|
561 \subsection{Scheduling State}
|
|
Me@0
|
562
|
|
Me@0
|
563 Scheduling state is used in VMS to organize internal activity, for enforcing
|
|
Me@0
|
564 the guarantees.
|
|
Me@0
|
565
|
|
Me@0
|
566 \texttt{- }VPs have three scheduling states: {\em{Animated}},
|
|
Me@0
|
567 {\em{Blocked}}, {\em{Ready}} (Figure \ref{figStates}).
|
|
Me@0
|
568
|
|
Me@0
|
569 \texttt{- }VPs in Animated are {\em{allowed}} to advance
|
|
Me@0
|
570 program-time with {\em{local}} physical time.
|
|
Me@0
|
571
|
|
Me@0
|
572 \texttt{- }VPs in Blocked and Ready do not advance their
|
|
Me@0
|
573 program-time.
|
|
Me@0
|
574
|
|
Me@0
|
575 \texttt{- }Animated has two physical states:
|
|
Me@0
|
576 {\em{Progressing}} and {\em{Stalled}}.
|
|
Me@0
|
577
|
|
Me@0
|
578 \texttt{- }VPs in Progressing advance program-time with local
|
|
Me@0
|
579 physical time, those in Stalled do not (allowing non-semantic suspend).
|
|
Me@0
|
580
|
|
Me@0
|
581 \texttt{- }Scheduling states are defined in Virtual time only.
|
|
Me@0
|
582
|
|
Me@0
|
583 \texttt{- }Progressing and Stalled are defined in local
|
|
Me@0
|
584 physical time only (invisible in Virtual).
|
|
Me@0
|
585
|
|
Me@0
|
586
|
|
Me@0
|
587 \begin{figure}[h]
|
|
Me@0
|
588 \includegraphics{../figures/Scheduling_states_2.png}
|
|
Me@0
|
589 \caption{Scheduling states of a slave VP in the VMS model.}
|
|
Me@0
|
590 \label{figStates}
|
|
Me@0
|
591 \end{figure}
|
|
Me@0
|
592
|
|
Me@0
|
593 Some important points: 1) only VPs Animated can trigger physical events that
|
|
Me@0
|
594 are seen in other program time-lines; 2) the distinction between Blocked vs
|
|
Me@0
|
595 Stalled is that a VP has to explicitly execute a VMS primitive operation to
|
|
Me@0
|
596 enter Blocked, making it part of the semantics of parallelism constructs. In
|
|
Me@0
|
597 contrast, Stalled happens invisibly, with no effect on semantic behavior. It
|
|
Me@0
|
598 is due to hardware events hidden inside VMS, such as interrupts.
|
|
Me@0
|
599
|
|
Me@0
|
600 The Ready state is used to separate the parallelism-construct behavior from
|
|
Me@0
|
601 the scheduling behavior. It acts as a ``staging area'' for scheduling. VPs
|
|
Me@0
|
602 placed into this state are {\em{allowed}} to be animated, then the scheduler
|
|
Me@0
|
603 decides when and where.
|
|
Me@0
|
604
|
|
Me@0
|
605 A subtle but illustrative point is that actions {\em{outside}} a given
|
|
Me@0
|
606 program time cause the VP to transition Blocked$\rightarrow$Ready, which
|
|
Me@0
|
607 contrasts to lock algorithms like spin-locks or Dijkstra's, where the
|
|
Me@0
|
608 concurrency-related behavior takes place {\em{inside}} program time.
|
|
Me@0
|
609
|
|
Me@0
|
610 {\bf Transition Between Slave Scheduling States:\ }
|
|
Me@0
|
611
|
|
Me@0
|
612 \texttt{- }VPs transition states as shown in Figure \ref{figStates}.
|
|
Me@0
|
613
|
|
Me@0
|
614 \texttt{- }Animated$\rightarrow$Blocked is caused by a Slave VP
|
|
Me@0
|
615 executing the Suspend VMS primitive.
|
|
Me@0
|
616
|
|
Me@0
|
617 \texttt{- }Blocked$\rightarrow$Ready is determined by the
|
|
Me@0
|
618 semantics implemented in the plugin.
|
|
Me@0
|
619
|
|
Me@0
|
620 \texttt{- }Ready$\rightarrow$Animated is determined by the
|
|
Me@0
|
621 scheduler in the plugin.
|
|
Me@0
|
622
|
|
Me@0
|
623 \texttt{- }Transitions in scheduling state have a globally
|
|
Me@0
|
624 consistent order in Virtual time.
|
|
Me@0
|
625
|
|
Me@0
|
626
|
|
Me@0
|
627 The parallelism primitives executed by a program do not control change in
|
|
Me@0
|
628 scheduling states. They merely communicate messages to the Master, via a VMS
|
|
Me@0
|
629 supplied primitive. Inside the Master, the plugin's parallelism construct
|
|
Me@0
|
630 implementation processes the message. Based on that, it performs changes in
|
|
Me@0
|
631 state from Blocked$\rightarrow$Ready, creates new VPs, and dissipates existing
|
|
Me@0
|
632 VPs. Most communication from Slave to Master requires the VP to suspend when
|
|
Me@0
|
633 it sends the message. A few messages, like creating new Slave may be sent
|
|
Me@0
|
634 without suspending.
|
|
Me@0
|
635
|
|
Me@0
|
636 The suspend primitive decouples local physical time from Virtual time.
|
|
Me@0
|
637 Execution causes immediate transition to Stalled in physical time, then the
|
|
Me@0
|
638 Master performs Animated$\rightarrow$Blocked, fixing that transition in
|
|
Me@0
|
639 Virtual time. The only relationship is causality. This weak relation is what
|
|
Me@0
|
640 allows suspension-points to be serialized in Virtual time, which in turn is
|
|
Me@0
|
641 what allows using sequential algorithms to implement parallelism constructs.
|
|
Me@0
|
642
|
|
Me@0
|
643
|
|
Me@0
|
644 \subsection{Plugins}
|
|
Me@0
|
645
|
|
Me@0
|
646 \begin{figure}[ht]
|
|
Me@0
|
647 \includegraphics{../figures/VMS-core__plugins.png}
|
|
Me@0
|
648 \caption
|
|
Me@0
|
649 {
|
|
Me@0
|
650 The Master has been split into a generic core and a language-specific plug-in.
|
|
Me@0
|
651 The core encapsulates the hardware and remains the same across applications.
|
|
Me@0
|
652 The plug-in is part of the parallelism-construct implementation. It is
|
|
Me@0
|
653 loaded separately onto the hardware and linked to the application when run.
|
|
Me@0
|
654 }
|
|
Me@0
|
655 \label{figMasterSplit}
|
|
Me@0
|
656 \end{figure}
|
|
Me@0
|
657
|
|
Me@0
|
658
|
|
Me@0
|
659 The Master entity has two parts, a generic core part and a plugin (Figure \ref{figMasterSplit}).
|
|
Me@0
|
660 The core part of the Master is implemented as part of
|
|
Me@0
|
661 VMS-core. The plug-in supplies two functions: the communication-handler and
|
|
Me@0
|
662 the scheduler, both having a standard prototype. The communication-handler
|
|
Me@0
|
663 implements the parallelism constructs, while scheduler assigns VPs to cores.
|
|
Me@0
|
664
|
|
Me@0
|
665 An {\em{instance}} of a plugin is created as part of initializing an
|
|
Me@0
|
666 application, and the instance holds the semantic and scheduling state for that
|
|
Me@0
|
667 run of the application. This state, combined with the virtual processor states
|
|
Me@0
|
668 of the slaves created during that application run, represents progress of the
|
|
Me@0
|
669 work of the application. \ For example, multi-tasking is performed simply by
|
|
Me@0
|
670 the Master switching among plug-in instances when it has a resource to offer
|
|
Me@0
|
671 to a scheduler. The parallelism-semantic state holds all information needed to
|
|
Me@0
|
672 resume (hardware state, such as TLB and cache-tags is inside VMS-core).
|
|
Me@0
|
673
|
|
Me@0
|
674
|
|
Me@0
|
675
|
|
Me@0
|
676 \section{Internal Workings of Our Implementation}
|
|
Me@0
|
677 \label{secInternal}
|
|
Me@0
|
678
|
|
Me@0
|
679 We name the elements of our example implementation and describe their logical
|
|
Me@0
|
680 function, then relate them to the abstract model. We then step through the
|
|
Me@0
|
681 operation of the elements.
|
|
Me@0
|
682
|
|
Me@0
|
683 {\bf Elements and Their Logical Function:\ }As illustrated in Figure \ref{figInternals},
|
|
Me@0
|
684 our VMS implementation is organized around physical cores.
|
|
Me@0
|
685 Each core has its own {\em{master
|
|
Me@0
|
686 virtual-processor}}, \texttt{masterVP}, and a {\em{physical-core controller}}, which communicate via a set of
|
|
Me@0
|
687 scheduling slots, \texttt{schedSlot}. The Master in the abstract definition
|
|
Me@0
|
688 is implemented by the multiple \texttt{masterVP}s plus a particular plugin
|
|
Me@0
|
689 instance with its shared parallelism-semantic state (seen at the top).
|
|
Me@0
|
690
|
|
Me@0
|
691 On a given core, only one of: the core-controller, \texttt{masterVP}, or a
|
|
Me@0
|
692 slave VP, is animated at any point in local physical time. Each
|
|
Me@0
|
693 \texttt{masterVP} animates the same function, called
|
|
Me@0
|
694 \texttt{master\_loop}, and each slave VP animates a function from the
|
|
Me@0
|
695 application, starting with the top-level function the slave is created with,
|
|
Me@0
|
696 and following its call sequence. The core controller is implemented here as a
|
|
Me@0
|
697 Linux pthread that runs the \texttt{core\_loop} function.
|
|
Me@0
|
698
|
|
Me@0
|
699 Switching between VPs is done by executing a VMS primitive that suspends the
|
|
Me@0
|
700 VP. This switches the physical core over to the controller, by jumping to the
|
|
Me@0
|
701 start of the \texttt{core\_loop} function, which chooses the next VP and
|
|
Me@0
|
702 switches to that (switching is detailed in Section \ref{secApp} Figure \ref{figAssembly}).
|
|
Me@0
|
703
|
|
Me@0
|
704 {\bf Relation to Abstract Model:\ }We chose to implement the Master entity
|
|
Me@0
|
705 of the model by a set of \texttt{masterVP}s, plus plug-in functions and
|
|
Me@0
|
706 shared parallelism-semantic state. What we call VMS-core consists of this
|
|
Me@0
|
707 implementation of the Master, plus the core-controllers, plus the VMS
|
|
Me@0
|
708 primitive libraries, for creating new VPs and dissipating existing VPs, suspending VPs,
|
|
Me@0
|
709 and communicating from slave VP to Master. In Figure \ref{figInternals},
|
|
Me@0
|
710 everything in green is part of VMS-core, while the plugin is in red, and
|
|
Me@0
|
711 application code appears as blue, inside the slave VP.
|
|
Me@0
|
712
|
|
Me@0
|
713 Virtual time in the model is implemented via a combination of four things: a
|
|
Me@0
|
714 \texttt{masterLock} (not shown) that guarantees non-overlap of
|
|
Me@0
|
715 \texttt{masterVP} trace-segments; the \texttt{master\_loop} which performs
|
|
Me@0
|
716 transition Animated$\rightarrow$Blocked; the \texttt{comm\_handler\_fn}
|
|
Me@0
|
717 which performs Blocked$\rightarrow$Ready and the \texttt{scheduler\_fn}
|
|
Me@0
|
718 which performs Ready$\rightarrow$Animated. \ Each state transition is one step
|
|
Me@0
|
719 of Virtual time; is guaranteed sequential by the non-overlap of
|
|
Me@0
|
720 \texttt{masterVP} trace segments; and is global due to being in
|
|
Me@0
|
721 parallelism-semantic state that is shared (top of Figure \ref{figInternals}).
|
|
Me@0
|
722
|
|
Me@0
|
723 Transitions Progressing$\rightleftarrows$Stalled within the Animated state are
|
|
Me@0
|
724 invisible to the parallelism semantics, the Master, and Virtual time, and so
|
|
Me@0
|
725 have no effect on the elements seen.
|
|
Me@0
|
726
|
|
Me@0
|
727
|
|
Me@0
|
728 \begin{figure*}[!t]
|
|
Me@0
|
729 \includegraphics[width=5in]{../figures/VMS-core__internal_workings.png}
|
|
Me@0
|
730 \caption
|
|
Me@0
|
731 { Internal elements of our example VMS implementation
|
|
Me@0
|
732 }
|
|
Me@0
|
733 \label{figInternals}
|
|
Me@0
|
734 \end{figure*}
|
|
Me@0
|
735
|
|
Me@0
|
736 {\bf Steps of Operation:\ }The steps of operation are numbered, in Figure \ref{figInternals}.
|
|
Me@0
|
737 Taking them in order, 1) \texttt{master\_loop} scans
|
|
Me@0
|
738 the scheduling slots to see which ones' slaves have suspended since the
|
|
Me@0
|
739 previous scan. \ 2) It hands these to the \texttt{comm\_handler\_fn} plugged
|
|
Me@0
|
740 in (which equals transition Animated$\rightarrow$Blocked). \ 3) The VP has a
|
|
Me@0
|
741 request attached, and data in it causes \ the \texttt{comm\_handler\_fn}
|
|
Me@0
|
742 to manipulate data structures in the shared parallelism-semantic state. \
|
|
Me@0
|
743 These structures hold all the slaves in the blocked state (code-level detail
|
|
Me@0
|
744 in Figure \ref{figReqHdlr}, Section \ref{secApp}). \ 4) Some requests cause slaves to be moved to a
|
|
Me@0
|
745 \texttt{readyQ} on one of the cores (Blocked$\rightarrow$Ready). Which
|
|
Me@0
|
746 core's \texttt{readyQ} receives the slave is under plugin control,
|
|
Me@0
|
747 determined by a combination of request contents, semantic state and physical
|
|
Me@0
|
748 machine state. 5) During the scan, the \texttt{master\_loop} also looks for
|
|
Me@0
|
749 empty slots, and for each calls the \texttt{scheduler\_fn} plugged in. It
|
|
Me@0
|
750 chooses a slave from the \texttt{readyQ} on the core animating
|
|
Me@0
|
751 \texttt{master\_loop}. \ 6) The \texttt{master\_loop} then places the
|
|
Me@0
|
752 slave VP's pointer into the scheduling slot (Ready$\rightarrow$Animated),
|
|
Me@0
|
753 making it available to the \texttt{core\_loop}. 7) When done with the scan,
|
|
Me@0
|
754 \texttt{masterVP} suspends, switching animation back to the
|
|
Me@0
|
755 \texttt{core\_loop}. \ 8) \texttt{core\_loop} takes slave VPs out of the
|
|
Me@0
|
756 slots, then 9) switches animation to them. \ 10) When a slave self-suspends,
|
|
Me@0
|
757 animation returns to the \texttt{core\_loop} (detail in code in Figure 9),
|
|
Me@0
|
758 which picks another, until 11) all slots are empty and the
|
|
Me@0
|
759 \texttt{core\_loop} switches animation to the \texttt{masterVP}.
|
|
Me@0
|
760
|
|
Me@0
|
761 {\bf Enabling sequential implementation of parallelism semantics:\ }All of
|
|
Me@0
|
762 that happens on each core separately, but in this particular implementation we
|
|
Me@0
|
763 use a central \texttt{masterLock} to ensure that only one core's
|
|
Me@0
|
764 \texttt{masterVP} can be active at any time. This guarantees non-overlap
|
|
Me@0
|
765 of trace-segments from different \texttt{masterVP}s, allowing the plugins to
|
|
Me@0
|
766 use sequential algorithms, without a performance penalty, as verified in
|
|
Me@0
|
767 Section \ref{secResults}.
|
|
Me@0
|
768
|
|
Me@0
|
769 Relating this to the abstract model: the parallelism-semantic behavior of the
|
|
Me@0
|
770 Master is implemented by the communication handler, in the plugin. It thus
|
|
Me@0
|
771 runs in the Master time referred to, in the model, in Section \ref{secAbsModel}. Requests are
|
|
Me@0
|
772 sent to the Master by self-suspension of the slaves, but sit idle until the
|
|
Me@0
|
773 other slaves in the scheduling slots have also run. This is the passive
|
|
Me@0
|
774 behavior of requests that was noted in Section \ref{secAbsModel}, which allows the
|
|
Me@0
|
775 \texttt{masterVP}s to remain suspended until needed. This in turn enables
|
|
Me@0
|
776 the \texttt{masterVP}s from different cores to be non-overlapped. It is the
|
|
Me@0
|
777 non-overlap that enables the algorithms for the parallelism semantics to be
|
|
Me@0
|
778 sequential.
|
|
Me@0
|
779
|
|
Me@0
|
780
|
|
Me@0
|
781
|
|
Me@0
|
782
|
|
Me@0
|
783 \section{Code Level View}
|
|
Me@0
|
784 \label{secApp}
|
|
Me@0
|
785
|
|
Me@0
|
786 To relate the abstract model and the internal elements to application code and
|
|
Me@0
|
787 parallelism-library code, we give code snippets that illustrate key features.
|
|
Me@0
|
788 We start with the application then work down through the sequence of calls,
|
|
Me@0
|
789 to the plugin, using our SSR {\cite{VMSHome}} parallelism-library as an
|
|
Me@0
|
790 example.
|
|
Me@0
|
791
|
|
Me@0
|
792 In general, applications are either written in terms of a full custom language
|
|
Me@0
|
793 that has its own syntax, or a base language with a parallelism library, which
|
|
Me@0
|
794 is often called an {\em{embedded language}}. Our demonstrators, VCilk
|
|
Me@0
|
795 {\cite{VMSHome}}, Vthread, and SSR, are all parallelism libraries. A full
|
|
Me@0
|
796 custom language would follow the standard practice of performing
|
|
Me@0
|
797 source-to-source transform, from custom syntax into C plus parallelism-library
|
|
Me@0
|
798 calls.
|
|
Me@0
|
799
|
|
Me@0
|
800 {\bf SSR:\ }SSR stands for Synchronous Send-Receive, and details of its
|
|
Me@0
|
801 calls and internal implementation will be given throughout this section. It
|
|
Me@0
|
802 has two types of construct. The first, called {\em{from-to}} has two calls:
|
|
Me@0
|
803 \texttt{SSR\_send\_from\_to} and \texttt{SSR\_receive\_from\_to}, both of
|
|
Me@0
|
804 which specify the sending VP as well as the receiving VP. \ The other, called
|
|
Me@0
|
805 {\em{of-type}} also has two calls: \texttt{SSR\_\_send\_of\_type\_to} and
|
|
Me@0
|
806 \texttt{SSR\_\_receive\_of\_type}, which allow a receiver to accept from
|
|
Me@0
|
807 anonymous senders, but select according to type of message.
|
|
Me@0
|
808
|
|
Me@0
|
809
|
|
Me@0
|
810 % An example of a double column floating figure using two subfigures.
|
|
Me@0
|
811 % (The subfig.sty package must be loaded for this to work.)
|
|
Me@0
|
812 % The subfigure \label commands are set within each subfloat command, the
|
|
Me@0
|
813 % \label for the overall figure must come after \caption.
|
|
Me@0
|
814 % \hfil must be used as a separator to get equal spacing.
|
|
Me@0
|
815 % The subfigure.sty package works much the same way, except \subfigure is
|
|
Me@0
|
816 % used instead of \subfloat.
|
|
Me@0
|
817 %
|
|
Me@0
|
818 %\begin{figure*}[!t]
|
|
Me@0
|
819 %\centerline{\subfloat[Case I]\includegraphics[width=2.5in]{subfigcase1}%
|
|
Me@0
|
820 %\label{fig_first_case}}
|
|
Me@0
|
821 %\hfil
|
|
Me@0
|
822 %\subfloat[Case II]{\includegraphics[width=2.5in]{subfigcase2}%
|
|
Me@0
|
823 %\label{fig_second_case}}}
|
|
Me@0
|
824 %\caption{Simulation results}
|
|
Me@0
|
825 %\label{fig_sim}
|
|
Me@0
|
826 %\end{figure*}
|
|
Me@0
|
827 %
|
|
Me@0
|
828 % Note that often IEEE papers with subfigures do not employ subfigure
|
|
Me@0
|
829 % captions (using the optional argument to \subfloat), but instead will
|
|
Me@0
|
830 % reference/describe all of them (a), (b), etc., within the main caption.
|
|
Me@0
|
831
|
|
Me@0
|
832
|
|
Me@0
|
833 {\bf Application View:\ } Figure \ref{figAnimVP} shows snippets of application code, which use the SSR parallelism
|
|
Me@0
|
834 library. The most important feature is that all calls take a pointer to the
|
|
Me@0
|
835 VP that is animating the call. This is seen at the top of the figure where slave VP creation takes a pointer
|
|
Me@0
|
836 to the VP asking for creation. Below that is the standard prototype for top level functions,
|
|
Me@0
|
837 showing that the function receives a pointer to the VP it is the top level function for.
|
|
Me@0
|
838
|
|
Me@0
|
839 The pointer is placed on the stack by VMS when it creates the VP, and is the means by
|
|
Me@0
|
840 which the application comes into possession of the pointer. This animating VP is
|
|
Me@0
|
841 passed to all library calls made from there. For example, the bottom shows a pointer to the
|
|
Me@0
|
842 animating VP placed in the position of sender in
|
|
Me@0
|
843 the \texttt{send} construct call. Correspondingly, for the \texttt{receive} construct,
|
|
Me@0
|
844 the position of receiving VP is filled by the VP animating the call.
|
|
Me@0
|
845
|
|
Me@0
|
846 \begin{figure}[ht]
|
|
Me@0
|
847 {\noindent
|
|
Me@0
|
848 {\scriptsize
|
|
Me@0
|
849 {\small Creating a new processor:}
|
|
Me@0
|
850 \begin{verbatim}
|
|
Me@0
|
851 newProcessor = SSR__create_procr( &top_VP_fn,
|
|
Me@0
|
852 paramsPtr, animatingVP ); \end{verbatim}
|
|
Me@0
|
853
|
|
Me@0
|
854 {\small prototype for the top level function:}
|
|
Me@0
|
855 \begin{verbatim}
|
|
Me@0
|
856 top_VP_fn( void *parameterStrucPtr, VirtProcr
|
|
Me@0
|
857 *animatingVP ); \end{verbatim}
|
|
Me@0
|
858
|
|
Me@0
|
859 {\small handing animating VP to parallelism constructs:}
|
|
Me@0
|
860 \begin{verbatim}
|
|
Me@0
|
861 SSR__send_from_to( messagePtr, animatingVP,
|
|
Me@0
|
862 receivingVP );
|
|
Me@0
|
863 messagePtr = SSR__receive_from_to( sendingVP,
|
|
Me@0
|
864 animatingVP ); \end{verbatim}
|
|
Me@0
|
865 }
|
|
Me@0
|
866 }
|
|
Me@0
|
867 \caption
|
|
Me@0
|
868 {
|
|
Me@0
|
869 Application code snippets showing that all calls to the parallelism library
|
|
Me@0
|
870 take the VP animating that call as a parameter.
|
|
Me@0
|
871 }
|
|
Me@0
|
872 \label{figAnimVP}
|
|
Me@0
|
873 \end{figure}
|
|
Me@0
|
874
|
|
Me@0
|
875 Relating these to the internal elements of our implementation, the
|
|
Me@0
|
876 \texttt{animatingVP} suspends inside each of these
|
|
Me@0
|
877 calls, passing a request (generated in the library) to one of the \texttt{masterVP}s.
|
|
Me@0
|
878 The \texttt{masterVP} then calls the \texttt{comm-handler}
|
|
Me@0
|
879 plugin, and so on, as described in Section \ref{secInternal}.
|
|
Me@0
|
880
|
|
Me@0
|
881 For the \texttt{SSR\_\_create\_processor} call, the comm-handler
|
|
Me@0
|
882 in turn calls a VMS primitive to perform the creation.
|
|
Me@0
|
883 The primitive places a pointer to the newly created VP onto its stack, so that when
|
|
Me@0
|
884 \texttt{top\_VP\_fn} is later animated, it sees the VP-pointer as a
|
|
Me@0
|
885 parameter passed to it. \ All
|
|
Me@0
|
886 application code is either such a top-level function, or has one at the root
|
|
Me@0
|
887 of the call-stack.
|
|
Me@0
|
888
|
|
Me@0
|
889 The send and receive calls both suspend their animating VP. When both have
|
|
Me@0
|
890 been called, the communication handler pairs them up and resumes both. This
|
|
Me@0
|
891 ties time-lines together, invoking the VMS guarantee. Both
|
|
Me@0
|
892 application-functions know, because of the VMS guarantee (Section \ref{secAbsModel}), that
|
|
Me@0
|
893 writes to shared variables made before the send call by the sender are visible
|
|
Me@0
|
894 to the receiver after the receive call. This is the programmer's view of tying
|
|
Me@0
|
895 together the local time-lines of two different VPs, as defined in Section \ref{secAbsModel}.
|
|
Me@0
|
896
|
|
Me@0
|
897
|
|
Me@0
|
898 {\bf Concurrency-Library View:\ }A parallelism library function, in
|
|
Me@0
|
899 general, only creates a request, sends it, and returns, as seen below. To
|
|
Me@0
|
900 send a request, it uses the combined request-and-suspend VMS primitive that
|
|
Me@0
|
901 attaches the request then suspends the VP. The primitive requires the
|
|
Me@0
|
902 pointer to the VP, to attach the request and to suspend it.
|
|
Me@0
|
903
|
|
Me@0
|
904
|
|
Me@0
|
905 \begin{figure}[ht]
|
|
Me@0
|
906 {\noindent
|
|
Me@0
|
907 {\scriptsize
|
|
Me@0
|
908 \begin{verbatim}
|
|
Me@0
|
909 void * SSR__receive_from_to( VirtProcr *sendVP,
|
|
Me@0
|
910 VirtProcr *receiveVP )
|
|
Me@0
|
911 { SSRSemReq reqData;
|
|
Me@0
|
912 reqData.receiveVP = receiveVP;
|
|
Me@0
|
913 reqData.sendVP = sendVP;
|
|
Me@0
|
914 reqData.reqType = receive_from_to;
|
|
Me@0
|
915 VMS__send_sem_request( &reqData, receiveVP );
|
|
Me@0
|
916 return receiveVP->dataReturnedFromRequest;
|
|
Me@0
|
917 } \end{verbatim}
|
|
Me@0
|
918 }
|
|
Me@0
|
919 }
|
|
Me@0
|
920 \caption{Implementation of SSR's receive\_from\_to library function.}
|
|
Me@0
|
921 \label{figImplLib}
|
|
Me@0
|
922 \end{figure}
|
|
Me@0
|
923
|
|
Me@0
|
924
|
|
Me@0
|
925 In Figure \ref{figImplLib}, notice that the request's data is on the stack of the virtual
|
|
Me@0
|
926 processor that's animating the call, which is the \texttt{receiveVP}. The
|
|
Me@0
|
927 \texttt{VMS\_\_send\_sem\_request} suspends this VP, which changes the
|
|
Me@0
|
928 physical core's stack pointer to a different stack. So the request data is
|
|
Me@0
|
929 guaranteed to remain undisturbed while the VP is suspended.
|
|
Me@0
|
930
|
|
Me@0
|
931 Figure \ref{figAssembly} shows the implementation of the VMS suspend primitive. As seen in
|
|
Me@0
|
932 Figure \ref{figInternals}, suspending the \texttt{receiveVP} involves
|
|
Me@0
|
933 switching to the \texttt{core\_loop}. In our implementation, this is done by
|
|
Me@0
|
934 switching to the stack of the pthread pinned to the physical core and then
|
|
Me@0
|
935 jumping to the start-point of \texttt{core\_loop}.
|
|
Me@0
|
936
|
|
Me@0
|
937 This code uses standard techniques commonly employed in co-routine
|
|
Me@0
|
938 implementations. Tuning effort spent in \texttt{core\_loop} is inherited by
|
|
Me@0
|
939 all applications.
|
|
Me@0
|
940
|
|
Me@0
|
941
|
|
Me@0
|
942 \begin{figure}[ht]
|
|
Me@0
|
943 {\noindent
|
|
Me@0
|
944 {\scriptsize
|
|
Me@0
|
945 \begin{verbatim}
|
|
Me@0
|
946 VMS__suspend_procr( VirtProcr *animatingVP )
|
|
Me@0
|
947 { animatingVP->resumeInstrAddr = &&ResumePt;
|
|
Me@0
|
948 //GCC takes addr of label
|
|
Me@0
|
949 animatingVP->schedSlotAssignedTo->
|
|
Me@0
|
950 isNewlySuspended = TRUE;
|
|
Me@0
|
951 //for master_loop to see
|
|
Me@0
|
952 <assembly code stores current physical core's
|
|
Me@0
|
953 stack reg into animatingVP struct>
|
|
Me@0
|
954 <assembly code loads stack reg with core_loop
|
|
Me@0
|
955 stackPtr, which was saved into animatingVP>
|
|
Me@0
|
956 <assembly code jmps to core_loop start instr
|
|
Me@0
|
957 addr, which was also saved into animatingVP>
|
|
Me@0
|
958 ResumePt:
|
|
Me@0
|
959 return;
|
|
Me@0
|
960 } \end{verbatim}
|
|
Me@0
|
961 }
|
|
Me@0
|
962 }
|
|
Me@0
|
963 \caption
|
|
Me@0
|
964 {Implementation of VMS suspend processor.
|
|
Me@0
|
965 Re-animating the virtual processor reverses this sequence. \ It saves the
|
|
Me@0
|
966 \texttt{core\_loop}'s resume instr-addr and stack ptr into the VP structure,
|
|
Me@0
|
967 then loads the VP's stack ptr and jmps to its \texttt{resumeInstrAddr}.
|
|
Me@0
|
968 }
|
|
Me@0
|
969 \label{figAssembly}
|
|
Me@0
|
970 \end{figure}
|
|
Me@0
|
971
|
|
Me@0
|
972
|
|
Me@0
|
973 {\bf Plugin View:\ }SSR's communication handler dispatches on the
|
|
Me@0
|
974 \texttt{reqType} field of the request data, as set by the
|
|
Me@0
|
975 \texttt{SSR\_\_receive\_from\_to} code. It calls the handler code in
|
|
Me@0
|
976 Figure \ref{figReqHdlr}. This constructs a hash-key, by concatenating the from-VP's pointer
|
|
Me@0
|
977 with the to-VP's pointer. Then it looks-up that key in the hash-table that SSR uses
|
|
Me@0
|
978 to match sends with receives, which is in the shared semantic state seen at
|
|
Me@0
|
979 the top of Figure \ref{figInternals} in Section \ref{secInternal}.
|
|
Me@0
|
980
|
|
Me@0
|
981 The most important feature in Figure \ref{figReqHdlr} is that both send and receive
|
|
Me@0
|
982 will construct the same key, so will find the same
|
|
Me@0
|
983 hash entry. Whichever request is handled first in Virtual time will see the
|
|
Me@0
|
984 hash entry empty, and save itself in that entry. The second to arrive
|
|
Me@0
|
985 sees the waiting request and then resumes both VPs, by putting them into their
|
|
Me@0
|
986 \texttt{readyQ}s.
|
|
Me@0
|
987
|
|
Me@0
|
988 Access to the shared hash
|
|
Me@0
|
989 table can be considered private, as in a sequential algorithm. This is because
|
|
Me@0
|
990 our VMS-core implementation ensures that only
|
|
Me@0
|
991 one handler on one core is executing at a time.
|
|
Me@0
|
992
|
|
Me@0
|
993
|
|
Me@0
|
994 \begin{figure}[ht]
|
|
Me@0
|
995 {\noindent
|
|
Me@0
|
996 {\scriptsize
|
|
Me@0
|
997 \begin{verbatim}
|
|
Me@0
|
998 handle_receive_from_to( VirtProcr *requestingVP,
|
|
Me@0
|
999 SSRSemReq *reqData, SSRSemEnv *semEnv )
|
|
Me@0
|
1000 { commHashTbl = semEnv->communicatingVPHashTable;
|
|
Me@0
|
1001 key[0] = reqData->receiveVP; key[1] =
|
|
Me@0
|
1002 reqData->sendVP; //send uses same key
|
|
Me@0
|
1003 waitingReqData = lookup_and_remove( key,
|
|
Me@0
|
1004 commHashTbl ); //get waiting request
|
|
Me@0
|
1005 if( waitingReqData != NULL )
|
|
Me@0
|
1006 { resume_virt_procr( waitingReqData->sendVP );
|
|
Me@0
|
1007 resume_virt_procr( waitingReqData->
|
|
Me@0
|
1008 receiveVP );
|
|
Me@0
|
1009 }
|
|
Me@0
|
1010 else
|
|
Me@0
|
1011 insert( key, reqData, commHashTbl );
|
|
Me@0
|
1012 //receive is first to arrive, make it wait
|
|
Me@0
|
1013 } \end{verbatim}
|
|
Me@0
|
1014 }
|
|
Me@0
|
1015 }
|
|
Me@0
|
1016 \caption
|
|
Me@0
|
1017 {Pseudo-code of communication-handler for
|
|
Me@0
|
1018 \texttt{receive\_from\_to} request type. The \texttt{semEnv} is a pointer
|
|
Me@0
|
1019 to the shared parallelism-semantic state seen at the top of Figure
|
|
Me@0
|
1020 \ref{figInternals}.
|
|
Me@0
|
1021 }
|
|
Me@0
|
1022 \label{figReqHdlr}
|
|
Me@0
|
1023 \end{figure}
|
|
Me@0
|
1024
|
|
Me@0
|
1025
|
|
Me@0
|
1026
|
|
Me@0
|
1027
|
|
Me@0
|
1028 \section{Results}
|
|
Me@0
|
1029 \label{secResults}
|
|
Me@0
|
1030
|
|
Me@0
|
1031 {\bf Setup:\ }We implemented blocked dense matrix multiply with right
|
|
Me@0
|
1032 sub-matrices copied to transposed form. We ran on a 1 socket by 4 core
|
|
Me@0
|
1033 Core2Quad 2.4Ghz chip.
|
|
Me@0
|
1034
|
|
Me@0
|
1035 {\bf Implementation-Time:\ }As shown in Table 1, time to implement the
|
|
Me@0
|
1036 three parallel libraries averages 2 days each. As an example of productivity,
|
|
Me@0
|
1037 adding nested transactions, parallel singleton, and atomic function-execution
|
|
Me@0
|
1038 to SSR required a single afternoon, totaling less than 100 lines of C code.
|
|
Me@0
|
1039
|
|
Me@0
|
1040
|
|
Me@0
|
1041 \begin{table}[ht]
|
|
Me@0
|
1042 \caption
|
|
Me@0
|
1043 {Person-days to design, code, and
|
|
Me@0
|
1044 test each parallelism library. L.O.C. is lines of (original) C code, excluding libraries and
|
|
Me@0
|
1045 comments.
|
|
Me@0
|
1046 }
|
|
Me@0
|
1047 \label{tabPersonDaysLang}
|
|
Me@0
|
1048 \begin{center}
|
|
Me@0
|
1049 \begin{tabular}{llll}
|
|
Me@0
|
1050 & SSR & Vthread & VCilk\\
|
|
Me@0
|
1051 Design & 4 & 1 & 0.5\\
|
|
Me@0
|
1052 Code & 2 & 0.5 & 0.5\\
|
|
Me@0
|
1053 Test & 1 & 0.5 & 0.5\\
|
|
Me@0
|
1054 L.O.C. & 470 & 290 & 310
|
|
Me@0
|
1055 \end{tabular}
|
|
Me@0
|
1056 \end{center}
|
|
Me@0
|
1057 \end{table}
|
|
Me@0
|
1058
|
|
Me@0
|
1059
|
|
Me@0
|
1060 {\bf Execution Performance:\ }Performance of VMS is seen in Table
|
|
Me@0
|
1061 \ref{tabOverheadCycles}. The code is not optimized, but rather written
|
|
Me@0
|
1062 to be easy to understand and modify. The majority of the plugin time
|
|
Me@0
|
1063 is lost to cache misses because the shared parallelism-semantic state moves
|
|
Me@0
|
1064 between cores on a majority of accesses. Acquisition of the master lock is slow due to the hardware
|
|
Me@0
|
1065 implementing the CAS instruction.
|
|
Me@0
|
1066
|
|
Me@0
|
1067 Existing techniques will likely improve performance, such
|
|
Me@0
|
1068 as localizing semantic data to cores, splitting malloc across the cores,
|
|
Me@0
|
1069 pre-allocating slabs that are recycled, and pre-fetching. However, in many cases, several hundred
|
|
Me@0
|
1070 nano-seconds per task is as optimal as the applications can benefit from.
|
|
Me@0
|
1071
|
|
Me@0
|
1072 \begin{table}[ht]
|
|
Me@0
|
1073 \caption
|
|
Me@0
|
1074 {Cycles of overhead, per scheduled
|
|
Me@0
|
1075 slave. ``comp only'' is perfect memory, ``comp + mem'' is actual cycles.
|
|
Me@0
|
1076 ``Plugin-concur'' only concurrency requests, ``plugin-all'' includes
|
|
Me@0
|
1077 create and malloc requests. Two significant digits due to variability.
|
|
Me@0
|
1078 }
|
|
Me@0
|
1079 \label{tabOverheadCycles}
|
|
Me@0
|
1080 \begin{center}
|
|
Me@0
|
1081 \begin{tabular}{|l|ll|r|r|}
|
|
Me@0
|
1082 \hline
|
|
Me@0
|
1083 & & comp & comp\\
|
|
Me@0
|
1084 & & only & +mem\\
|
|
Me@0
|
1085 VMS Only & \texttt{master\_loop} & 91 & \ 110\\
|
|
Me@0
|
1086 & switch VPs & 77 & \ 130\\
|
|
Me@0
|
1087 & (malloc) & 160 & 2300\\
|
|
Me@0
|
1088 & (create VP) & 540 & 3800\\
|
|
Me@0
|
1089 \hline
|
|
Me@0
|
1090 Language: & & & \\
|
|
Me@0
|
1091 SSR & plugin -- concur & 190 & 540\\
|
|
Me@0
|
1092 & plugin -- all & 530 & 2200\\
|
|
Me@0
|
1093 & lock & & 250\\
|
|
Me@0
|
1094 Vthread & plugin -- concur & 66 & 710\\
|
|
Me@0
|
1095 & plugin -- all & 180 & 1500\\
|
|
Me@0
|
1096 & lock & & 250\\
|
|
Me@0
|
1097 VCilk & plugin -- concur & 65 & 260\\
|
|
Me@0
|
1098 & plugin -- all & 330 & 1800\\
|
|
Me@0
|
1099 & lock & & 250\\
|
|
Me@0
|
1100 \hline
|
|
Me@0
|
1101 \end{tabular}
|
|
Me@0
|
1102 \end{center}
|
|
Me@0
|
1103 \end{table}
|
|
Me@0
|
1104
|
|
Me@0
|
1105
|
|
Me@0
|
1106
|
|
Me@0
|
1107 {\bf Head to Head:\ }We
|
|
Me@0
|
1108 compare our implementation of the \texttt{spawn} and \texttt{sync}
|
|
Me@0
|
1109 constructs against Cilk 5.4, on the top in Table
|
|
Me@0
|
1110 \ref{tabHeadToHead}, which shows that the same application code has similar
|
|
Me@0
|
1111 performance. For large matrices, Cilk 5.4's better use of the memory hierarchy
|
|
Me@0
|
1112 achieves 23\% better performance. However, for small matrices, VCilk is better,
|
|
Me@0
|
1113 with a factor 2 lower overhead. Cilk 5.4 does not allow controlling the number of spawn events it
|
|
Me@0
|
1114 actually executes, and chooses to run smaller matrices sequentially, limiting
|
|
Me@0
|
1115 our comparison.
|
|
Me@0
|
1116
|
|
Me@0
|
1117 When comparing to pthreads, our VMS based implementation has more than an
|
|
Me@0
|
1118 order of magnitude better overhead per invocation of mutex or condition
|
|
Me@0
|
1119 variable functionality, as seen on the bottom of Table \ref{tabHeadToHead}.
|
|
Me@0
|
1120 Applications that inherently have short trace segments will synchronize often
|
|
Me@0
|
1121 and benefit the most from Vthread.
|
|
Me@0
|
1122
|
|
Me@0
|
1123
|
|
Me@0
|
1124
|
|
Me@0
|
1125 \begin{table}[ht]
|
|
Me@0
|
1126 \caption
|
|
Me@0
|
1127 {On top, exe time in seconds for MM. Below, overhead for pthread vs Vthread. First column is cycles
|
|
Me@0
|
1128 for perfect memory and second is total measured cycles. pthread cycles are
|
|
Me@0
|
1129 deduced from round-trip experiments.
|
|
Me@0
|
1130 }
|
|
Me@0
|
1131 \label{tabHeadToHead}
|
|
Me@0
|
1132 \begin{center}
|
|
Me@0
|
1133 \begin{tabular}{|l@{\ }|@{\ }lr|}
|
|
Me@0
|
1134 \hline
|
|
Me@0
|
1135 \rule{0pt}{12pt}
|
|
Me@0
|
1136 Matrix size&Lang.&sec.\\
|
|
Me@0
|
1137 [2pt]\hline
|
|
Me@0
|
1138 81x81 & Cilk & 0.017\\
|
|
Me@0
|
1139 & VCilk & 0.008\\
|
|
Me@0
|
1140 \hline
|
|
Me@0
|
1141 324x324 & Cilk & 0.13\\
|
|
Me@0
|
1142 & VCilk & 0.13\\
|
|
Me@0
|
1143 \hline
|
|
Me@0
|
1144 648x648 & Cilk & 0.71\\
|
|
Me@0
|
1145 & VCilk & 0.85\\
|
|
Me@0
|
1146 \hline
|
|
Me@0
|
1147 1296x1296 & Cilk & 4.8\\
|
|
Me@0
|
1148 & VCilk & 6.2 \\
|
|
Me@0
|
1149 [2pt]\hline
|
|
Me@0
|
1150 \end{tabular}
|
|
Me@0
|
1151 \begin{tabular}{c}
|
|
Me@0
|
1152 \begin{tabular}{|l|rr|r|r|}
|
|
Me@0
|
1153 \hline
|
|
Me@0
|
1154 operation & \multicolumn{2}{c|}{Vthread} & pthread & ratio\\
|
|
Me@0
|
1155 \hline
|
|
Me@0
|
1156 & comp & total & & \\
|
|
Me@0
|
1157 & only & & & \\
|
|
Me@0
|
1158 [2pt]\hline
|
|
Me@0
|
1159 mutex\_lock & 85 & 1050 & 50,000 & 48:1\\
|
|
Me@0
|
1160 mutex\_unlock & 85 & 610 & 45,000 & 74:1\\
|
|
Me@0
|
1161 cond\_wait & 85 & 850 & 60,000 & 71:1\\
|
|
Me@0
|
1162 cond\_signal & 90 & 650 & 60,000 & 92:1\\
|
|
Me@0
|
1163 \hline
|
|
Me@0
|
1164 \end{tabular}\\
|
|
Me@0
|
1165 \end{tabular}
|
|
Me@0
|
1166 \end{center}
|
|
Me@0
|
1167 \end{table}
|
|
Me@0
|
1168
|
|
Me@0
|
1169 \section{Conclusion}\label{secConclusion}
|
|
Me@0
|
1170
|
|
Me@0
|
1171 We have shown an alternative to the Thread model that enables easier-to-use
|
|
Me@0
|
1172 parallelism constructs by splitting
|
|
Me@0
|
1173 the scheduler open, to accept new parallelism constructs in the form of
|
|
Me@0
|
1174 plugins. This gives the language control over assigning virtual
|
|
Me@0
|
1175 processors to physical cores, for performance, debugging, and flexibility
|
|
Me@0
|
1176 benefits. Parallelism constructs of languages are implementable using
|
|
Me@0
|
1177 sequential algorithms, within a matter of days, while maintaining low run-time
|
|
Me@0
|
1178 overhead, on the order of a few hundred nano-seconds per concurrency
|
|
Me@0
|
1179 operation.
|
|
Me@0
|
1180
|
|
Me@0
|
1181 \bibliography{Bib_for_papers}
|
|
Me@0
|
1182
|
|
Me@0
|
1183 \end{document}
|