changeset 23:8a46b7d0621a

future arch -- del extraneous files
author Some Random Person <seanhalle@yahoo.com>
date Thu, 12 Apr 2012 08:55:20 -0700
parents 6166abb29bf4
children 72ba77515c93
files 0__Papers/Future_Architecture/latex/Future_Architecture.aux 0__Papers/Future_Architecture/latex/Future_Architecture.bbl 0__Papers/Future_Architecture/latex/Future_Architecture.blg 0__Papers/Future_Architecture/latex/Future_Architecture.ddf 0__Papers/Future_Architecture/latex/Future_Architecture.tex.Backup
diffstat 4 files changed, 0 insertions(+), 625 deletions(-)
line diff
     1.1 --- a/0__Papers/Future_Architecture/latex/Future_Architecture.aux	Thu Apr 12 08:53:52 2012 -0700
     1.2 +++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.3 @@ -1,31 +0,0 @@
     1.4 -\relax 
     1.5 -\bibstyle{plain}
     1.6 -\@writefile{toc}{\contentsline {section}{\numberline {I}What parallel abstractions should the hardware provide?}{1}}
     1.7 -\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces A special \texttt  {switch} op-code is recognized by the decode stage, and triggers fetch of instructions from firm-ware. The firm-ware instrs are provided to the OS as a ``hardware driver", and implement the runtime behavior of a language. The application communicates to the runtime by placing pointers to data-structures into registers just before executing the ``switch to runtime" instruction, which starts the fetch from firm-ware. Helper instructions accelerate common runtime operations, such as hash-table lookups, communication, search-for-optimum, and so on. }}{1}}
     1.8 -\newlabel{figTimeMapping}{{1}{1}}
     1.9 -\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {I-A}}Soft-extension of instruction-set}{1}}
    1.10 -\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {I-B}}Communications via firm-ware}{2}}
    1.11 -\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {I-C}}Communication via separate helper processors}{2}}
    1.12 -\@writefile{lof}{\contentsline {figure}{\numberline {2}{\ignorespaces Communication is performed between local memory and remote memories via a separate communication processor. This processor executes firm-ware that is loaded under OS control. For example, it may run a standard software-cache or run scatter-gather code extracted from the application. }}{2}}
    1.13 -\newlabel{figCommProcr}{{2}{2}}
    1.14 -\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {I-D}}Speculation and Fast Control Message Support}{2}}
    1.15 -\@writefile{lof}{\contentsline {figure}{\numberline {3}{\ignorespaces Tag memory and tag processing are added to local memory. The tags have an extra field used by tag processing to filter lines. It can generate a list of all tags that match in the extra field, can compare a list coming from the network to tag memory, tell the communication processor which local memory lines match a boolean expression on the extra field, and so on. }}{3}}
    1.16 -\newlabel{figSpecHW}{{3}{3}}
    1.17 -\@writefile{toc}{\contentsline {subsection}{\numberline {\unhbox \voidb@x \hbox {I-E}}Example}{3}}
    1.18 -\@writefile{toc}{\contentsline {paragraph}{\numberline {\unhbox \voidb@x \hbox {I-E}0a}setup and switch}{3}}
    1.19 -\@writefile{toc}{\contentsline {paragraph}{\numberline {\unhbox \voidb@x \hbox {I-E}0b}runtime internals}{3}}
    1.20 -\@writefile{toc}{\contentsline {section}{\numberline {II}Which should be the responsibility / functionality of the programmer, the runtime software, and the hardware?}{4}}
    1.21 -\newlabel{secResponsibility}{{II}{4}}
    1.22 -\@writefile{toc}{\contentsline {section}{\numberline {III}Specific Topics of Interest}{4}}
    1.23 -\@writefile{toc}{\contentsline {paragraph}{\numberline {\unhbox \voidb@x \hbox {III-}0c}enabling future parallel programming models}{4}}
    1.24 -\@writefile{toc}{\contentsline {paragraph}{\numberline {\unhbox \voidb@x \hbox {III-}0d}innovative architectural execution models}{4}}
    1.25 -\@writefile{toc}{\contentsline {paragraph}{\numberline {\unhbox \voidb@x \hbox {III-}0e}novel memory hierarchies}{4}}
    1.26 -\@writefile{toc}{\contentsline {paragraph}{\numberline {\unhbox \voidb@x \hbox {III-}0f}simplified and scalable memory models}{4}}
    1.27 -\@writefile{toc}{\contentsline {paragraph}{\numberline {\unhbox \voidb@x \hbox {III-}0g}high-level constructs for on-chip communications}{4}}
    1.28 -\@writefile{toc}{\contentsline {paragraph}{\numberline {\unhbox \voidb@x \hbox {III-}0h}future directions in programming massively parallel systems}{4}}
    1.29 -\@writefile{toc}{\contentsline {paragraph}{\numberline {\unhbox \voidb@x \hbox {III-}0i}potential bottlenecks for future parallel systems}{5}}
    1.30 -\bibdata{Bib_for_papers}
    1.31 -\@writefile{toc}{\contentsline {section}{\numberline {IV}Conclusion}{6}}
    1.32 -\newlabel{secConclusion}{{IV}{6}}
    1.33 -\@writefile{toc}{\contentsline {section}{References}{6}}
    1.34 -
     2.1 --- a/0__Papers/Future_Architecture/latex/Future_Architecture.bbl	Thu Apr 12 08:53:52 2012 -0700
     2.2 +++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
     2.3 @@ -1,3 +0,0 @@
     2.4 -\begin{thebibliography}{}
     2.5 -
     2.6 -\end{thebibliography}
     3.1 --- a/0__Papers/Future_Architecture/latex/Future_Architecture.blg	Thu Apr 12 08:53:52 2012 -0700
     3.2 +++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
     3.3 @@ -1,22 +0,0 @@
     3.4 -This is 8-bit Big BibTeX version 0.99c
     3.5 -Implementation:  WIN32 Console, BaKoMa TNS bound.
     3.6 -Release version: 3.71 (18 Aug 1996)
     3.7 -
     3.8 -The top-level auxiliary file: Future_Architecture.aux
     3.9 -The style file: plain.bst
    3.10 -I couldn't open database file Bib_for_papers.bib
    3.11 ----line 27 of file Future_Architecture.aux
    3.12 - : \bibdata{Bib_for_papers
    3.13 - :                        }
    3.14 -I'm skipping whatever remains of this command
    3.15 -I found no \citation commands---while reading file Future_Architecture.aux
    3.16 -I found no database files---while reading file Future_Architecture.aux
    3.17 -
    3.18 -Here's how much of BibTeX's memory you used:
    3.19 - Cites:                 0 out of 7500
    3.20 - Fields:           125000 out of 125000
    3.21 - Hash table:        34852 out of 35000
    3.22 - Strings:             495 out of 30000
    3.23 - String pool:        4008 out of 750000
    3.24 - Wizard functions:   2118 out of 10000
    3.25 -(There were 3 error messages)
     4.1 --- a/0__Papers/Future_Architecture/latex/Future_Architecture.tex.Backup	Thu Apr 12 08:53:52 2012 -0700
     4.2 +++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
     4.3 @@ -1,571 +0,0 @@
     4.4 -%
     4.5 -
     4.6 -
     4.7 -\documentclass[conference]{IEEEtran}
     4.8 -%
     4.9 -%\usepackage{makeidx,geometry,amssymb,graphicx,calc,ifthen}
    4.10 -\usepackage{amssymb,graphicx,calc,ifthen}
    4.11 -%
    4.12 -
    4.13 -% *** CITATION PACKAGES ***
    4.14 -%
    4.15 -%\usepackage{cite}
    4.16 -% cite.sty was written by Donald Arseneau
    4.17 -% V1.6 and later of IEEEtran pre-defines the format of the cite.sty package
    4.18 -% \cite{} output to follow that of IEEE. Loading the cite package will
    4.19 -% result in citation numbers being automatically sorted and properly
    4.20 -% "compressed/ranged". e.g., [1], [9], [2], [7], [5], [6] without using
    4.21 -% cite.sty will become [1], [2], [5]--[7], [9] using cite.sty. cite.sty's
    4.22 -% \cite will automatically add leading space, if needed. Use cite.sty's
    4.23 -% noadjust option (cite.sty V3.8 and later) if you want to turn this off.
    4.24 -% cite.sty is already installed on most LaTeX systems. Be sure and use
    4.25 -% version 4.0 (2003-05-27) and later if using hyperref.sty. cite.sty does
    4.26 -% not currently provide for hyperlinked citations.
    4.27 -% The latest version can be obtained at:
    4.28 -% http://www.ctan.org/tex-archive/macros/latex/contrib/cite/
    4.29 -% The documentation is contained in the cite.sty file itself.
    4.30 -
    4.31 -
    4.32 -
    4.33 -
    4.34 -
    4.35 -
    4.36 -% *** GRAPHICS RELATED PACKAGES ***
    4.37 -%
    4.38 -\ifCLASSINFOpdf
    4.39 -  % \usepackage[pdftex]{graphicx}
    4.40 -  % declare the path(s) where your graphic files are
    4.41 -  % \graphicspath{{../pdf/}{../jpeg/}}
    4.42 -  % and their extensions so you won't have to specify these with
    4.43 -  % every instance of \includegraphics
    4.44 -  % \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
    4.45 -\else
    4.46 -  % or other class option (dvipsone, dvipdf, if not using dvips). graphicx
    4.47 -  % will default to the driver specified in the system graphics.cfg if no
    4.48 -  % driver is specified.
    4.49 -  % \usepackage[dvips]{graphicx}
    4.50 -  % declare the path(s) where your graphic files are
    4.51 -  % \graphicspath{{../eps/}}
    4.52 -  % and their extensions so you won't have to specify these with
    4.53 -  % every instance of \includegraphics
    4.54 -  % \DeclareGraphicsExtensions{.eps}
    4.55 -\fi
    4.56 -% graphicx was written by David Carlisle and Sebastian Rahtz. It is
    4.57 -% required if you want graphics, photos, etc. graphicx.sty is already
    4.58 -% installed on most LaTeX systems. The latest version and documentation can
    4.59 -% be obtained at: 
    4.60 -% http://www.ctan.org/tex-archive/macros/latex/required/graphics/
    4.61 -% Another good source of documentation is "Using Imported Graphics in
    4.62 -% LaTeX2e" by Keith Reckdahl which can be found as epslatex.ps or
    4.63 -% epslatex.pdf at: http://www.ctan.org/tex-archive/info/
    4.64 -%
    4.65 -% latex, and pdflatex in dvi mode, support graphics in encapsulated
    4.66 -% postscript (.eps) format. pdflatex in pdf mode supports graphics
    4.67 -% in .pdf, .jpeg, .png and .mps (metapost) formats. Users should ensure
    4.68 -% that all non-photo figures use a vector format (.eps, .pdf, .mps) and
    4.69 -% not a bitmapped formats (.jpeg, .png). IEEE frowns on bitmapped formats
    4.70 -% which can result in "jaggedy"/blurry rendering of lines and letters as
    4.71 -% well as large increases in file sizes.
    4.72 -%
    4.73 -% You can find documentation about the pdfTeX application at:
    4.74 -% http://www.tug.org/applications/pdftex
    4.75 -
    4.76 -
    4.77 -
    4.78 -
    4.79 -
    4.80 -% *** MATH PACKAGES ***
    4.81 -%
    4.82 -%\usepackage[cmex10]{amsmath}
    4.83 -% A popular package from the American Mathematical Society that provides
    4.84 -% many useful and powerful commands for dealing with mathematics. If using
    4.85 -% it, be sure to load this package with the cmex10 option to ensure that
    4.86 -% only type 1 fonts will utilized at all point sizes. Without this option,
    4.87 -% it is possible that some math symbols, particularly those within
    4.88 -% footnotes, will be rendered in bitmap form which will result in a
    4.89 -% document that can not be IEEE Xplore compliant!
    4.90 -%
    4.91 -% Also, note that the amsmath package sets \interdisplaylinepenalty to 10000
    4.92 -% thus preventing page breaks from occurring within multiline equations. Use:
    4.93 -%\interdisplaylinepenalty=2500
    4.94 -% after loading amsmath to restore such page breaks as IEEEtran.cls normally
    4.95 -% does. amsmath.sty is already installed on most LaTeX systems. The latest
    4.96 -% version and documentation can be obtained at:
    4.97 -% http://www.ctan.org/tex-archive/macros/latex/required/amslatex/math/
    4.98 -
    4.99 -
   4.100 -
   4.101 -
   4.102 -
   4.103 -% *** SPECIALIZED LIST PACKAGES ***
   4.104 -%
   4.105 -%\usepackage{algorithmic}
   4.106 -% algorithmic.sty was written by Peter Williams and Rogerio Brito.
    4.107 -% This package provides an algorithmic environment for describing algorithms.
   4.108 -% You can use the algorithmic environment in-text or within a figure
   4.109 -% environment to provide for a floating algorithm. Do NOT use the algorithm
   4.110 -% floating environment provided by algorithm.sty (by the same authors) or
   4.111 -% algorithm2e.sty (by Christophe Fiorio) as IEEE does not use dedicated
   4.112 -% algorithm float types and packages that provide these will not provide
   4.113 -% correct IEEE style captions. The latest version and documentation of
   4.114 -% algorithmic.sty can be obtained at:
   4.115 -% http://www.ctan.org/tex-archive/macros/latex/contrib/algorithms/
   4.116 -% There is also a support site at:
   4.117 -% http://algorithms.berlios.de/index.html
   4.118 -% Also of interest may be the (relatively newer and more customizable)
   4.119 -% algorithmicx.sty package by Szasz Janos:
   4.120 -% http://www.ctan.org/tex-archive/macros/latex/contrib/algorithmicx/
   4.121 -
   4.122 -
   4.123 -
   4.124 -
   4.125 -% *** ALIGNMENT PACKAGES ***
   4.126 -%
   4.127 -%\usepackage{array}
   4.128 -% Frank Mittelbach's and David Carlisle's array.sty patches and improves
   4.129 -% the standard LaTeX2e array and tabular environments to provide better
   4.130 -% appearance and additional user controls. As the default LaTeX2e table
   4.131 -% generation code is lacking to the point of almost being broken with
   4.132 -% respect to the quality of the end results, all users are strongly
   4.133 -% advised to use an enhanced (at the very least that provided by array.sty)
   4.134 -% set of table tools. array.sty is already installed on most systems. The
   4.135 -% latest version and documentation can be obtained at:
   4.136 -% http://www.ctan.org/tex-archive/macros/latex/required/tools/
   4.137 -
   4.138 -
   4.139 -%\usepackage{mdwmath}
   4.140 -%\usepackage{mdwtab}
   4.141 -% Also highly recommended is Mark Wooding's extremely powerful MDW tools,
   4.142 -% especially mdwmath.sty and mdwtab.sty which are used to format equations
   4.143 -% and tables, respectively. The MDWtools set is already installed on most
    4.144 -% LaTeX systems. The latest version and documentation is available at:
   4.145 -% http://www.ctan.org/tex-archive/macros/latex/contrib/mdwtools/
   4.146 -
   4.147 -
   4.148 -% IEEEtran contains the IEEEeqnarray family of commands that can be used to
   4.149 -% generate multiline equations as well as matrices, tables, etc., of high
   4.150 -% quality.
   4.151 -
   4.152 -
   4.153 -%\usepackage{eqparbox}
   4.154 -% Also of notable interest is Scott Pakin's eqparbox package for creating
   4.155 -% (automatically sized) equal width boxes - aka "natural width parboxes".
   4.156 -% Available at:
   4.157 -% http://www.ctan.org/tex-archive/macros/latex/contrib/eqparbox/
   4.158 -
   4.159 -
   4.160 -
   4.161 -
   4.162 -
   4.163 -% *** SUBFIGURE PACKAGES ***
   4.164 -%\usepackage[tight,footnotesize]{subfigure}
   4.165 -% subfigure.sty was written by Steven Douglas Cochran. This package makes it
   4.166 -% easy to put subfigures in your figures. e.g., "Figure 1a and 1b". For IEEE
   4.167 -% work, it is a good idea to load it with the tight package option to reduce
   4.168 -% the amount of white space around the subfigures. subfigure.sty is already
   4.169 -% installed on most LaTeX systems. The latest version and documentation can
   4.170 -% be obtained at:
   4.171 -% http://www.ctan.org/tex-archive/obsolete/macros/latex/contrib/subfigure/
    4.172 -% subfigure.sty has been superseded by subfig.sty.
   4.173 -
   4.174 -
   4.175 -
   4.176 -%\usepackage[caption=false]{caption}
   4.177 -%\usepackage[font=footnotesize]{subfig}
   4.178 -% subfig.sty, also written by Steven Douglas Cochran, is the modern
   4.179 -% replacement for subfigure.sty. However, subfig.sty requires and
   4.180 -% automatically loads Axel Sommerfeldt's caption.sty which will override
   4.181 -% IEEEtran.cls handling of captions and this will result in nonIEEE style
   4.182 -% figure/table captions. To prevent this problem, be sure and preload
    4.183 -% caption.sty with its "caption=false" package option. This will preserve
    4.184 -% IEEEtran.cls handling of captions. Version 1.3 (2005/06/28) and later
   4.185 -% (recommended due to many improvements over 1.2) of subfig.sty supports
   4.186 -% the caption=false option directly:
   4.187 -%\usepackage[caption=false,font=footnotesize]{subfig}
   4.188 -%
   4.189 -% The latest version and documentation can be obtained at:
   4.190 -% http://www.ctan.org/tex-archive/macros/latex/contrib/subfig/
   4.191 -% The latest version and documentation of caption.sty can be obtained at:
   4.192 -% http://www.ctan.org/tex-archive/macros/latex/contrib/caption/
   4.193 -
   4.194 -
   4.195 -
   4.196 -
   4.197 -% *** FLOAT PACKAGES ***
   4.198 -%
   4.199 -%\usepackage{fixltx2e}
   4.200 -% fixltx2e, the successor to the earlier fix2col.sty, was written by
   4.201 -% Frank Mittelbach and David Carlisle. This package corrects a few problems
   4.202 -% in the LaTeX2e kernel, the most notable of which is that in current
   4.203 -% LaTeX2e releases, the ordering of single and double column floats is not
   4.204 -% guaranteed to be preserved. Thus, an unpatched LaTeX2e can allow a
   4.205 -% single column figure to be placed prior to an earlier double column
   4.206 -% figure. The latest version and documentation can be found at:
   4.207 -% http://www.ctan.org/tex-archive/macros/latex/base/
   4.208 -
   4.209 -
   4.210 -
   4.211 -%\usepackage{stfloats}
   4.212 -% stfloats.sty was written by Sigitas Tolusis. This package gives LaTeX2e
   4.213 -% the ability to do double column floats at the bottom of the page as well
   4.214 -% as the top. (e.g., "\begin{figure*}[!b]" is not normally possible in
   4.215 -% LaTeX2e). It also provides a command:
   4.216 -%\fnbelowfloat
   4.217 -% to enable the placement of footnotes below bottom floats (the standard
   4.218 -% LaTeX2e kernel puts them above bottom floats). This is an invasive package
   4.219 -% which rewrites many portions of the LaTeX2e float routines. It may not work
   4.220 -% with other packages that modify the LaTeX2e float routines. The latest
   4.221 -% version and documentation can be obtained at:
   4.222 -% http://www.ctan.org/tex-archive/macros/latex/contrib/sttools/
   4.223 -% Documentation is contained in the stfloats.sty comments as well as in the
   4.224 -% presfull.pdf file. Do not use the stfloats baselinefloat ability as IEEE
   4.225 -% does not allow \baselineskip to stretch. Authors submitting work to the
   4.226 -% IEEE should note that IEEE rarely uses double column equations and
   4.227 -% that authors should try to avoid such use. Do not be tempted to use the
   4.228 -% cuted.sty or midfloat.sty packages (also by Sigitas Tolusis) as IEEE does
   4.229 -% not format its papers in such ways.
   4.230 -
   4.231 -
   4.232 -
   4.233 -
   4.234 -
   4.235 -% *** PDF, URL AND HYPERLINK PACKAGES ***
   4.236 -%
   4.237 -%\usepackage{url}
   4.238 -% url.sty was written by Donald Arseneau. It provides better support for
   4.239 -% handling and breaking URLs. url.sty is already installed on most LaTeX
   4.240 -% systems. The latest version can be obtained at:
   4.241 -% http://www.ctan.org/tex-archive/macros/latex/contrib/misc/
   4.242 -% Read the url.sty source comments for usage information. Basically,
   4.243 -% \url{my_url_here}.
   4.244 -
   4.245 -
   4.246 -
   4.247 -
   4.248 -
   4.249 -% *** Do not adjust lengths that control margins, column widths, etc. ***
   4.250 -% *** Do not use packages that alter fonts (such as pslatex).         ***
   4.251 -% There should be no need to do such things with IEEEtran.cls V1.6 and later.
   4.252 -% (Unless specifically asked to do so by the journal or conference you plan
   4.253 -% to submit to, of course. )
   4.254 -
   4.255 -
   4.256 -% correct bad hyphenation here
   4.257 -\hyphenation{op-tical net-works semi-conduc-tor}
   4.258 -
   4.259 -
   4.260 -\begin{document}
   4.261 -
   4.262 -\bibliographystyle{plain}
   4.263 -%
   4.264 -
   4.265 -\title{Position:  Support Runtimes in Hardware,\\  Rather than Specific Parallelism Constructs}
   4.266 -
   4.267 -\author
   4.268 -{
   4.269 - \IEEEauthorblockN{Sean Halle}
   4.270 - \IEEEauthorblockA
   4.271 - {
   4.272 -   Open Source Research Institute\\
   4.273 -   Email: sean.halle@osri.org
   4.274 - }
   4.275 -}
   4.276 -
   4.277 -
   4.278 -\maketitle             
   4.279 -%
   4.280 -
   4.281 -\begin{abstract}
    4.282 -This is a position paper, whose purpose is to provide food for thought and a starting point for debate.  The ideas, however, are extrapolations from published work on runtime systems and hardware abstractions that have been implemented and successfully demonstrated.
   4.283 -
    4.284 -The main premise is that no parallelism constructs should be directly implemented in hardware; rather, they should be separated into a new category of \emph{firmware} that is tightly integrated into the processor pipeline and managed by the OS.  We describe hardware structures that allow traditional thread constructs, domain-specific constructs, transactional memory, and even consistency models to be implemented with extremely low overhead, as well as cooperatively engage the language's runtime in pipeline-level hardware-resource management.
   4.285 -
    4.286 -We further take the position that software should be organized into a stack, based around \emph{specialization} of source to target hardware. Each layer of the stack has a role in the specialization process, which spans the lifetime of application code as it goes through the stages of development, transformation to hardware-specific form, installation, and execution.  Hence, specialization includes the toolchain, hand-tuning, auto-tuners, multi-kernels, profiling, and binary optimization. We describe infrastructure to encapsulate and organize these.
   4.287 -\end{abstract}
   4.288 -
   4.289 -
   4.290 -
   4.291 -\section{What parallel abstractions should the hardware provide?}
   4.292 -
    4.293 -Our position is that the hardware should not directly supply any parallel abstractions.  Instead, it should supply a mechanism that elevates the language runtime to the status of a Hardware Abstraction Layer, which is separate from the executable and separate from the OS.  Thus, parallel abstractions are implemented as soft-extensions to the hardware.  With suitable support, many firmware-implemented parallel abstractions would require only a handful of instructions and a similarly small number of cycles of overhead.
   4.294 -
   4.295 -
   4.296 -\begin{figure}[ht]
   4.297 - \center{
   4.298 - \includegraphics[width=3in, height=2in]{../figures/Substitute_instr_with_firm-ware.eps}
   4.299 - }
   4.300 - \caption
   4.301 - {A special op-code is recognized by the decode stage, and triggers fetch of instructions from firm-ware. The firm-ware instrs are provided to the OS as a ``hardware driver", and implement the runtime behavior of a language. The application communicates to the runtime by placing pointers to data-structures into registers just before executing the ``switch to runtime" instruction, which starts the fetch from firm-ware. Helper instructions accelerate common runtime operations, such as hash-table lookups, communication, search-for-optimum, and so on.   
   4.302 -  }
   4.303 -\label{figTimeMapping}
   4.304 -\end{figure}
   4.305 -
   4.306 -
    4.307 -Precedent for such soft-extension of instruction sets exists. The Alpha chips from DEC provided firmware that implemented complex VAX instructions this way.  A VAX ``firmware" instruction was executed by switching fetch over to a special memory that contained normal Alpha instructions, which implemented the functionality of the VAX instruction.
   4.308 -
    4.309 -An analogous approach is illustrated in Figure \ref{figTimeMapping}. Here, one op-code is set aside as the ``invoke runtime" operation.  Its execution switches instruction fetch over to the firm-ware. Information is communicated via register contents, which point to data-structures that include a hardware-defined portion and a language-defined portion.
   4.310 -
    4.311 -This firmware is written by the language provider, so it is separate from the executable. It implements the behavior of the language's parallelism constructs.
   4.312 -
    4.313 -Such an approach addresses security, portability, and efficiency. It is secure because the OS controls the firm-ware. It is portable because the executable contains only the \emph{interface} to the constructs (the implementation is separate).  It is efficient because the firm-ware runs in user-space, and switching to it costs the same as a \texttt{call}. This also improves application performance, if the hardware gives the firm-ware control over low-level behaviors such as hardware-supported swapping of contexts and control of hybrid cache/scratchpad memory.
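The register convention just described can be sketched in plain C. This is an illustrative simulation only: the names (rt_request, invoke_runtime, RT_SPAWN) are invented, and an ordinary function call stands in for the special op-code, reflecting the claim that the switch costs about the same as a \texttt{call}.

```c
/* Illustrative sketch: the "switch to runtime" convention, simulated in C.
 * All names here are hypothetical, not taken from the paper. */
#include <assert.h>

/* Request block: a hardware-defined header plus a language-defined payload.
 * The application builds this and passes a pointer (standing in for a
 * register) to the switch instruction. */
typedef struct {
    int construct_id;    /* hardware-defined: which construct is invoked */
    void *lang_payload;  /* language-defined portion */
} rt_request;

enum { RT_SPAWN = 1, RT_SEND = 2 };  /* example construct IDs */

static int tasks_spawned = 0;

/* Firm-ware entry point: in real hardware these instructions would be
 * fetched from firm-ware after the decode stage sees the special op-code. */
static int firmware_runtime(rt_request *req)
{
    switch (req->construct_id) {
    case RT_SPAWN: tasks_spawned++; return 0;
    case RT_SEND:  /* would hand off to communication firm-ware */ return 0;
    default:       return -1;  /* unknown construct */
    }
}

/* Application side: fill the request, then "execute the switch". */
static int invoke_runtime(int construct, void *payload)
{
    rt_request req = { construct, payload };
    return firmware_runtime(&req);
}
```

The split between the hardware-defined header and the language-defined payload mirrors the text's claim that the executable carries only the interface to a construct, while the firm-ware carries its implementation.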
   4.314 -
    4.315 -An important feature is that application information can directly affect hardware-level scheduling, communication, and other resource management. This is because the runtime has hardware control and receives application information such as the semantics of the parallelism construct invoked, the data consumed by a task, and explicit information inserted by the toolchain for the runtime. These affect the choice of which resources to assign to a given task, and when to suspend and resume tasks.
   4.316 -
   4.317 -Portability improves because only the \emph{interface} to constructs is encoded in the executable. Implementation is free to change from one processor to another, or even from one level of a machine's hierarchy to another.
   4.318 -
   4.319 -\subsection{Communications via firm-ware}
   4.320 -
   4.321 -Another portability benefit is realized when firm-ware becomes the application gateway to communication. This lets parallelism constructs be application-oriented, merely implying communications, without specifying or controlling details.
   4.322 -
   4.323 -Instead, the  firm-ware  controls activities such as marshalling data and invoking the hardware to communicate it, while linking the communication status to  creation, suspension, and resumption of tasks.
   4.324 -
   4.325 -Figure X illustrates the breakdown of responsibilities, and Figure X shows dynamically the steps of invoking the firmware, sending communications, and suspending and resuming virtual processors that animate the tasks.
   4.326 -
   4.327 -
   4.328 -
   4.329 -\subsection{Communication via separate helper processors}
   4.330 -
    4.331 -Placing communication inside the firm-ware makes it practical to add separate helper processors that overlap communication with computation, as illustrated in Fig \ref{figCommProcr}. These processors execute separate firm-ware, supplied either by the OS or as part of the executable.
   4.332 - 
   4.333 -
   4.334 -
   4.335 -\begin{figure}[ht]
   4.336 - \center{
   4.337 - \includegraphics[width=3in, height=1.5in]{../figures/Separate_comm_processors.eps}
   4.338 - }
   4.339 - \caption
   4.340 - {Communication is performed between local memory and remote memories via a separate communication processor.  This processor executes firm-ware that is loaded under OS control. For example, it may run a standard software-cache or run scatter-gather code extracted from the application.   
   4.341 -  }
   4.342 -\label{figCommProcr}
   4.343 -\end{figure}
   4.344 -
   4.345 - 
   4.346 - 
   4.347 -A cogent example is an application with complex data structures that are communicated between long-running tasks. During a task, some portion of the data-structure is bundled up and sent to another task. 
   4.348 -
   4.349 -The language provides constructs for rendez-vous style send and receive, and constructs that identify the bundle-data and unbundle-data code.  Send and receive are implemented as part of the language, as runtime firm-ware. In contrast, the bundle and unbundle code is extracted from the application by the toolchain and packaged into the executable. During the run, an OS call causes that bundle and unbundle \emph{communication} firm-ware to be linked into the communication processors.
   4.350 -
   4.351 -When a task executes send or receive, the firm-ware swaps the context out, suspending the task, and replaces it with a non-blocked task. Simultaneously, the firm-ware causes the communication processor to execute bundle or unbundle code.  When communication completes, the task is unblocked.
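The hand-off just described, in which send suspends the calling task and the communication processor's bundle code later unblocks it, can be sketched as follows. All structures and names are invented for illustration; a real implementation would be firm-ware driving hardware context swaps, not C functions.

```c
/* Minimal sketch of send suspending a task until bundling/transfer completes.
 * Hypothetical names; not the paper's actual firm-ware interface. */
#include <assert.h>

typedef enum { READY, BLOCKED } task_state;

typedef struct {
    task_state state;
    int pending_bytes;   /* bytes the bundle code still has to move */
} task;

/* "Runtime firm-ware": called when the task executes send. */
static void fw_send(task *t, int bytes)
{
    t->state = BLOCKED;        /* swap the context out, suspending the task */
    t->pending_bytes = bytes;  /* comm processor now runs the bundle code */
}

/* "Communication firm-ware": one step of the bundle/transfer loop. */
static void comm_step(task *t, int bytes_moved)
{
    if (t->state != BLOCKED) return;
    t->pending_bytes -= bytes_moved;
    if (t->pending_bytes <= 0)
        t->state = READY;      /* communication done: unblock the task */
}
```

The scheduler would run some other non-blocked task between fw_send and the final comm_step, which is where the overlap of communication with computation comes from.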
   4.352 -
   4.353 -
   4.354 -
   4.355 - This tight integration of communication with scheduling of tasks is an example of application information driving scheduling. It allows the firm-ware to decide which core to assign a task to based on application code, while maintaining ultra-low overhead.
   4.356 -
   4.357 -
   4.358 - 
   4.359 -Such bundle/unbundle doesn't work as well in cases where the data consumed has little predictability, or the application doesn't provide gather-scatter or bundle-unbundle information. In this case, the OS can link standard software-cache firm-ware into the communication processors.
   4.360 -
    4.361 -Such a cache has the advantage of being able to swap out tasks when it misses.  If the hardware makes the cost of switching tasks on the order of a normal function call, this scheme provides an efficient way to overlap cache misses with useful work, without the large area and energy overhead of out-of-order pipelines.
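The swap-on-miss idea can be made concrete with a toy direct-mapped software cache whose miss handler records a task switch instead of stalling. This is a sketch under invented names (sw_cache_access, the counters), assuming, as the text does, that a context switch is about as cheap as a call.

```c
/* Sketch: software cache in the communication processor that swaps the
 * requesting task out on a miss. Illustrative only. */
#include <assert.h>
#include <stdint.h>

#define LINES 4

typedef struct { uint32_t tag; int valid; } line;

static line cache[LINES];
static int misses = 0, switches = 0;

/* Returns 1 on a hit. On a miss, records the task switch (the useful work
 * that hides the miss latency) and installs the line as if the fetch by the
 * communication processor had completed. */
static int sw_cache_access(uint32_t addr)
{
    line *l = &cache[addr % LINES];
    if (l->valid && l->tag == addr) return 1;  /* hit */
    misses++;
    switches++;   /* firm-ware would swap in a non-blocked task here */
    l->tag = addr;
    l->valid = 1; /* fetch completes while the other task runs */
    return 0;
}
```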
   4.362 -
   4.363 -   Another potential advantage is adjusting the cache characteristics during the run to better match the application. The characteristics of the phase of computation can be measured, or the toolchain can insert the results of analysis.
   4.364 -
   4.365 -This would ideally be coupled with scratch-pad memory that is augmented with hardware that can treat a section of the memory as tags.  Special op-codes are implemented in the communication-processor to configure the tag memory, and then to cause tag-comparisons, and so on.  Previous work suggests that such a software cache would be only slightly slower than normal hard-wired caches, with modest area and energy overhead [].
   4.366 -
   4.367 -\subsection{Speculation and Fast Control Message Support}
   4.368 -
    4.369 -Hardware support for speculation will work especially well with a firm-ware runtime. Transactional memory[], thread-level speculation[], and higher-level speculative constructs[] could all be supported by generic lower-level mechanisms, which are in turn invoked by the firm-ware runtime.
   4.370 -
    4.371 -This arrangement has the benefit of isolating hardware from a language's consistency-model and execution-model. There is no longer a large penalty for mis-match.  To get this decoupling, the hardware is simplified by factoring the semantics out, leaving only generic ``ordering'' primitives.
   4.372 -
   4.373 -
   4.374 -
   4.375 -\begin{figure}[ht!]
   4.376 - \center
   4.377 - { \includegraphics[width=3in, height=1.5in]{../figures/Speculation_HW_support.eps}
   4.378 - }
   4.379 - \caption
   4.380 - {  
   4.381 -  }
   4.382 -\label{figSpecHW}
   4.383 -\end{figure}
   4.384 -
   4.385 -
   4.386 -
    4.387 -Fig \ref{figSpecHW} illustrates such a refactoring, with hardware support for consistency and speculation. Example primitives include check-pointing, sand-boxing, and tie-points [cite web with tie-point videos], none of which imply application-visible semantics. Rather, they are used inside the firm-ware runtime to build transactional memory, thread-level speculation, and consistency models such as acquire-release or flush-on-command.
   4.388 -
    4.389 -For check-pointing, local memory has tags, just as in caches, but with an additional field that holds a check-point number. Writes are performed only to lines with the same check-point number; if none exists, a read is performed, either of the most recent previous check-point or fresh from remote memory. The hardware supports sending and comparing lists of lines with the same check-point number, as well as sending the lines from a particular checkpoint. This efficiently supports Thread-Level Speculation, with simple roll-back and commit.
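The check-point tag scheme can be sketched as a small versioned memory: each line carries a check-point number, writes copy-on-write a line for the current check-point, and roll-back discards the aborted check-point's lines. Layout and names (ckpt_write, ckpt_read, ckpt_rollback) are illustrative assumptions, not the paper's hardware interface.

```c
/* Sketch: tag memory with a check-point-number field, as described in the
 * text. A flat array stands in for tagged local-memory lines. */
#include <assert.h>

#define MAXLINES 16

typedef struct { int addr, ckpt, value, live; } tagged_line;

static tagged_line mem[MAXLINES];
static int nlines = 0;

/* Most recent live version of addr with check-point number <= ckpt. */
static tagged_line *find(int addr, int ckpt)
{
    tagged_line *best = 0;
    for (int i = 0; i < nlines; i++)
        if (mem[i].live && mem[i].addr == addr && mem[i].ckpt <= ckpt &&
            (!best || mem[i].ckpt > best->ckpt))
            best = &mem[i];
    return best;
}

/* Write only to a line of this check-point, creating one (copy-on-write)
 * from the most recent previous version if none exists. */
static void ckpt_write(int addr, int ckpt, int value)
{
    tagged_line *l = find(addr, ckpt);
    if (!l || l->ckpt != ckpt) {
        l = &mem[nlines++];
        l->addr = addr; l->ckpt = ckpt; l->live = 1;
    }
    l->value = value;
}

static int ckpt_read(int addr, int ckpt)
{
    tagged_line *l = find(addr, ckpt);
    return l ? l->value : 0;  /* 0 stands in for a fetch from remote memory */
}

/* Roll-back: discard all lines of an aborted check-point. */
static void ckpt_rollback(int ckpt)
{
    for (int i = 0; i < nlines; i++)
        if (mem[i].ckpt == ckpt) mem[i].live = 0;
}
```

Commit would be the dual operation: mark the check-point's lines as the new baseline and free older versions; the same extra tag field reused as a sandbox ID gives the transactional-memory variant described next.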
   4.390 -
    4.391 -Sandboxes use the same hardware, except that the extra tag holds a sandbox ID instead of a check-point number. For transactional memory, each transaction gets its own sandbox ID. This supports TCC-style transactional memory implementations[cite Lujan].
   4.392 -
    4.393 -Check-points may also be used to support shared-memory-style consistency models, speculatively. New check-points are periodically generated, while previous ones are examined for conflicts. Examination takes place in the communication processors, supported by hardware for comparing lists of tags. A conflict causes roll-back, and restart with updated state from one of the conflicting local memories.
   4.394 -
    4.395 -Such hardware can also be used to turn off the tight consistency of current snooping-based protocols for the bulk of computation, saving time and energy in code that doesn't need it. Tight consistency is enabled only for the few specialized portions of code that implement synchronization and communication through shared variables, essentially for passing control messages, such as in software-based mutex algorithms.
   4.396 -
   4.397 -Another alternative is to only update shared memory when synchronization constructs indicate handoff of ownership.   This uses the sandbox hardware to track individual objects or data structures. The synchronization construct in the runtime firm-ware triggers the communication firm-ware to update all objects on the core gaining ownership, from modifications made on the core giving up ownership. This not only eliminates the time and energy lost to snooping and directory protocols, but also simplifies the programming model and removes non-portable shared-memory code from executables.
   4.398 -
    4.399 -These approaches rely on fast control messages that communicate lists of tags between cores, allowing the firm-ware runtimes to use only local data.
    4.400 -Runtime performance is highest when each core has its own local runtime state. Specialized high-speed ``control'' messages in hardware also let the local runtimes communicate constraint updates and explicitly send task-stubs to each other for load balance.
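A minimal software model of such a control-message path is sketched below, assuming a fixed-size message of a few words (naming a task-stub or tag list rather than carrying bulk data) and a single-producer, single-consumer ring per core pair. All names, field choices, and sizes are illustrative, not from any real interconnect.

```c
#include <stdint.h>

/* One small control message: a kind, a source core, and a payload word
 * that names a task-stub or a tag-list handle -- never bulk data. */
typedef struct {
    uint8_t  kind;       /* e.g. TASK_STUB or CONSTRAINT_UPDATE */
    uint8_t  src_core;
    uint16_t len;
    uint64_t payload;
} CtrlMsg;

#define RING_SLOTS 16

/* Single-producer / single-consumer ring between two cores. */
typedef struct {
    CtrlMsg  slot[RING_SLOTS];
    unsigned head, tail;
} CtrlRing;

/* Non-blocking send: returns 0 if the receiver has fallen behind. */
int ctrl_send(CtrlRing *r, CtrlMsg m)
{
    unsigned next = (r->head + 1) % RING_SLOTS;
    if (next == r->tail) return 0;   /* ring full */
    r->slot[r->head] = m;
    r->head = next;
    return 1;
}

/* Non-blocking receive: returns 0 when nothing is pending. */
int ctrl_recv(CtrlRing *r, CtrlMsg *out)
{
    if (r->tail == r->head) return 0;
    *out = r->slot[r->tail];
    r->tail = (r->tail + 1) % RING_SLOTS;
    return 1;
}
```

The point of the sketch is the shape of the traffic: messages are tiny and latency-bound, so in hardware they would bypass the bulk-data path entirely.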
   4.401 -
   4.402 -Such internal-to-runtime messages have only small amounts of data, while their latency is crucial to the runtime's responsiveness.  A slowly responding runtime will leave its core idle more often, because the rate of handling internal bookkeeping about tasks is slower than the rate of finishing those tasks. It is in this case that fast control messages become crucial [Charm++ runtime paper].
   4.403 -
   4.404 -\subsection{Example}
   4.405 -
    4.406 -To illustrate such hardware in action, we walk through an application binary invoking the ``acquire mutex'' parallelism construct:
   4.407 -\paragraph{setup and switch}
    4.408 -At the appropriate place in the binary, instructions load one register with the pointer to a mutex structure, and another register with the pointer to the virtual processor (VP) requesting the mutex lock. Next, the \texttt{switch} instruction executes, which switches fetch over to the firm-ware of the runtime while saving the stack and frame pointers into the data structure of the requesting VP.
   4.409 -
    4.410 -In this example, the hardware specifies a ``virtual processor'' (VP) data structure. It begins with a hardware-defined portion that the \texttt{switch} instruction automatically manages.
   4.411 -
   4.412 -\paragraph{runtime internals}
    4.413 -After \texttt{switch}, runtime code executes from the protected firm-ware. The code for mutex-acquire expects a pointer to a mutex struct in a particular register, checks the ``current owner'' field, and if it is empty writes into it the pointer to the VP (held in another register). It then marks the VP as unblocked. If instead the mutex is already owned, it places the VP into the mutex struct's queue, where it remains blocked.
   4.414 -
    4.415 -Most importantly, if the mutex is already owned, the runtime swaps the requesting VP out of the hardware context and swaps in an unblocked VP.
   4.416 -
    4.417 -The execution time of this can be on the order of 10 cycles. Such speed requires hardware support for swapping VPs in and out, such as set-aside cache or scratch-pad memory with a wide port to the registers, plus speculative access to the mutex data structure. This makes all memory accesses local and fast.
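The firmware path just walked through can be sketched as plain C. The `VP` and `Mutex` layouts, the argument-passing (standing in for the registers loaded before \texttt{switch}), and the way the next VP is chosen are all hypothetical stand-ins for the hardware-defined VP structure and VP-swap support described above.

```c
#include <stddef.h>

/* Hypothetical virtual-processor record; the hardware-managed portion
 * (saved stack and frame pointers) would follow these fields. */
typedef struct VP {
    struct VP *next;     /* queue link while blocked on a mutex */
    int        blocked;
} VP;

typedef struct {
    VP *owner;           /* the "current owner" field */
    VP *queue_head, *queue_tail;
} Mutex;

/* Entered via the switch instruction; the two "registers" arrive here
 * as arguments.  Returns the VP that should occupy the hardware
 * context next. */
VP *firmware_mutex_acquire(Mutex *m, VP *requester, VP *ready_vp)
{
    if (m->owner == NULL) {
        m->owner = requester;       /* uncontended: take ownership   */
        requester->blocked = 0;
        return requester;           /* same VP keeps the core        */
    }
    /* contended: enqueue the requester and leave it blocked */
    requester->blocked = 1;
    requester->next = NULL;
    if (m->queue_tail) m->queue_tail->next = requester;
    else               m->queue_head = requester;
    m->queue_tail = requester;
    return ready_vp;                /* swap in an unblocked VP       */
}
```

Note that nothing here needs an atomic instruction: the firmware's speculative, exclusive access to the mutex line plays that role, as the next paragraph explains.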
   4.418 -
    4.419 -The speculative access would be verified while computation continues. If memory consistency is performed only on command of the runtime, and the hardware supports check-point and roll-back, as in Lujan's work[], then computation can continue without a speed penalty.
   4.420 -
    4.421 -Notice that no atomic memory instructions have been used. Further, the executable contains nothing but interfaces to high-level constructs. All operations have been local and fast, despite maintaining global consistency of a global address space.
   4.422 -
   4.423 -
   4.424 -
   4.425 -\section{Which should be the responsibility / functionality of the programmer, the runtime software, and the hardware?}
   4.426 -
   4.427 -
    4.428 -With such a hardware arrangement, the responsibilities naturally break down along the lines of a software stack []. The stack's goal is to support specialization: the process of transforming the original source into a form that is highly efficient on the target hardware.
   4.429 -
    4.430 -Each layer of the stack has some role in the specialization process, while the application, on top, provides the information the rest of the stack needs to perform it. Ideally, the application should neither expose hardware assumptions nor hinder specialization for particular targets.
   4.431 -
   4.432 -The proposed hardware naturally supports such a stack. The bottom layer is an interface to simplify creation of the firm-ware runtime implementations. The set of runtimes themselves forms the next layer above that. Above the runtimes is the set of toolchains that generate the executables that talk to the runtimes. Above the toolchains is the set of language-interfaces, and above that, at the top, is the set of applications.
   4.433 - 
    4.434 -The applications expose only constructs, which are designed to avoid hardware implications. Languages with such constructs include CnC[], WorkTable[] and HWSim[]. The concurrency constructs are implemented by the runtimes. This alone doesn't ensure portability, but it goes a long way towards that goal by removing the largest source of hardware-specific information.
   4.435 -
   4.436 - 
    4.437 -Such a stack supports high productivity through domain-specific languages, such as HWSim, making them simple to create, easy to port across hardware, and high in performance. The application programmer is responsible only for application-relevant concepts, which reduces the learning curve and matches the language to their mental model. Domain-specific parallelism constructs are provided either embedded-style, as library calls, or with compiler support.
   4.438 -
   4.439 -The constructs help specialization by identifying the tasks, the constraints on scheduling the tasks, and the data to be communicated between tasks.
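As a sketch of what a construct might hand the runtime, the following hypothetical task descriptor captures exactly those three things: the unit of work, its scheduling constraints, and the data it communicates. Every field name here is our own invention, not part of any of the cited languages.

```c
#include <stddef.h>

typedef struct Task Task;
struct Task {
    void  (*body)(void *);       /* the unit of work                  */
    void   *args;
    Task  **depends_on;          /* scheduling constraints            */
    int     n_deps;
    int     done;                /* set by the runtime on completion  */
    void   *in_data, *out_data;  /* data communicated between tasks   */
    size_t  in_size, out_size;
};

/* A runtime may schedule a task once every constraint is satisfied. */
int task_is_ready(const Task *t)
{
    for (int i = 0; i < t->n_deps; i++)
        if (!t->depends_on[i]->done)
            return 0;
    return 1;
}
```

Because the descriptor names the data explicitly, a specializer can re-lay-out `in_data` or re-size the task without inspecting the body, which is the point of exposing this information through constructs.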
   4.440 -
    4.441 -In addition, high-quality specialization requires certain ``helpers''[]. These enable: 1) modifying the layout and order of access of data, 2) modifying the size of a task, both the data consumed and the code executed by it, and 3) predicting both the execution time and the data consumed by each task. An example is DKU[], which provides task-size-modification helpers.
   4.442 -
    4.443 -The helpers are either derived by the toolchain or encoded directly in the application via suitable constructs. Either way, the domain-specific constructs must be designed so that the information is captured in a form convenient for the tools to extract.
   4.444 -
    4.445 -One last concern is the creation of all these firm-ware runtimes. It would be good to make them as uniform as possible, which reduces the work of creating one for a particular language by reusing the same interface across many languages. An example is the Virtualized Master-Slave interface[].
   4.446 -
   4.447 -
   4.448 -
   4.449 -
   4.450 -
   4.451 -\section{Specific Topics of Interest}
   4.452 -Now that a position has been stated, let us examine how it applies to the topics of interest, to check its consistency and usefulness.
   4.453 -\paragraph{enabling future parallel programming models}
    4.454 -\texttt{switch}-to-runtime supports current parallel programming models and enables foreseeable future ones. It maintains very low overhead for them by embedding the switch mechanism in the pipeline and by providing hardware support for common runtime constraint-management and assignment operations, such as hash tables and context swapping. The combination of software flexibility with efficiency, plus the added bonus of bringing application information into the lowest-hardware-level resource management, appears strong.
   4.455 -
   4.456 -\paragraph{innovative architectural execution models} Our position advocates isolating the architectural execution model from the programming model.  \texttt{switch}-to-runtime lets widely different hardware all implement the same programming model. This gives hardware freedom to explore, without code legacy constraining it.
   4.457 -However, high-speed internal-to-runtime messages, speculation support, and decoupled communication processors  may be considered elements of an architectural execution model advocated by our position.
   4.458 -
    4.459 -\paragraph{novel memory hierarchies} Our position suggests that memories be coupled with their own communication processor, which performs all movement of data to remote memories. It also suggests that memories be configurable, with tags that include check-point and sandbox IDs, along with hardware for sending lists of tags that have a given ID and the ability to check tags against such a list.
    4.460 -Together, these features should efficiently implement transactional memory, thread-level speculation, acquire-release, and speculative implementations of the tighter variations on sequential consistency.
    4.461 -\paragraph{simplified and scalable memory models} The communication processor plus speculation hardware can support a wide variety of consistency models, including the simplified high-level ones implied by domain-specific constructs. The speculation, and its linkage to context swapping, allows memory consistency and communication to overlap work in the work processor. Scalability is left to the communication firm-ware.
   4.462 -
   4.463 -\paragraph{high-level constructs for on-chip communications} The communication processors, with their own firm-ware, enable efficient implementation of essentially any high-level construct. Further, linkage between communication processor and firm-ware runtime in the work processor brings pipeline-level hardware control into the high-level constructs. As a result, high-level constructs can not only imply communications, but cause virtual-processors to be swapped out of hardware during communication so that it is overlapped with useful work from a different context. 
   4.464 -
   4.465 -\paragraph{characterization of the runtime overheads of parallel applications}
   4.466 -
    4.467 -\paragraph{future directions in programming massively parallel systems} We foresee a hierarchy of runtimes, each level tuned to one level in the HW hierarchy, together with algorithms and code that arrange data and perform computation in a ``fractal'' arrangement, with each level of hardware looking the same in terms of communication and computation activity. Thus, communication within the computation scales with level in the hierarchy in the same way that the communication available in the hardware scales.
   4.468 -
   4.469 -\paragraph{potential bottlenecks for future parallel systems}
    4.470 -The communication-to-computation ratio in hardware is worsening. We must find hierarchical approximations to problems, ones that accumulate lower-level results, so that the amount of communication decreases going up the HW hierarchy.
   4.471 -
   4.472 -
   4.473 -==================================
   4.474 -
    4.475 --] Main programmer-visible elements: causal ordering, names of data (pointers inside data-structs), communication of data, operations applied to data, units of work, scheduling events, and the resulting concrete sequences of work-unit instances, tied together at certain points (for dataflow, the firing of operations on data-sets; for functional, the application of a lambda to data-instances -- the tie is where a data-instance output flows to multiple inputs)
   4.476 -
   4.477 -
   4.478 --]  Runtime support includes:
   4.479 -
   4.480 --] "speculative exclusive access to local memory-line"
   4.481 -
   4.482 -
   4.483 -
    4.484 --] HW to create a ``soft'' ctxt (a virtual processor with stack), checkpoint it, and restore a checkpoint
   4.485 -
    4.486 --] HW to accelerate common parallelism-construct ops, like hash-table, queue, and search-for-match (e.g., the runtime impl of mutexes and condition variables via queues, and of dataflow via hash-tables)
   4.487 -
   4.488 --] HW for multi-context stack (stuff talked about with Albert)
   4.489 -
    4.490 --] Malloc and Free in hardware, for virtual-processor creation and for namespaces
   4.491 -
    4.492 --] HW to support a ``namespace'', which is a chunk of allocated memory that a virtual processor sees. All pointers within a namespace are offsets from the start of the namespace, so a reserved register holds the namespace base addr, and pointers are added to that to get the final addr. This makes pointers equivalent to global ones, but relocatable. A namespace is essentially a stack with only one frame. When an out-of-namespace pointer is accessed, the target namespace is accessed, the data brought over and added to the end (or malloc'd into the namespace), and pointers within the data are translated to new offsets. This provides automated HW management of distributed memories. If the out-of-namespace pointer is within the same addr-space, it is directly accessed -- the HW has a number of base-addr regs, which it can swap in and out
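The base-plus-offset idea in this note can be modeled briefly in C. `Namespace`, `ns_ptr`, and `ns_deref` are illustrative stand-ins for the reserved base-address register and the hardware's add-to-base translation; the out-of-namespace fetch path is reduced to returning NULL.

```c
#include <stddef.h>
#include <stdint.h>

/* A pointer stored inside a namespace is just an offset from its base. */
typedef uint32_t ns_ptr;

typedef struct {
    uint8_t *base;    /* would live in the reserved base-addr register */
    uint32_t length;
} Namespace;

/* Translate an in-namespace pointer to a final address.  An offset
 * beyond the namespace would trigger the remote fetch path; here we
 * just signal it with NULL. */
void *ns_deref(const Namespace *ns, ns_ptr p)
{
    if (p >= ns->length)
        return NULL;
    return ns->base + p;
}
```

Relocation falls out for free: copying the namespace's bytes elsewhere and changing `base` leaves every stored offset valid, which is exactly the "global but relocatable" property the note claims.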
   4.493 -
   4.494 -
    4.495 --] HW support for independent code: performs translation of pointers from the previous memory-space to the new memory-space, so pointers become base plus offset, where base is the start of the memory-space
   4.496 -
    4.497 --] HW support for memory spaces: all data is viewed as existing within a memory-space, where that memory-space is a HW entity. It has a start address and a length, so all pointers are offsets from the start addr (this goes back to early main-frame ideas). In code there is no difference from shared memory -- all data is within a data-struct or array, and data-structs contain pointers -- the difference is that code supplied either by the language impl or by the programmer translates the pointers when data is copied or moved to a different memory-space. Each memory-space exists inside an addr-space, but is fully repositionable just by changing the base pointer. Thinking one memory-space per virtual processor (SW ctxt)?
   4.498 -
   4.499 -=======================================
   4.500 -
    4.501 -In this stack, application code sits at the top, held within the development tools, and the runtime is separate from the executable. The separation allows a single executable to run without modification on several versions of hardware, even though the runtime uses specialized hardware instructions.
   4.502 -
   4.503 -The end point is the triple goal: Productivity, Performant-Portability, and Adoptability.
   4.504 -
    4.505 --] Productivity comes from language design that provides a mental model close to the application's while hiding any influence from hardware
   4.506 -
    4.507 --] Performant-Portability is the most difficult technically, and boils down to the process of specializing code to the hardware. This process can span multiple points in an application's lifetime, which correspond to multiple levels of the software stack. For example, compiler transforms, then runtime choices (auto-tuners), and even swapping particular HW abstractions are all part of specializing code to the end hardware.
   4.508 -
   4.509 -
   4.510 -Then, responsibilities are assigned to layers and interfaces within the software stack:
   4.511 -
   4.512 -Application Layer: 
   4.513 -
    4.514 --] state features of the application, in terms of constructs provided by the language interface (constructs can be ``embedded'' into a base sequential language, or into a base parallel lang being enhanced -- for continuity with current code bases)
   4.515 -
   4.516 -Language Interface: 
   4.517 -
   4.518 --] Identify Units of work
   4.519 -
   4.520 --] state constraints on scheduling those units
   4.521 -
   4.522 --] provide what specialization needs to manipulate data layout and access order (specialization spans toolchain, runtime, and base HW abstraction) 
   4.523 -
   4.524 -
    4.525 --] provide for the toolchain to manipulate the data-size and code-content of a work-unit, provide for data ancestry (``data footprint'') to be tracked among work-units, provide for prediction of the execution time of a work-unit, and for real-time, provide for stating real-time constraints on the scheduling of units (latency, deadlines, quality relationship)
   4.526 -
    4.527 --- note: these don't all have to be language constructs; they could be, for example, code-snippets supplied to the language via a construct. The snippets are then used either in the toolchain or in the runtime. Examples: DKU for task re-sizing, WorkTable for dynamic dependencies (the H264 wait-until example)
    4.528 --- the purpose of each is stated in terms of the specialization process. Specialization is the embodiment of performant portability -- the term means any changes to the UCC done for purposes of performance (define UCC).
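A task re-sizing snippet of the kind noted here (in the spirit of DKU's divide helpers, though this interface is entirely our own invention) might look like the following: the programmer supplies one function, and the toolchain or runtime calls it to split a work-unit into pieces sized for the target hardware.

```c
#include <stddef.h>

/* A work-unit owns a contiguous slice of the input data. */
typedef struct {
    size_t start, len;
} WorkUnit;

/* Programmer-supplied re-size helper: split one unit into n roughly
 * equal pieces.  Returns the number of pieces written to `out`, or 0
 * if the unit is too small to split that far. */
int resize_helper(WorkUnit in, int n, WorkUnit *out)
{
    if (n < 1 || in.len < (size_t)n)
        return 0;
    size_t chunk = in.len / n, pos = in.start;
    for (int i = 0; i < n; i++) {
        /* last piece absorbs the remainder */
        size_t len = (i == n - 1) ? in.start + in.len - pos : chunk;
        out[i] = (WorkUnit){ pos, len };
        pos += len;
    }
    return n;
}
```

Because the helper is data, not syntax, it can travel with the executable and be invoked at whichever stack level performs the re-sizing -- toolchain or runtime -- as the note suggests.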
   4.529 -
   4.530 -Toolchain Layer: 
   4.531 -
   4.532 --] Perform first step of specializing code to hardware
   4.533 -
    4.534 --] May include tools that bring experts on specialization into the process: provide them with visualizations, and let them tune code choices and control the transforms the tools perform
   4.535 -
    4.536 --] Often involves inserting code that performs down-stream specialization, such as an autotuner (which makes the final selection of parameters for data layout, kernel version, etc.)
   4.537 -
   4.538 -Runtime Interface:
   4.539 -
    4.540 --] Provides a standard way for the executable to talk to the runtime. This allows the runtime to be implemented with hardware-specific parallelism-helper instructions (as suggested in part 2) without tying the executable to particular hardware: take three different HW platforms and implement the runtime on each using instructions only that platform has; a single executable then runs unchanged on all three.
   4.541 -
    4.542 --] Responsible for transmitting performance and specialization information from the toolchain layer to the runtime layer. This includes code that manipulates work-unit sizes, data layout, and access patterns, as well as predictors of the execution time of work-units and ways to identify and track the data footprint of work-units.
   4.543 -
   4.544 -Runtime Layer:
   4.545 -
    4.546 --] Contains the implementations of the parallelism constructs that create units of work, and the implementations that enforce constraints when assigning the work to hardware, for animation.
   4.547 -
    4.548 --] Responsible for presenting a consistent interface to the toolchain layer: using specialization information to adjust work-unit sizes, using data-footprint information to choose the hardware needing the least communication, and using execution-time prediction to choose the ordering that overlaps the most communication and saves up ``free'' work to overlap bottlenecks in the constraint graph (dependency graph).
   4.549 -
   4.550 -
   4.551 -Abstraction Interface:
   4.552 -
    4.553 --] Responsible for providing a uniform view of hardware to the runtime, including primitives for check-pointing a context, creating new virtual processors, switching contexts, performing communication, and bundle/unbundle
   4.554 -
   4.555 -
   4.556 -Abstraction Layer:
   4.557 -
    4.558 --] Whichever of the above-mentioned helpers are not implemented directly in HW are implemented in SW in this layer.
   4.559 -
    4.560 --] provides services for the runtime to use, such as communications, creation of virtual processors, memory allocation, and so on
   4.561 -
   4.562 -HW Interface:
   4.563 -
    4.564 --] should provide as much of what the abstraction layer contains as is reasonable. Whether HW implements everything, and so replaces the abstraction layer, or just a set of helpers that simplify and speed up the abstraction layer, depends on how mature the abstraction is and on area/energy/design-time/performance trade-offs.
   4.565 -
   4.566 -
   4.567 -\section{Conclusion}\label{secConclusion}
   4.568 -
   4.569 -
   4.570 -
   4.571 -
   4.572 -\bibliography{Bib_for_papers}
   4.573 -
   4.574 -\end{document}