VMS/0__Writings/kshalle

changeset 63:832c3927511f

perf-tuning -- fixed multi-level SCG expl
author Sean Halle <seanhalle@yahoo.com>
date Sat, 07 Jul 2012 02:39:50 -0700
parents 59a4161e7bf2
children 06073dc28f72
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex 0__Papers/Holistic_Model/Perf_Tune/latex/bib_for_papers_jun_2012.bib 0__Papers/VMS/VMS__Foundation_Paper/VMS__Full_conference_version/latex/VMS__Full_conf_paper.tex
diffstat 3 files changed, 1080 insertions(+), 30 deletions(-) [+]
line diff
     1.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Tue Jul 03 11:27:13 2012 +0200
     1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Sat Jul 07 02:39:50 2012 -0700
     1.3 @@ -104,7 +104,7 @@
     1.4  \section{Background and Related Work}
     1.5  \label{sec:related}
     1.6  
     1.7 -A quick review of the process of performance tuning will provide much needed context for the shortcomings of other tools. %too negative
     1.8 +A quick review of the process of performance tuning will provide much-needed context for the discussion of other tools.
     1.9  
    1.10   Performance tuning is an iterative process that involves a mental model. The programmer takes measurements during execution that are then compared to the desired outcome. A mental model, constructed through experience and knowledge of the mechanics of execution, is used to generate a hypothesis explaining any discrepancies between the measurement and expectations. This hypothesis is then linked, again through a mental model, to things within the programmer's control, to suggest a change to make to the code. The modified code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program.
    1.11  
    1.12 @@ -397,31 +397,44 @@
    1.13  
    1.14  For the example, consider  a server with one rack,  having a back-plane that boards plug into. A board has its own memory with four sockets, each having a chip with four cores. So there is a back-plane network connecting the boards,  a bus on each board that connects the sockets to the DRAM, and inside the chip in each socket is a cache hierarchy that connects the cores.
    1.15  
    1.16 -The hardware is given a set of runtimes to match the hierarchy. Each network or bus has a runtime that schedules work onto the things connected below it. So the top runtime divides work among the boards, while each board has a runtime that divides work among the sockets, and each socket has a runtime that divides work among the cores.  
    1.17 +The hardware is given a set of runtimes to match the hierarchy. Each network or bus has a runtime that schedules work onto the things connected below it. So the top runtime divides work among the boards, while a board's  runtime  divides work among the sockets, and a socket's  runtime  divides work among the cores.  
    1.18  
    1.19 -To a runtime high up, each runtime below it looks like a complete machine. It schedules work-units to these machines, without knowing the internal details of how that machine is implemented. So the runtime at the top handles very large work-units that it schedules onto the boards. The runtime on a board, meanwhile, divides up the work-unit it gets into smaller work-units, and schedules one onto each socket, and so on.
    1.20 +To a runtime high up, each runtime below it looks like a complete machine. It schedules work-units to those machines, without knowing the internal details of how that machine is implemented. So the runtime at the top handles very large work-units that it schedules onto the runtimes on the boards. A board-level runtime  divides up the work-unit it gets into smaller work-units, then schedules one onto each socket's runtime, and so on.
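The recursive division described above can be sketched as follows. This is a hypothetical illustration, not the paper's runtime: the `run` function, the unit "size", and the fan-out of four at each level are all assumptions made for the sketch.

```python
# Hypothetical sketch of hierarchical runtimes: each level splits the
# work-unit it receives and schedules the pieces onto the level below,
# treating each lower runtime as an opaque "machine".

def run(work, levels):
    """work: abstract size of a work-unit; levels: fan-out at each level."""
    if not levels:                      # a leaf: an actual core does the work
        return [work]
    fan_out = levels[0]
    piece = work / fan_out              # divide the unit into smaller units
    leaves = []
    for _ in range(fan_out):            # schedule one piece per lower runtime
        leaves.extend(run(piece, levels[1:]))
    return leaves

# One rack: 4 boards, 4 sockets per board, 4 cores per socket (assumed counts)
leaf_units = run(1024.0, [4, 4, 4])
print(len(leaf_units), leaf_units[0])   # 64 leaf units of size 16.0
```

Each recursive call plays the role of a lower-level runtime: it sees only the unit handed to it, never the division decisions made above.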
    1.21  
    1.22 -The application in this example has been written in a language that allows work to be divided. The toolchain inserted a manipulator that allows each runtime to divide up the work it is given into smaller work-units. This pushed the UCC of the application all the way to the right on the unit axis.  
    1.23 +The application in this example has been written in a language that allows work to be divided. The toolchain inserted a manipulator that allows each runtime to divide up the work it is given into smaller work-units, such as via the DKU pattern \cite{DKUTechRep_09}. This pushed the UCC of the application all the way to the right on the unit axis.
    1.24  
    1.25  So what does the concrete UCC produced during a run look like? Well, a unit is defined as the work resulting from one scheduling decision. Each runtime has its own scheduler, which means units are defined for each runtime.  That in turn means that each runtime has its own concrete UCC! 
    1.26  
    1.27 -Figure X shows that these UCCs are related to each other in the same hierarchy as the runtimes. A unit scheduled in one runtime is broken into smaller units in the one below it. Each of those units is  then separately scheduled, making a complete UCC just for them. So, as the figure shows, a unit in one UCC has an entire UCC below it. 
    1.28 +
    1.29 +\begin{figure}[ht]
    1.30 +  \centering
    1.31 +  \includegraphics[width = 2in, height = 1.8in]{../figures/UCC_levels.pdf}
    1.32 +  \caption{Representation of multiple levels of  UCC.}
    1.33 +  \label{fig:UCC_Levels}
    1.34 +\end{figure}
    1.35 +
    1.36 +Figure \ref{fig:UCC_Levels} shows that these UCCs are related to each other in the same hierarchy as the runtimes. A unit scheduled in one runtime is broken into smaller units in the one below it. Each of those units is  then separately scheduled, making a separate UCC just for them. So, as the figure shows, a unit in one UCC has an entire UCC inside it.  
    1.37  
    1.38   Great, that makes sense, now what about the consequence graphs?
    1.39  
    1.40  \subsubsection{Levels of Consequence Graph}
    1.41  
    1.42 -A consequence graph ties together scheduling decisions made on units with the consequences in the hardware of those decisions. The goal is to charge each segment of time on a physical core to exactly one box in a consequence graph.
    1.43 +A consequence graph ties together scheduling decisions made on units with the consequences in the hardware of those decisions. But there are now multiple levels of consequence graph, one for each UCC. With multiple levels, a lower-level runtime is treated as a single ``core'' by the level above it. So, what does ``consequence'' mean in this case? The answer is that, for performance tuning, the consequence of interest is the critical path.
    1.44  
    1.45 -In the UCCs, for a higher  runtime, each lower runtime it schedules onto is treated as a machine. We saw in Fig X that a unit has an entire  UCC in the level below, so there is a corresponding consequence graph. The UCC states the degrees of scheduling freedom, while the consequence graph shows the hardware consequences resulting from the particular scheduling choices made in the runtime.
    1.46 + That gives two goals: first, to tie the consequences to the critical path; second, to charge each segment of time on an actual physical core to exactly one box in one of the levels of consequence graph.
    1.47 +We note that the critical path for one level is expressed in terms of the work-times of its units, but each of those units now has an entire consequence graph inside it. Hence the work time of a unit is the critical-path time of the consequence graph inside it.
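This recursion -- a unit's work time at one level equals the critical-path time of the consequence graph nested inside it -- can be made concrete with a small sketch. The graph encoding and the timing numbers are assumptions for illustration, not the paper's data structures.

```python
# Hypothetical sketch: a unit's work time is either a leaf time or the
# critical-path time of the consequence graph nested inside it.

def critical_path(graph):
    """graph: {unit: (inner, deps)} where inner is a leaf time (float) or a
    nested graph (dict), and deps lists predecessor units at this level."""
    finish = {}
    def work_time(inner):
        # Recurse when the unit contains a whole lower-level graph.
        return inner if isinstance(inner, float) else critical_path(inner)
    def finish_time(u):
        if u not in finish:
            inner, deps = graph[u]
            start = max((finish_time(d) for d in deps), default=0.0)
            finish[u] = start + work_time(inner)
        return finish[u]
    return max(finish_time(u) for u in graph)

# Two top-level units; the second depends on the first, and its work time
# is the critical path of the graph inside it (3.0 + 4.0 = 7.0).
inner = {"a": (3.0, []), "b": (4.0, ["a"]), "c": (2.0, ["a"])}
print(critical_path({"u1": (5.0, []), "u2": (inner, ["u1"])}))  # 12.0
```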
    1.48  
    1.49 - Now, the question is, what hardware usage should get counted towards one of the units? The answer is: only the time spent on the cores that is used to schedule, do work, and wait for non-overlapped communication of that unit.
    1.50 + Now, the question is, which portions of the physical core time should get counted towards one  higher-level unit? The answer is seen by looking at all the levels in the matrix multiply application, from the story in Section \ref{sec:casestudy}. Going up,  the level above the execution under study involves the invocation   of entire applications, via OS commands. At that level, a unit is an entire process, and the critical path of the SCG in Section  \ref{sec:casestudy} is the work-time of that unit.  That leaves the  time spent inside the OS as the runtime overhead assigned to that unit. 
    1.51  
    1.52 -The time spent on scheduling one of the units is straight-forward, it's the normal runtime overhead of receiving a unit, managing the constraints on it, and choosing the best location and time to execute it.  The only variation is that the location chosen is a lower-level runtime rather than a physical core.
    1.53 +In the other direction, the lower level is the operation of the out-of-order pipelines in the cores, which have the equivalent of a runtime, in hardware. The hardware runtime consists of  the dependency logic that determines which instructions are free, and the issue logic that determines which functional unit performs a free instruction. Hence, a unit is one instruction. The work time is the number of cycles it contributes to the critical path, due to dependencies limiting overlap. And the runtime overhead is the operation of the dependency and issue logic. Those don't contribute \textit{directly} to the critical path, so instructions effectively have no runtime overhead. 
    1.54  
    1.55 -But what core time should be charged as the work of that unit? The answer is: the core time not accounted for in the descendent consequence graphs. Each segment of physical core time can only be charged to one box in one consequence graph, so only the leaf graphs count the actual work. Further, the time spent in the lower runtime spent receiving, handling constraints, and choosing when and where to schedule the sub-units is charged to boxes in the lower-level consequence graph.  By the process of elimination, the only time not accounted for elsewhere is the time spent dividing up a unit into smaller ones, and time spent accumulating the individual results back together. So this is what gets charged to the work-time box for a higher-level unit.
    1.56 +We return now to the question of the core time a higher-level unit uses outside of its sub-units.
    1.57 +Consider, in the matrix multiply code, the core usage spent while dividing the work and handing it to other cores. This is not work of the application, but overhead spent breaking the single application-unit into multiple sub-units. Even though it is in the application code, its purpose is implementing the execution model, which makes it runtime overhead. But which runtime level? It's not part of the SSR language runtime, so it is not overhead of a unit inside the application, but rather it is for the application itself, as a unit! So, the core time spent calculating the division gets counted towards the application-level unit, while the time spent inside the SSR runtime creating the meta-units is counted towards those lower SSR-level units. But both are in the critical path, so both are charged as work time of the higher-level unit.
    1.58  
    1.59 -The last question is how to handle communication consequences. This is tricky because decisions in higher-level runtimes set the context for decisions in lower-level ones. This means a higher-level choice is linked to the consequences from lower-level choices. The value of a consequence graph is due to linking the size of boxes in it to the decisions made by the scheduler, as represented by the shape. It's not clear how to divide, among the levels, the time that cores spend waiting for non-overlapped communication. We have no good answer at the moment and leave it for future work.
    1.60 +
    1.61 + Another way to view this is that  by the process of elimination, the only core-time not accounted for elsewhere is the time spent dividing up a unit into smaller ones, and time spent accumulating the individual results back together. So this is what gets charged to the  higher-level unit.
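This accounting by elimination can be sketched numerically. The helper name and the millisecond figures below are assumptions for illustration only.

```python
# Hypothetical accounting sketch: each segment of physical core time is
# charged to exactly one box in one level's consequence graph, so the
# higher-level unit is charged whatever its descendants did not claim.

def charge_higher_level(total_core_time, lower_level_charges):
    """Time not claimed by any lower-level box (sub-unit work and lower
    runtime overhead) is, by elimination, the divide/accumulate time of
    the higher-level unit."""
    return total_core_time - sum(lower_level_charges)

# Assumed numbers: 100 ms of core time; sub-unit work claims 80 ms, SSR
# runtime overhead claims 12 ms -> 8 ms of dividing and accumulating
# results is charged to the application-level unit.
print(charge_higher_level(100.0, [80.0, 12.0]))  # 8.0
```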
    1.62 +
    1.63 +The last question is how to handle communication consequences, which result from decisions made at all levels. The decisions in higher-level runtimes set the context for decisions in lower-level ones, which links a higher-level choice to the consequences resulting from the lower-level choices. But the value of a consequence graph comes from linking the size of boxes in it to the decisions made by the scheduler. It's not clear how to divide up the time cores spend waiting for non-overlapped communication, to assign portions to different levels. We have no good answer at the moment and leave it for future work.
    1.64  
    1.65  
    1.66  
    1.67 @@ -443,7 +456,7 @@
    1.68  Another benefit evident from the details in this section is that the instrumentation is done only once, for a language. All applications written in the language inherit the visualizations, without any change to the application code.
    1.69  
    1.70  \subsection{Meta-units and unit life-line in the computation model}
    1.71 -
    1.72 +\label{subsec:LifeLine}
    1.73  In preparation for mapping the model onto implementation details, we define a meta-unit and unit life-line. These form the basis for deciding points in the runtime  where data is collected.
    1.74  
    1.75  Every unit has a meta-unit that represents it in the runtime. A  unit is defined as the trace of application code that exists between two scheduling decisions. Looking at this in more detail, every runtime has some form of internal bookkeeping state for a unit, used to track constraints on it and make decisions about when and where to execute. This exists even if that state is just a pointer to a function that sits in a queue. We call this bookkeeping state for a unit the meta-unit.
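The bookkeeping state just described can be pictured as a minimal structure. The class and field names here are assumptions for the sketch, not the paper's implementation; the point is only that a meta-unit tracks a unit's pending constraints and the placement decision made on it.

```python
# Hypothetical sketch of a meta-unit: the runtime's bookkeeping handle for
# a unit, tracking its constraints and the scheduling decision made on it.
from dataclasses import dataclass, field

@dataclass
class MetaUnit:
    work_fn: callable                 # pointer to the unit's application code
    unresolved: set = field(default_factory=set)  # constraints still pending
    placed_on: object = None          # chosen location: core or lower runtime

    def ready(self):
        # Schedulable once all constraints on the unit are resolved.
        return not self.unresolved

mu = MetaUnit(work_fn=lambda: 42, unresolved={"input_x"})
mu.unresolved.discard("input_x")     # a constraint is satisfied
print(mu.ready())                    # True
```

In the degenerate case mentioned in the text, the whole structure collapses to just `work_fn` sitting in a queue.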
     2.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     2.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/bib_for_papers_jun_2012.bib	Sat Jul 07 02:39:50 2012 -0700
     2.3 @@ -0,0 +1,942 @@
     2.4 +
     2.5 +@inbook{PerfToolPoem,
     2.6 +title = {The Poems of John Godfrey Saxe, Complete edition},
     2.7 +chapter = {The Blind Men and the Elephant},
     2.8 +author = {John Godfrey Saxe},
     2.9 +publisher = {Boston: James R. Osgood and Company},
    2.10 +year = {1873},
    2.11 +pages = {77-78}
    2.12 +}
    2.13 +@article{PerfToolTau,
    2.14 +author = {Shende, Sameer S. and Malony, Allen D.},
    2.15 +title = {The Tau Parallel Performance System},
    2.16 +volume = {20},
    2.17 +number = {2},
    2.18 +pages = {287-311},
     2.19 +year = {2006},
    2.20 +journal = {International Journal of High Performance Computing Applications}
    2.21 +}
    2.22 +@ARTICLE{PerfToolParadyn,
    2.23 +author={Miller, B.P. and Callaghan, M.D. and Cargille, J.M. and Hollingsworth, J.K. and Irvin, R.B. and Karavanic, K.L. and Kunchithapadam, K. and Newhall, T.},
    2.24 +journal={Computer},
    2.25 +title={The Paradyn parallel performance measurement tool},
    2.26 +year={1995},
    2.27 +month={nov},
    2.28 +volume={28},
    2.29 +number={11},
     2.30 +pages={37--46},
    2.31 +}
    2.32 +@ARTICLE{PerfToolParagraph,
    2.33 +author={Heath, M.T. and Etheridge, J.A.},
    2.34 +journal={Software, IEEE},
    2.35 +title={Visualizing the performance of parallel programs},
    2.36 +year={1991},
    2.37 +month={sept. },
    2.38 +volume={8},
    2.39 +number={5},
     2.40 +pages={29--39},
    2.41 +}
    2.42 +@article{PerfToolStarSs,
    2.43 +  author    = {Steffen Brinkmann and
    2.44 +               Jos{\'e} Gracia and
    2.45 +               Christoph Niethammer and
    2.46 +               Rainer Keller},
    2.47 +  title     = {TEMANEJO - a debugger for task based parallel programming
    2.48 +               models},
    2.49 +  journal   = {CoRR},
    2.50 +  volume    = {abs/1112.4604},
    2.51 +  year      = {2011},
    2.52 +}
    2.53 +@techrep{SyncConstr_impl_w_distr_coherence_HW_Utah_96,
    2.54 +	author = {Carter, J. B. and Kuo, C.-C. and Kuramkote, R.},
    2.55 +	title = { A comparison of software and hardware synchronization mechanisms for distributed shared memory multiprocessors},
    2.56 +	institution = {University of Utah, Salt Lake City, UT},
    2.57 +	year = 1996,
    2.58 +	url = {http://www.cs.utah.edu/research/techreports/1996/pdf/UUCS-96-011.pdf},
    2.59 +	number = {UUCS-96-011}
    2.60 +}
    2.61 +@Article{SWCoherence_Hill_SW_for_shared_coherence_w_HW_support_93,
    2.62 +	author = {Hill, Mark D. and Larus, James R. and Reinhardt, Steven K. and Wood, David A.},
    2.63 +	title = {Cooperative shared memory: software and hardware for scalable multiprocessors},
    2.64 +	journal = {ACM Trans. Comput. Syst.},
    2.65 +	volume = 11,
    2.66 +	number = 4,
    2.67 +	year = 1993,
    2.68 +	pages = {300--318}
    2.69 +}
    2.70 +@InProceedings{SWCache_MIT_embedSW_manages_cache_w_HW_supp,
    2.71 +	author = {Chiou, Derek and Jain, Prabhat and Rudolph, Larry and Devadas, Srinivas},
    2.72 +	title = {Application-specific memory management for embedded systems using software-controlled caches},
    2.73 +	booktitle = {DAC},
    2.74 +	year = 2000,
    2.75 +	pages = {416--419}
    2.76 +}
    2.77 +@InProceedings{SWCache_instr_trig_HW_supp_04,
    2.78 +	author = {Janapsatya, Andhi and Parameswaran, Sri and Ignjatovic, A.},
    2.79 +	title = {Hardware/software managed scratchpad memory for embedded system},
    2.80 +	booktitle = {Proceedings of the 2004 IEEE/ACM International conference on Computer-aided design},
    2.81 +	series = {ICCAD '04},
    2.82 +	year = 2004,
    2.83 +	pages = {370--377}
    2.84 +}
    2.85 +@InProceedings{SWCache_arch_supp_OS_policy_06,
    2.86 +	author = {Rafique, Nauman and Lim, Won-Taek and Thottethodi, Mithuna},
    2.87 +	title = {Architectural support for operating system-driven CMP cache management},
    2.88 +	booktitle = {Proceedings of the 15th international conference on Parallel architectures and compilation techniques},
    2.89 +	series = {PACT '06},
    2.90 +	year = 2006,
    2.91 +	pages = {2--12}
    2.92 +}
    2.93 +@InProceedings{SWCoherence_on_Distr_Mem_90,
    2.94 +	author = {Bennett, J.K. and Carter, J.B. and Zwaenepoel, W.},
    2.95 +	booktitle = {Computer Architecture, 1990. Proceedings., 17th Annual International Symposium on},
    2.96 +	title = {Adaptive software cache management for distributed shared memory architectures},
    2.97 +	year = 1990,
     2.98 +	pages = {125--134}
    2.99 +}
   2.100 +@InProceedings{Charm_runtime_opt_10,
   2.101 +	author = {Mei, Chao and Zheng, Gengbin and Gioachin, Filippo and Kal{\'e}, Laxmikant V.},
   2.102 +	title = {Optimizing a parallel runtime system for multicore clusters: a case study},
   2.103 +	booktitle = {The 2010 TeraGrid Conference},
   2.104 +	year = 2010,
   2.105 +	pages = {12:1--12:8}
   2.106 +}
   2.107 +@InProceedings{TCC_Hammond_ISCA_04,
    2.108 +	author = {Hammond, Lance and others},
    2.109 +	title = {Transactional Memory Coherence and Consistency},
    2.110 +	series = {ISCA '04},
    2.111 +	pages = {102--},
    2.112 +	booktitle = {Proceedings of the 31st Annual International Symposium on Computer Architecture},
    2.113 +	year = {2004}
   2.114 +}
   2.115 +@Misc{WorkTableHome,
   2.116 +	author = {Halle, Sean},
   2.117 +	note = {http://musictwodotoh.com/worktable/content/refman.pdf},
   2.118 +	title = {The WorkTable Language Reference Manual},
   2.119 +	year = 2012
   2.120 +}
   2.121 +@Misc{HWSimHome,
   2.122 +	author = {Halle, Sean and Hausers, Stefan},
   2.123 +	note = {http://musictwodotoh.com/hwsim/content/refman.pdf},
   2.124 +	title = {The HWSim Language Reference Manual},
   2.125 +	year = 2012
   2.126 +}
   2.127 +@Article{Lamport78,
   2.128 +	author = {Lamport, Leslie},
   2.129 +	title = {Time, clocks, and the ordering of events in a distributed system},
   2.130 +	journal = {Commun. ACM},
   2.131 +	volume = 21,
   2.132 +	issue = 7,
   2.133 +	year = 1978,
   2.134 +	pages = {558--565}
   2.135 +}
   2.136 +@Article{Lamport87,
   2.137 +	author = {Lamport, Leslie},
   2.138 +	title = {A fast mutual exclusion algorithm},
   2.139 +	journal = {ACM Trans. Comput. Syst.},
   2.140 +	volume = 5,
   2.141 +	issue = 1,
   2.142 +	year = 1987,
   2.143 +	pages = {1--11}
   2.144 +}
   2.145 +@InProceedings{Dijkstra67,
   2.146 +	author = {Dijkstra, Edsger W.},
   2.147 +	title = {The structure of the "{THE}"-multiprogramming system},
   2.148 +	booktitle = {Proceedings of the first ACM symposium on Operating System Principles},
   2.149 +	series = {SOSP '67},
   2.150 +	year = 1967,
   2.151 +	pages = {10.1--10.6}
   2.152 +}
   2.153 +@Article{Conway63,
   2.154 +	author = {Conway, Melvin E.},
   2.155 +	title = {Design of a separable transition-diagram compiler},
   2.156 +	journal = {Commun. ACM},
   2.157 +	volume = 6,
   2.158 +	issue = 7,
   2.159 +	year = 1963,
   2.160 +	pages = {396--408}
   2.161 +}
   2.162 +@Book{ComponentModel00,
    2.163 +	editor = {Gary T. Leavens and Murali Sitaraman},
   2.164 +	title = {Foundations of Component-Based Systems},
   2.165 +	publisher = {Cambridge University Press},
   2.166 +	year = 2000
   2.167 +}
   2.168 +@Misc{Hewitt10,
   2.169 +	author = {Carl Hewitt},
   2.170 +	title = {Actor Model of Computation},
   2.171 +	year = 2010,
   2.172 +	note = {http://arxiv.org/abs/1008.1459}
   2.173 +}
   2.174 +@Article{Actors97,
   2.175 +	author = {Agha,G. and Mason,I. and Smith,S. and Talcott,C.},
   2.176 +	title = {A foundation for actor computation},
   2.177 +	journal = {Journal of Functional Programming},
   2.178 +	volume = 7,
   2.179 +	number = 01,
   2.180 +	pages = {1-72},
   2.181 +	year = 1997
   2.182 +}
   2.183 +@Article{SchedActivations,
   2.184 +	author = {Anderson, Thomas E. and Bershad, Brian N. and Lazowska, Edward D. and Levy, Henry M.},
   2.185 +	title = {Scheduler activations: effective kernel support for the user-level management of parallelism},
   2.186 +	journal = {ACM Trans. Comput. Syst.},
   2.187 +	volume = 10,
   2.188 +	issue = 1,
   2.189 +	month = {February},
   2.190 +	year = 1992,
   2.191 +	pages = {53--79}
   2.192 +}
   2.193 +@InProceedings{BOMinManticore,
   2.194 +	author = {Fluet, Matthew and Rainey, Mike and Reppy, John and Shaw, Adam and Xiao, Yingqi},
   2.195 +	title = {Manticore: a heterogeneous parallel language},
   2.196 +	booktitle = {Proceedings of the 2007 workshop on Declarative aspects of multicore programming},
   2.197 +	series = {DAMP '07},
   2.198 +	year = 2007,
   2.199 +	pages = {37--44},
   2.200 +	numpages = 8
   2.201 +}
   2.202 +@TechReport{GainFromChaos_Halle_92,
   2.203 +	author = {Halle, K.S. and Chua, Leon O. and Anishchenko, V.S. and Safonova, M.A.},
   2.204 +	title = {Signal Amplification via Chaos: Experimental Evidence},
   2.205 +	institution = {EECS Department, University of California, Berkeley},
   2.206 +	year = 1992,
   2.207 +	url = {http://www.eecs.berkeley.edu/Pubs/TechRpts/1992/2223.html},
   2.208 +	number = {UCB/ERL M92/130}
   2.209 +}
   2.210 +@InProceedings{HotPar10_w_BLIS,
   2.211 +	author = {Sean Halle and Albert Cohen},
   2.212 +	booktitle = {HOTPAR '10: USENIX Workshop on Hot Topics in Parallelism},
   2.213 +	month = {June},
   2.214 +	title = {Leveraging Semantics Attached to Function Calls to Isolate Applications from Hardware},
   2.215 +	year = 2010
   2.216 +}
   2.217 +@InProceedings{HotPar11_w_Stack,
   2.218 +	author = {Sean Halle and Albert Cohen},
   2.219 +	booktitle = {HOTPAR '11: USENIX Workshop on Hot Topics in Parallelism},
   2.220 +	month = {May},
   2.221 +	title = {},
   2.222 +	year = 2011
   2.223 +}
   2.224 +@Article{VMS_LCPC_11,
   2.225 +	author = {Sean Halle and Albert Cohen},
   2.226 +	title = {A Mutable Hardware Abstraction to Replace Threads},
   2.227 +	journal = {24th International Workshop on Languages and Compilers for Parallel Languages (LCPC11)},
   2.228 +	year = 2011
   2.229 +}
   2.230 +@Misc{StackTechRep_10,
   2.231 +	author = {Halle, Sean and Nadezhkin, Dmitry and Cohen, Albert},
   2.232 +	note = {http://www.soe.ucsc.edu/share/technical-reports/2010/ucsc-soe-10-02.pdf},
   2.233 +	title = {A Framework to Support Research on Portable High Performance Parallelism},
   2.234 +	year = 2010
   2.235 +}
   2.236 +@Misc{CTBigStepSemTechRep_06,
   2.237 +	author = {Halle, Sean},
   2.238 +	note = {http://www.soe.ucsc.edu/share/technical-reports/2006/ucsc-crl-06-11.pdf},
   2.239 +	title = {The Big-Step Operational Semantics of CodeTime Circuits},
   2.240 +	year = 2006
   2.241 +}
   2.242 +@Misc{MentalFrameworkTechRep_06,
   2.243 +	author = {Halle, Sean},
   2.244 +	note = {http://www.soe.ucsc.edu/share/technical-reports/2006/ucsc-crl-06-12.pdf},
   2.245 +	title = {A Mental Framework for use in Creating Hardware Independent Parallel Languages},
   2.246 +	year = 2006
   2.247 +}
   2.248 +@Misc{DKUTechRep_09,
   2.249 +	author = {Halle, Sean and Cohen, Albert},
   2.250 +	note = {http://www.soe.ucsc.edu/share/technical-reports/2009/ucsc-soe-09-06.pdf},
   2.251 +	title = {DKU Pattern for Performance Portable Parallel Software},
   2.252 +	year = 2009
   2.253 +}
   2.254 +@Misc{EQNLangTechRep,
   2.255 +	author = {Halle, Sean},
   2.256 +	note = {http://www.soe.ucsc.edu/share/technical-reports/2009/ucsc-soe-09-16.pdf},
   2.257 +	title = {An Extensible Parallel Language},
   2.258 +	year = 2009
   2.259 +}
   2.260 +@Misc{CTOSTechRep,
   2.261 +	author = {Halle, Sean},
   2.262 +	note = {http://www.soe.ucsc.edu/share/technical-reports/2009/ucsc-soe-09-15.pdf},
    2.263 +	title = {A Hardware-Independent Parallel Operating System Abstraction Layer},
   2.264 +	year = 2009
   2.265 +}
   2.266 +@Misc{SideEffectsTechRep,
   2.267 +	author = {Halle, Sean and Cohen, Albert},
   2.268 +	note = {http://www.soe.ucsc.edu/share/technical-reports/2009/ucsc-soe-09-14.pdf},
   2.269 +	title = {Parallel Language Extensions for Side Effects},
   2.270 +	year = 2009
   2.271 +}
   2.272 +@Misc{BaCTiLTechRep,
   2.273 +	author = {Halle, Sean},
   2.274 +	note = {http://www.soe.ucsc.edu/share/technical-reports/2006/ucsc-crl-06-08.pdf},
   2.275 +	title = {BaCTiL: Base CodeTime Language},
   2.276 +	year = 2006
   2.277 +}
   2.278 +@Misc{CTPlatformTechRep,
   2.279 +	author = {Halle, Sean},
   2.280 +	note = {http://www.soe.ucsc.edu/share/technical-reports/2006/ucsc-crl-06-09.pdf},
   2.281 +	title = {The Elements of the CodeTime Software Platform},
   2.282 +	year = 2006
   2.283 +}
   2.284 +@Misc{CTRTTechRep,
   2.285 +	author = {Halle, Sean},
   2.286 +	note = {http://www.soe.ucsc.edu/share/technical-reports/2006/ucsc-crl-06-10.pdf},
   2.287 +	title = {A Scalable and Efficient Peer-to-Peer Run-Time System for a Hardware Independent Software Platform},
   2.288 +	year = 2006
   2.289 +}
   2.290 +@Misc{CIPTechRep,
   2.291 +	author = {Halle, Sean},
   2.292 +	note = {http://www.soe.ucsc.edu/share/technical-reports/2005/ucsc-crl-05-05.pdf},
   2.293 +	title = {The Case for an Integrated Software Platform for HEC Illustrated Using the CodeTime Platform},
   2.294 +	year = 2005
   2.295 +}
   2.296 +@Misc{Halle2008,
   2.297 +	author = {Sean Halle and Albert Cohen},
   2.298 +	note = {http://omp.musictwodotoh.com},
   2.299 +	title = {{DKU} infrastructure server}
   2.300 +}
   2.301 +@Misc{DKUSourceForge,
   2.302 +	author = {Sean Halle and Albert Cohen},
   2.303 +	month = {November},
   2.304 +	note = {http://dku.sourceforge.net},
   2.305 +	title = {{DKU} website},
   2.306 +	year = 2008
   2.307 +}
   2.308 +@Misc{BLISHome,
   2.309 +	author = {Sean Halle and Albert Cohen},
   2.310 +	month = {November},
   2.311 +	note = {http://blisplatform.sourceforge.net},
   2.312 +	title = {{BLIS} website},
   2.313 +	year = 2008
   2.314 +}
   2.315 +@Misc{VMSHome,
   2.316 +	author = {Sean Halle and Merten Sach and Ben Juurlink and Albert Cohen},
   2.317 +	note = {http://virtualizedmasterslave.org},
   2.318 +	title = {{VMS} Home Page},
   2.319 +	year = 2010
   2.320 +}
   2.321 +@Misc{PStackHome,
   2.322 +	author = {Sean Halle},
   2.323 +	note = {http://pstack.sourceforge.net},
   2.324 +	title = {{PStack} Home Page},
   2.325 +	year = 2012
   2.326 +}
   2.327 +@Misc{DeblockingCode,
   2.328 +	note = {http://dku.svn.sourceforge.net/viewvc/dku/branches/DKU\_C\_\_Deblocking\_\_orig/},
   2.329 +	title = {{DKU-ized Deblocking Filter} code}
   2.330 +}
   2.331 +@Misc{SampleBLISCode,
   2.332 +	note = {http://dku.sourceforge.net/SampleCode.htm},
   2.333 +	title = {{Sample BLIS Code}}
   2.334 +}
   2.335 +@Misc{OMPHome,
   2.336 +	note = {http://www.openmediaplatform.eu/},
   2.337 +	title = {{Open Media Platform} homepage}
   2.338 +}
   2.339 +@Misc{MapReduceHome,
   2.340 +	author = {Google Corp.},
   2.341 +	note = {http://labs.google.com/papers/mapreduce.html},
   2.342 +	title = {{MapReduce} Home page}
   2.343 +}
   2.344 +@Misc{TBBHome,
   2.345 +	author = {Intel Corp.},
   2.346 +	note = {http://www.threadingbuildingblocks.org},
   2.347 +	title = {{TBB} Home page}
   2.348 +}
   2.349 +@Misc{HPFWikipedia,
   2.350 +	author = {Wikipedia},
    2.351 +	note = {http://en.wikipedia.org/wiki/High\_Performance\_Fortran},
   2.352 +	title = {{HPF} wikipedia page}
   2.353 +}
   2.354 +@Misc{OpenMPHome,
   2.355 +	author = {{OpenMP} organization},
   2.356 +	note = {http://www.openmp.org},
   2.357 +	title = {{OpenMP} Home page}
   2.358 +}
   2.359 +@Misc{MPIHome,
   2.360 +	author = {open-mpi organization},
   2.361 +	note = {http://www.open-mpi.org},
   2.362 +	title = {{Open MPI} Home page}
   2.363 +}
   2.364 +@Misc{OpenCLHome,
   2.365 +	author = {Kronos Group},
   2.366 +	note = {http://www.khronos.org/opencl},
   2.367 +	title = {{OpenCL} Home page}
   2.368 +}
   2.369 +@Misc{CILKHome,
   2.370 +	author = {Cilk group at MIT},
   2.371 +	note = {http://supertech.csail.mit.edu/cilk/},
   2.372 +	title = {{CILK} homepage}
   2.373 +}
   2.374 +@InProceedings{Fri98,
   2.375 +	author = {M. Frigo and C. E. Leiserson and K. H. Randall},
   2.376 +	title = {The Implementation of the Cilk-5 Multithreaded Language},
   2.377 +	booktitle = {PLDI '98: Proceedings of the 1998 ACM SIGPLAN conference on Programming language design and implementation},
   2.378 +	pages = {212--223},
   2.379 +	year = 1998,
   2.380 +	address = {Montreal, Quebec},
   2.381 +	month = jun
   2.382 +}
   2.383 +@Misc{TitaniumHome,
   2.384 +	note = {http://titanium.cs.berkeley.edu},
   2.385 +	title = {{Titanium} homepage}
   2.386 +}
   2.387 +@InProceedings{CnCInHotPar,
   2.388 +	author = {Knobe, Kathleen},
   2.389 +	booktitle = {HOTPAR '09: USENIX Workshop on Hot Topics in Parallelism},
   2.390 +	title = {Ease of Use with Concurrent Collections {(CnC)}},
   2.391 +	year = 2009
   2.392 +}
   2.393 +@Misc{CnCHome,
   2.394 +	author = {Intel Corp.},
   2.395 +	note = {http://software.intel.com/en-us/articles/intel-concurrent-collections-for-cc/},
   2.396 +	title = {{CnC} homepage}
   2.397 +}
   2.398 +@Misc{SpiralHome,
   2.399 +	author = {Spiral Group at CMU},
   2.400 +	note = {http://www.spiral.net},
   2.401 +	title = {{Spiral} homepage}
   2.402 +}
   2.403 +@Misc{ScalaHome,
   2.404 +	author = {Scala organization},
   2.405 +	note = {http://www.scala-lang.org/},
   2.406 +	title = {{Scala} homepage}
   2.407 +}
   2.408 +@Misc{UPCHome,
   2.409 +	author = {UPC group at UC Berkeley},
   2.410 +	note = {http://upc.lbl.gov/},
   2.411 +	title = {{Unified Parallel C} homepage}
   2.412 +}
   2.413 +@Misc{SuifHome,
   2.414 +	note = {http://suif.stanford.edu},
   2.415 +	title = {{Suif} Parallelizing compiler homepage}
   2.416 +}
   2.417 +@Article{SEJITS,
   2.418 +	author = {B. Catanzaro and S. Kamil and Y. Lee and K. Asanovic and J. Demmel and K. Keutzer and J. Shalf and K. Yelick and A. Fox},
   2.419 +	title = {SEJITS: Getting Productivity AND Performance With Selective Embedded JIT Specialization},
   2.420 +	journal = {First Workshop on Programmable Models for Emerging Architecture at the 18th International Conference on Parallel Architectures and Compilation Techniques },
   2.421 +	year = 2009
   2.422 +}
   2.423 +@InProceedings{Arnaldo3D,
   2.424 +	author = {Azevedo, Arnaldo and Meenderinck, Cor and Juurlink, Ben and Terechko, Andrei and Hoogerbrugge, Jan and Alvarez, Mauricio and Ramirez, Alex},
   2.425 +	title = {Parallel H.264 Decoding on an Embedded Multicore Processor},
   2.426 +	booktitle = {HiPEAC '09: Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers},
   2.427 +	year = 2009,
   2.428 +	pages = {404--418}
   2.429 +}
   2.430 +@Article{NarayananGPUSched,
   2.431 +	author = {Narayanan Sundaram and Anand Raghunathan and Srimat T. Chakradhar},
   2.432 +	title = {A framework for efficient and scalable execution of domain-specific templates on GPUs},
   2.433 +	journal = {International Parallel and Distributed Processing Symposium {(IPDPS)}},
   2.434 +	year = 2009,
   2.435 +	pages = {1-12}
   2.436 +}
   2.437 +@InProceedings{PolyForGPU,
   2.438 +	author = {Baskaran, Muthu Manikandan and Bondhugula, Uday and Krishnamoorthy, Sriram and Ramanujam, J. and Rountev, Atanas and Sadayappan, P.},
   2.439 +	title = {A compiler framework for optimization of affine loop nests for gpgpus},
   2.440 +	booktitle = {ICS '08: Proceedings of the 22nd annual international conference on Supercomputing},
   2.441 +	year = 2008,
   2.442 +	pages = {225--234}
   2.443 +}
   2.444 +@InProceedings{Loulou08,
   2.445 +	author = {Pouchet, Louis-No\"{e}l and Bastoul, C\'{e}dric and Cohen, Albert and Cavazos, John},
   2.446 +	title = {Iterative optimization in the polyhedral model: part ii, multidimensional time},
   2.447 +	booktitle = {ACM SIGPLAN conference on Programming language design and implementation {(PLDI)} },
   2.448 +	year = 2008,
   2.449 +	pages = {90--100}
   2.450 +}
   2.451 +@InProceedings{MergeInHotPar,
   2.452 +	author = {Michael D. Linderman and James Balfour and Teresa H. Meng and William J. Dally},
   2.453 +	booktitle = {HOTPAR '09: USENIX Workshop on Hot Topics in Parallelism},
   2.454 +	month = {March},
    2.455 +	title = {Embracing Heterogeneity - Parallel Programming for Changing Hardware},
   2.456 +	year = 2009
   2.457 +}
   2.458 +@InProceedings{GaloisRef,
   2.459 +	author = {Kulkarni, Milind and Pingali, Keshav and Walter, Bruce and Ramanarayanan, Ganesh and Bala, Kavita and Chew, L. Paul},
   2.460 +	title = {Optimistic parallelism requires abstractions},
   2.461 +	booktitle = {PLDI '07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation},
   2.462 +	year = 2007,
   2.463 +	pages = {211--222}
   2.464 +}
   2.465 +@Book{Allen2002,
   2.466 +	author = {Kennedy, Ken and Allen, John R.},
   2.467 +	title = {Optimizing compilers for modern architectures: a dependence-based approach},
   2.468 +	year = 2002,
   2.469 +	publisher = {Morgan Kaufmann Publishers Inc.}
   2.470 +}
   2.471 +@Misc{Stephens95,
   2.472 +	author = {R. Stephens},
   2.473 +	title = {A Survey Of Stream Processing},
   2.474 +	year = 1995
   2.475 +}
   2.476 +@InProceedings{Palatin06,
   2.477 +	author = {P Palatin and Y Lhuillier and O Temam},
    2.478 +	title = {CAPSULE: Hardware-assisted parallel execution of component-based programs},
    2.479 +	booktitle = {Proceedings of the 39th Annual International Symposium on Microarchitecture},
   2.480 +	year = 2006,
   2.481 +	pages = {247--258}
   2.482 +}
   2.483 +@InProceedings{Sequioa06,
    2.484 +	author = {Fatahalian, Kayvon and Horn, Daniel Reiter and Knight, Timothy J. and Leem, Larkhoon and Houston, Mike and Park, Ji Young and Erez, Mattan and Ren, Manman and Aiken, Alex and Dally, William J. and Hanrahan, Pat},
   2.485 +	title = {Sequoia: programming the memory hierarchy},
   2.486 +	booktitle = {SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing},
   2.487 +	year = 2006,
   2.488 +	pages = 83
   2.489 +}
   2.490 +@Book{Cole89,
   2.491 +	author = {M Cole},
   2.492 +	title = {Algorithmic skeletons: Structured management of parallel computation},
   2.493 +	publisher = {Pitman},
   2.494 +	year = 1989
   2.495 +}
   2.496 +@InProceedings{Ginhac98,
   2.497 +	author = {Dominique Ginhac and Jocelyn Serot and Jean Pierre Derutin},
   2.498 +	title = {Fast prototyping of image processing applications using functional skeletons on a MIMD-DM architecture},
    2.499 +	booktitle = {IAPR Workshop on Machine Vision and Applications},
   2.500 +	year = 1998,
   2.501 +	pages = {468--471}
   2.502 +}
   2.503 +@InProceedings{Serot08MetaParallel,
   2.504 +	author = {Serot, Jocelyn and Falcou, Joel},
   2.505 +	title = {Functional Meta-programming for Parallel Skeletons},
   2.506 +	booktitle = {ICCS '08: Proceedings of the 8th international conference on Computational Science, Part I},
   2.507 +	year = 2008,
   2.508 +	pages = {154--163}
   2.509 +}
   2.510 +@InProceedings{Darlington93,
   2.511 +	author = {J. Darlington and A. J. Field and P. G. Harrison and P. H. J. Kelly and D. W. N. Sharp and Q. Wu},
   2.512 +	title = {Parallel programming using skeleton functions},
   2.513 +	booktitle = {},
   2.514 +	year = 1993,
   2.515 +	pages = {146--160},
   2.516 +	publisher = {Springer-Verlag}
   2.517 +}
   2.518 +@Article{Asanovic06BerkeleyView,
    2.519 +	title = {{The landscape of parallel computing research: A view from Berkeley}},
   2.520 +	author = {Asanovic, K. and Bodik, R. and Catanzaro, B.C. and Gebis, J.J. and Husbands, P. and Keutzer, K. and Patterson, D.A. and Plishker, W.L. and Shalf, J. and Williams, S.W. and others},
   2.521 +	journal = {Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS-2006-183, December},
   2.522 +	volume = 18,
   2.523 +	number = {2006-183},
   2.524 +	pages = 19,
   2.525 +	year = 2006
   2.526 +}
   2.527 +@Misc{BerkeleyPattLang,
   2.528 +	note = {http://parlab.eecs.berkeley.edu/wiki/patterns},
   2.529 +	title = {{Berkeley Pattern Language}}
   2.530 +}
   2.531 +@Book{Mattson04Patterns,
   2.532 +	title = {{Patterns for parallel programming}},
   2.533 +	author = {Mattson, T. and Sanders, B. and Massingill, B.},
   2.534 +	year = 2004,
   2.535 +	publisher = {Addison-Wesley Professional}
   2.536 +}
   2.537 +@Article{Skillicorn98,
   2.538 +	title = {{Models and languages for parallel computation}},
   2.539 +	author = {Skillicorn, D.B. and Talia, D.},
   2.540 +	journal = {ACM Computing Surveys (CSUR)},
   2.541 +	volume = 30,
   2.542 +	number = 2,
   2.543 +	pages = {123--169},
   2.544 +	year = 1998
   2.545 +}
   2.546 +@Conference{Blelloch93NESL,
   2.547 +	title = {{Implementation of a portable nested data-parallel language}},
   2.548 +	author = {Blelloch, G.E. and Hardwick, J.C. and Chatterjee, S. and Sipelstein, J. and Zagha, M.},
   2.549 +	booktitle = {Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming},
   2.550 +	pages = {102--111},
   2.551 +	year = 1993,
   2.552 +	organization = {ACM New York, NY, USA}
   2.553 +}
   2.554 +@Article{McgrawSisal,
   2.555 +	title = {{SISAL: Streams and iteration in a single assignment language: Reference manual version 1.2}},
   2.556 +	author = {McGraw, J. and Skedzielewski, SK and Allan, SJ and Oldehoeft, RR and Glauert, J. and Kirkham, C. and Noyce, B. and Thomas, R.},
   2.557 +	journal = {Manual M-146, Rev},
   2.558 +	volume = 1
   2.559 +}
   2.560 +@Article{Gelernter85Linda,
   2.561 +	title = {{Generative communication in Linda}},
   2.562 +	author = {Gelernter, D.},
   2.563 +	journal = {ACM Transactions on Programming Languages and Systems (TOPLAS)},
   2.564 +	volume = 7,
   2.565 +	number = 1,
   2.566 +	pages = {80--112},
   2.567 +	year = 1985
   2.568 +}
   2.569 +@Article{Lin94ZPL,
   2.570 +	title = {{ZPL: An array sublanguage}},
   2.571 +	author = {Lin, C. and Snyder, L.},
   2.572 +	journal = {Lecture Notes in Computer Science},
   2.573 +	volume = 768,
   2.574 +	pages = {96--114},
   2.575 +	year = 1994
   2.576 +}
   2.577 +@Article{baecker97,
   2.578 +	author = {Ron Baecker and Chris DiGiano and Aaron Marcus},
   2.579 +	title = {Software visualization for debugging},
   2.580 +	journal = {Communications of the ACM},
   2.581 +	volume = 40,
   2.582 +	number = 4,
   2.583 +	year = 1997,
   2.584 +	issn = {0001-0782},
   2.585 +	pages = {44--54},
   2.586 +	publisher = {ACM Press}
   2.587 +}
   2.588 +@Article{ball96,
   2.589 +	author = {T. A. Ball and S. G. Eick},
   2.590 +	title = {Software Visualization in the Large},
   2.591 +	journal = {IEEE Computer},
   2.592 +	volume = 29,
   2.593 +	number = 4,
   2.594 +	year = 1996,
   2.595 +	month = {apr},
   2.596 +	pages = {33--43}
   2.597 +}
   2.598 +@Book{berry89,
   2.599 +	title = {{The chemical abstract machine}},
   2.600 +	author = {Berry, G. and Boudol, G.},
   2.601 +	year = 1989,
   2.602 +	publisher = {ACM Press}
   2.603 +}
   2.604 +@Article{blumofe95,
   2.605 +	author = {Robert D. Blumofe and Christopher F. Joerg and Bradley C. Kuszmaul and Charles E. Leiserson and Keith H. Randall and Yuli Zhou},
   2.606 +	title = {Cilk: an efficient multithreaded runtime system},
   2.607 +	journal = {SIGPLAN Not.},
   2.608 +	volume = 30,
   2.609 +	number = 8,
   2.610 +	year = 1995,
   2.611 +	pages = {207--216}
   2.612 +}
   2.613 +@Article{burch90,
    2.614 +	title = {{Symbolic model checking: $10^{20}$ states and beyond}},
   2.615 +	author = {Burch, JR and Clarke, EM and McMillan, KL and Dill, DL and Hwang, LJ},
   2.616 +	journal = {Logic in Computer Science, 1990. LICS'90, Proceedings},
   2.617 +	pages = {428--439},
   2.618 +	year = 1990
   2.619 +}
   2.620 +@Article{chamberlain98,
   2.621 +	author = {B. Chamberlain and S. Choi and E. Lewis and C. Lin and L. Snyder and W. Weathersby},
   2.622 +	title = {ZPL's WYSIWYG Performance Model},
   2.623 +	journal = {hips},
   2.624 +	volume = 00,
   2.625 +	year = 1998,
   2.626 +	isbn = {0-8186-8412-7},
   2.627 +	pages = 50
   2.628 +}
   2.629 +@Article{church41,
   2.630 +	author = {A. Church},
   2.631 +	title = {The Calculi of Lambda-Conversion},
   2.632 +	journal = {Annals of Mathematics Studies},
   2.633 +	number = 6,
   2.634 +	year = 1941,
   2.635 +	publisher = {Princeton University}
   2.636 +}
   2.637 +@Misc{CodeTimeSite,
   2.638 +	author = {Sean Halle},
   2.639 +	key = {CodeTime},
   2.640 +	title = {Homepage for The CodeTime Parallel Software Platform},
   2.641 +	note = {{\ttfamily http://codetime.sourceforge.net}}
   2.642 +}
   2.643 +@Misc{CodeTimePlatform,
   2.644 +	author = {Sean Halle},
   2.645 +	key = {CodeTime},
   2.646 +	title = {The CodeTime Parallel Software Platform},
   2.647 +	note = {{\ttfamily http://codetime.sourceforge.net/content/CodeTime\_Platform.pdf}}
   2.648 +}
   2.649 +@Misc{CodeTimeVS,
   2.650 +	author = {Sean Halle},
   2.651 +	key = {CodeTime},
   2.652 +	title = {The Specification of the CodeTime Platform's Virtual Server},
   2.653 +	note = {{\ttfamily http://codetime.sourceforge.net/content/CodeTime\_Virtual\_Server.pdf}}
   2.654 +}
   2.655 +@Misc{CodeTimeOS,
   2.656 +	author = {Sean Halle},
   2.657 +	key = {CodeTime},
   2.658 +	title = {A Hardware Independent OS},
   2.659 +	note = {{\ttfamily http://codetime.sourceforge.net/content/CodeTime\_OS.pdf}}
   2.660 +}
   2.661 +@Misc{CodeTimeSem,
   2.662 +	author = {Sean Halle},
   2.663 +	key = {CodeTime},
   2.664 +	title = {The Big-Step Operational Semantics of the CodeTime Computational Model},
   2.665 +	note = {{\ttfamily http://codetime.sourceforge.net/content/CodeTime\_Semantics.pdf}}
   2.666 +}
   2.667 +@Misc{CodeTimeTh,
   2.668 +	author = {Sean Halle},
   2.669 +	key = {CodeTime},
   2.670 +	title = {A Mental Framework for Use in Creating Hardware-Independent Parallel Languages},
   2.671 +	note = {{\ttfamily http://codetime.sourceforge.net/content/CodeTiime\_Theoretical\_Framework.pdf}}
   2.672 +}
   2.673 +@Misc{CodeTimeTh1,
   2.674 +	author = {Sean Halle},
   2.675 +	key = {CodeTime},
   2.676 +	title = {The CodeTime Parallel Software Platform},
   2.677 +	note = {{\ttfamily http://codetime.sourceforge.net}}
   2.678 +}
   2.679 +@Misc{CodeTimeTh2,
   2.680 +	author = {Sean Halle},
   2.681 +	key = {CodeTime},
   2.682 +	title = {The CodeTime Parallel Software Platform},
   2.683 +	note = {{\ttfamily http://codetime.sourceforge.net}}
   2.684 +}
   2.685 +@Misc{CodeTimeRT,
   2.686 +	author = {Sean Halle},
   2.687 +	key = {CodeTime},
   2.688 +	title = {The CodeTime Parallel Software Platform},
   2.689 +	note = {{\ttfamily http://codetime.sourceforge.net}}
   2.690 +}
   2.691 +@Misc{CodeTimeWebSite,
   2.692 +	author = {Sean Halle},
   2.693 +	key = {CodeTime},
   2.694 +	title = {The CodeTime Parallel Software Platform},
   2.695 +	note = {{\ttfamily http://codetime.sourceforge.net}}
   2.696 +}
   2.697 +@Misc{CodeTimeBaCTiL,
   2.698 +	author = {Sean Halle},
   2.699 +	key = {CodeTime},
   2.700 +	title = {The Base CodeTime Language},
   2.701 +	note = {{\ttfamily http://codetime.sourceforge.net/content/CodeTime\_BaCTiL.pdf}}
   2.702 +}
   2.703 +@Misc{CodeTimeCert,
   2.704 +	author = {Sean Halle},
   2.705 +	key = {CodeTime},
   2.706 +	title = {The CodeTime Certification Strategy},
   2.707 +	note = {{\ttfamily http://codetime.sourceforge.net/content/CodeTime\_Certification.pdf}}
   2.708 +}
   2.709 +@InProceedings{ducournau94,
   2.710 +	author = {R. Ducournau and M. Habib and M. Huchard and M. L. Mugnier},
   2.711 +	title = {Proposal for a monotonic multiple inheritance linearization},
   2.712 +	booktitle = {OOPSLA '94: Proceedings of the ninth annual conference on Object-oriented programming systems, language, and applications},
   2.713 +	year = 1994,
   2.714 +	pages = {164--175},
   2.715 +	publisher = {ACM Press}
   2.716 +}
   2.717 +@Article{emerson91,
   2.718 +	title = {{Tree automata, mu-calculus and determinacy}},
   2.719 +	author = {Emerson, EA and Jutla, CS},
   2.720 +	journal = {Proceedings of the 32nd Symposium on Foundations of Computer Science},
   2.721 +	pages = {368--377},
   2.722 +	year = 1991
   2.723 +}
   2.724 +@Article{fortune78,
   2.725 +	title = {{Parallelism in random access machines}},
   2.726 +	author = {Fortune, S. and Wyllie, J.},
   2.727 +	journal = {STOC '78: Proceedings of the tenth annual ACM symposium on Theory of computing},
   2.728 +	pages = {114--118},
   2.729 +	year = 1978,
   2.730 +	publisher = {ACM Press New York, NY, USA}
   2.731 +}
   2.732 +@Book{goldberg83,
   2.733 +	title = {{Smalltalk-80: the language and its implementation}},
   2.734 +	author = {Goldberg, A. and Robson, D.},
   2.735 +	year = 1983,
   2.736 +	publisher = {Addison-Wesley}
   2.737 +}
   2.738 +@InProceedings{goldschlager78,
   2.739 +	author = {Leslie M. Goldschlager},
   2.740 +	title = {A unified approach to models of synchronous parallel machines},
   2.741 +	booktitle = {STOC '78: Proceedings of the tenth annual ACM symposium on Theory of computing},
   2.742 +	year = 1978,
   2.743 +	pages = {89--94},
   2.744 +	location = {San Diego, California, United States},
   2.745 +	doi = {http://doi.acm.org/10.1145/800133.804336},
   2.746 +	publisher = {ACM Press}
   2.747 +}
   2.748 +@Book{gosling96,
   2.749 +	author = {J. Gosling and B. Joy and G. Steele and G. Bracha},
   2.750 +	title = {The Java Language Specification},
   2.751 +	publisher = {Addison-Wesley},
   2.752 +	year = 1996
   2.753 +}
   2.754 +@Article{hasselbring00,
   2.755 +	author = {Wilhelm Hasselbring},
   2.756 +	title = {Programming languages and systems for prototyping concurrent applications},
   2.757 +	journal = {ACM Comput. Surv.},
   2.758 +	volume = 32,
   2.759 +	number = 1,
   2.760 +	year = 2000,
   2.761 +	issn = {0360-0300},
   2.762 +	pages = {43--79},
   2.763 +	doi = {http://doi.acm.org/10.1145/349194.349199},
   2.764 +	publisher = {ACM Press},
   2.765 +	address = {New York, NY, USA}
   2.766 +}
   2.767 +@Article{hoare78,
   2.768 +	author = {C. A. R. Hoare},
   2.769 +	title = {Communicating Sequential Processes},
   2.770 +	journal = {Communications of the ACM},
   2.771 +	year = 1978,
   2.772 +	volume = 21,
   2.773 +	number = 8,
   2.774 +	pages = {666-677}
   2.775 +}
   2.776 +@Article{huth,
   2.777 +	title = {{A Unifying Framework for Model Checking Labeled Kripke Structures, Modal Transition Systems, and Interval Transition Systems}},
   2.778 +	author = {Huth, M.},
   2.779 +	journal = {Proceedings of the 19th International Conference on the Foundations of Software Technology \& Theoretical Computer Science, Lecture Notes in Computer Science},
   2.780 +	pages = {369--380},
   2.781 +	publisher = {Springer-Verlag}
   2.782 +}
   2.783 +@Article{johnston04,
   2.784 +	author = {Wesley M. Johnston and J. R. Paul Hanna and Richard J. Millar},
   2.785 +	title = {Advances in dataflow programming languages},
   2.786 +	journal = {ACM Comput. Surv.},
   2.787 +	volume = 36,
   2.788 +	number = 1,
   2.789 +	year = 2004,
   2.790 +	issn = {0360-0300},
   2.791 +	pages = {1--34},
   2.792 +	doi = {http://doi.acm.org/10.1145/1013208.1013209},
   2.793 +	publisher = {ACM Press},
   2.794 +	address = {New York, NY, USA}
   2.795 +}
   2.796 +@Book{koelbel93,
   2.797 +	author = {C. H. Koelbel and D. Loveman and R. Schreiber and G. Steele Jr},
   2.798 +	title = {High Performance Fortran Handbook},
   2.799 +	year = 1993,
   2.800 +	publisher = {MIT Press}
   2.801 +}
   2.802 +@Article{kozen83,
   2.803 +	title = {{Results on the Propositional mu-Calculus}},
   2.804 +	author = {Kozen, D.},
   2.805 +	journal = {TCS},
   2.806 +	volume = 27,
   2.807 +	pages = {333--354},
   2.808 +	year = 1983
   2.809 +}
   2.810 +@Article{kripke63,
   2.811 +	title = {{Semantical analysis of modal logic}},
   2.812 +	author = {Kripke, S.},
   2.813 +	journal = {Zeitschrift fur Mathematische Logik und Grundlagen der Mathematik},
   2.814 +	volume = 9,
   2.815 +	pages = {67--96},
   2.816 +	year = 1963
   2.817 +}
   2.818 +@Book{mcGraw85,
    2.819 +	author = {J. McGraw and S. Skedzielewski and S. Allan and R. Oldehoeft},
   2.820 +	title = {SISAL: Streams and Iteration in a Single-Assignment Language: Reference Manual Version 1.2},
   2.821 +	note = {Manual M-146 Rev. 1},
   2.822 +	publisher = {Lawrence Livermore National Laboratory},
   2.823 +	year = 1985
   2.824 +}
   2.825 +@Book{milner80,
   2.826 +	title = {{A Calculus of Communicating Systems, volume 92 of Lecture Notes in Computer Science}},
   2.827 +	author = {Milner, R.},
   2.828 +	year = 1980,
   2.829 +	publisher = {Springer-Verlag}
   2.830 +}
   2.831 +@Article{milner92,
   2.832 +	title = {{A calculus of mobile processes, parts I and II}},
   2.833 +	author = {Milner, R. and Parrow, J. and Walker, D.},
   2.834 +	journal = {Information and Computation},
   2.835 +	volume = 100,
   2.836 +	number = 1,
   2.837 +	pages = {1--40 and 41--77},
   2.838 +	year = 1992,
   2.839 +	publisher = {Academic Press}
   2.840 +}
   2.841 +@Book{milner99,
   2.842 +	author = {Robin Milner},
   2.843 +	title = {Communicating and Mobile Systems: The pi-Calculus},
   2.844 +	publisher = {Cambridge University Press},
   2.845 +	year = 1999
   2.846 +}
   2.847 +@Book{MPIForum94,
    2.848 +	author = {{MPI Forum}},
   2.849 +	title = {MPI: A Message-Passing Interface Standard},
   2.850 +	year = 1994
   2.851 +}
   2.852 +@Article{petri62,
   2.853 +	title = {{Fundamentals of a theory of asynchronous information flow}},
   2.854 +	author = {Petri, C.A.},
   2.855 +	journal = {Proc. IFIP Congress},
   2.856 +	volume = 62,
   2.857 +	pages = {386--390},
   2.858 +	year = 1962
   2.859 +}
   2.860 +@Book{pierce02,
   2.861 +	title = {Types and Programming Languages},
   2.862 +	author = {Pierce, B. C.},
   2.863 +	year = 2002,
   2.864 +	publisher = {MIT Press}
   2.865 +}
   2.866 +@Article{price,
   2.867 +	author = {B. A. Price and R. M. Baecker and L. S. Small},
   2.868 +	title = {A Principled Taxonomy of Software Visualization},
   2.869 +	journal = {Journal of Visual Languages and Computing},
   2.870 +	volume = 4,
   2.871 +	number = 3,
   2.872 +	pages = {211--266}
   2.873 +}
   2.874 +@Misc{pythonWebSite,
   2.875 +	key = {Python},
   2.876 +	title = {The Python Software Foundation Mission Statement},
   2.877 +	note = {{\ttfamily http://www.python.org/psf/mission.html}}
   2.878 +}
   2.879 +@Unpublished{reed03,
   2.880 +	editor = {Daniel A. Reed},
   2.881 +	title = {Workshop on The Roadmap for the Revitalization of High-End Computing},
   2.882 +	day = {16--18},
   2.883 +	month = {jun},
   2.884 +	year = 2003,
   2.885 +	note = {Available at {\ttfamily http://www.cra.org/reports/supercomputing.web.pdf}}
   2.886 +}
   2.887 +@Article{reeves84,
   2.888 +	author = {A. P. Reeves},
   2.889 +	title = {Parallel Pascal -- An Extended Pascal for Parallel Computers},
   2.890 +	journal = {Journal of Parallel and Distributed Computing},
   2.891 +	volume = 1,
   2.892 +	number = {},
   2.893 +	year = 1984,
   2.894 +	month = {aug},
   2.895 +	pages = {64--80}
   2.896 +}
   2.897 +@Article{skillicorn98,
   2.898 +	author = {David B. Skillicorn and Domenico Talia},
   2.899 +	title = {Models and languages for parallel computation},
   2.900 +	journal = {ACM Comput. Surv.},
   2.901 +	volume = 30,
   2.902 +	number = 2,
   2.903 +	year = 1998,
   2.904 +	issn = {0360-0300},
   2.905 +	pages = {123--169},
   2.906 +	doi = {http://doi.acm.org/10.1145/280277.280278},
   2.907 +	publisher = {ACM Press},
   2.908 +	address = {New York, NY, USA}
   2.909 +}
   2.910 +@Article{stefik86,
   2.911 +	title = {Object Oriented Programming: Themes and Variations},
   2.912 +	author = {Stefik, M. and Bobrow, D. G.},
   2.913 +	journal = {The AI Magazine},
   2.914 +	volume = 6,
   2.915 +	number = 4,
   2.916 +	year = 1986
   2.917 +}
   2.918 +@Book{stirling92,
   2.919 +	title = {{Modal and Temporal Logics}},
   2.920 +	author = {Stirling, C.},
   2.921 +	year = 1992,
   2.922 +	publisher = {University of Edinburgh, Department of Computer Science}
   2.923 +}
   2.924 +@Misc{TitaniumWebSite,
    2.925 +	author = {Paul Hilfinger and others},
   2.926 +	title = {The Titanium Project Home Page},
   2.927 +	note = {{\ttfamily http://www.cs.berkeley.edu/projects/titanium}}
   2.928 +}
   2.929 +@Misc{turing38,
   2.930 +	author = {A. Turing},
   2.931 +	note = {http://www.turingarchive.org/intro/, and http://www.turing.org.uk/sources/biblio4.html, and http://web.comlab.ox.ac.uk/oucl/research/areas/ieg/e-library/sources/tp2-ie.pdf},
   2.932 +	year = 1938
   2.933 +}
   2.934 +@Book{vonNeumann45,
   2.935 +	title = {First Draft of a Report on the EDVAC},
   2.936 +	author = {J. von Neumann},
   2.937 +	year = 1945,
   2.938 +	publisher = {United States Army Ordnance Department}
   2.939 +}
   2.940 +@Book{winskel93,
   2.941 +	title = {{The Formal Semantics of Programming Languages}},
   2.942 +	author = {Winskel, G.},
   2.943 +	year = 1993,
   2.944 +	publisher = {MIT Press}
   2.945 +}
     3.1 --- a/0__Papers/VMS/VMS__Foundation_Paper/VMS__Full_conference_version/latex/VMS__Full_conf_paper.tex	Tue Jul 03 11:27:13 2012 +0200
     3.2 +++ b/0__Papers/VMS/VMS__Foundation_Paper/VMS__Full_conference_version/latex/VMS__Full_conf_paper.tex	Sat Jul 07 02:39:50 2012 -0700
     3.3 @@ -71,9 +71,9 @@
     3.4  
     3.5  % authors.  separate groupings with \and.
     3.6  \author{
     3.7 -\authname{{Sean Halle \ \ \ \ \ \ \ \   Merten Sach \ \ \ \ \ \  \ \ Ben Juurlink}}
     3.8 -\authaddr{{Technical University Berlin, Germany}}
     3.9 -\authemail{{first.last@tu-berlin.de}}
    3.10 +\authname{Sean Halle \and Merten Sach \and Ben Juurlink \and Albert Cohen}
    3.11 +\authaddr{Technical University Berlin}
    3.12 +\authemail{first.last@tu-berlin.de}
    3.13  }
    3.14  
    3.15  %\authurl{\url{http://www.aes.tu-berlin.de/menue/home/parameter/en/}}
    3.16 @@ -88,7 +88,7 @@
     3.17  Software has not been keeping up with new parallel hardware, which slows the economy and retards adoption of new hardware. Many believe the productivity and portability challenges of parallel software can be solved with domain-specific languages. But adoption is hindered by practical problems due to the small user base, which means language development time must be short and porting across machines must be low effort.
    3.18  
    3.19  
    3.20 -To address this,  we propose the proto-runtime, which is a full runtime, but with two key pieces  replaced with an interface. A new language is created by providing: 1)the behavior of language constructs and 2) assignment of work onto hardware resources.  The pieces are simplified by keeping concurrency issues inside the proto-runtime, so the pieces are implemented using sequential reasoning. The high reuse of the proto-runtime allows intense hardware-specific tuning, which all languages inherit, keeping overhead low. 
     3.21 +To address this, we propose the proto-runtime, which is a full runtime, but with two key pieces replaced with an interface. A new language is created just by providing: 1) the behavior of language constructs and 2) the assignment of work onto hardware resources. The pieces are simplified by keeping concurrency issues inside the proto-runtime, so they are implemented using sequential reasoning. The high reuse of the proto-runtime allows intense hardware-specific tuning, which all languages inherit, keeping overhead low. 
    3.22  
    3.23  
    3.24  
    3.25 @@ -112,7 +112,7 @@
    3.26  
     3.27  To simplify creation of domain-specific languages, we propose a ``proto" runtime, which is a normal, full runtime, but with two key parts replaced with an interface.  To create a new language, one provides an implementation of those two pieces: 1) behavior of language constructs and 2) assignment of work onto hardware resources.  The pieces are simplified by keeping concurrency issues inside the proto-runtime, so they are implemented using sequential reasoning. 
    3.28  
    3.29 -The proto-runtime remains the same for all languages, causing very high reuse, which gives benefit.  Intense effort can be spent fine tuning performance of the proto-runtime, which all languages then benefit from.  Such effort would be prohibitive if done separately for every language runtime on every target hardware platform.  In addition, services for debugging, performance tuning,  gathering portability information, and so on is centralized for use by the languages.
     3.30 +The proto-runtime remains the same for all languages, yielding very high reuse.  Intense effort can be spent tuning performance of the proto-runtime, which all languages then benefit from.  Such effort would be prohibitive if done separately for every language runtime on every target hardware platform.  In addition, services for debugging, performance tuning, gathering portability information, and so on are centralized for use by the languages.
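The two replaceable pieces can be pictured as a plugin interface. The following is a minimal C sketch of that idea; all names (PR_ConstructHandler, PR_Assigner, the toy registry) are invented for illustration and are not the actual VMS interface:

```c
#include <stddef.h>
#include <string.h>

/* Opaque stand-ins for proto-runtime-owned structures. */
typedef struct VirtProcessor VirtProcessor; /* a suspended unit of work */
typedef struct ReqPayload    ReqPayload;    /* construct-specific request data */
typedef struct LangEnv       LangEnv;       /* language-private runtime state */

/* Piece 1: behavior of a language construct.  The proto-runtime calls this
 * with concurrency already excluded, so the body uses sequential reasoning. */
typedef void (*PR_ConstructHandler)(VirtProcessor *requester,
                                    ReqPayload   *req,
                                    LangEnv      *env);

/* Piece 2: assignment of ready work onto hardware.  Returns the virtual
 * processor to animate next on the given core, or NULL if none is ready. */
typedef VirtProcessor *(*PR_Assigner)(int core_num, LangEnv *env);

typedef struct {
    const char         *name;
    PR_ConstructHandler handler;
    PR_Assigner         assigner;
} PR_Language;

/* Toy registry standing in for the proto-runtime's bookkeeping. */
static PR_Language registry[8];
static int n_langs = 0;

/* A language is created by supplying just its two pieces;
 * everything else in the runtime is reused as-is. */
int PR_register_language(const char *name,
                         PR_ConstructHandler handler,
                         PR_Assigner assigner)
{
    if (n_langs >= 8 || !handler || !assigner) return -1;
    registry[n_langs].name = name;
    registry[n_langs].handler = handler;
    registry[n_langs].assigner = assigner;
    return n_langs++;
}
```

The point of the sketch is the shape of the boundary: the language supplies two sequential callbacks, and the proto-runtime keeps all concurrency handling, hardware tuning, and services on its side of the interface.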
    3.31  
     3.32  Such an approach is only attractive if it delivers high application performance and low runtime overhead.  We demonstrate this in this paper for multi-core hardware.  In the long run, the proto-runtime interface must be compatible with a much wider variety of parallel architectures and system approaches.  We reserve this for future work, concentrating here only on multi-core hardware.
    3.33  
    3.34 @@ -181,8 +181,113 @@
    3.35  
    3.36  In Section \ref{secWhatHW} we analyze a number of languages to show how their  constructs fit that basic pattern, including low-level parallel languages like TBB and pthreads, and show how a proto-runtime differs from them. In Section \ref{secResponsibility} we give the details of our proto-runtime implementation.  In Section \ref{secTopics} we show how to use the proto-runtime to create a new language construct. In Section X we give measurements of time-to-implement several different languages, and overhead in a direct comparison to pthreads and OpenMP. We conclude in Section \ref{secConclusion}.
    3.37  
    3.38 +=====================
    3.39 +
    3.40 +
    3.41 +
     3.42 +Overall Claim: 1) speed up dev of DSLs  2) take perf tuning out of app, put into lang  3) low overhead
    3.43 +
    3.44 +In detail, we claim the following features and benefits.
    3.45 +We claim our interface has the following features:
    3.46 +-] It modularizes runtimes, cleanly separating out the language-specific parts.
    3.47 +-] The language-specific parts *inherit* the performance effort put into the base proto-runtime as demonstrated in subsection X
    3.48 +-] Services are centralized in the proto-runtime, and so inherited by the new language with small or no extra effort as demonstrated in subsection X
    3.49 +-] The language directly controls hardware resources, as described in  subsection X. This enables assignment that uses construct-implied information to reduce movement of data, for high performance.
    3.50 +-] The language-specific portion can be treated as trusted code
    3.51 +-] Makes it practical to reuse behavior and assignment (scheduling) code as demonstrated in subsection X
    3.52 +-] domain-constructs co-designed w/resource assignment (not possible w/library, and higher perf due to control over comm pattern)
    3.53 +
    3.54 +Two kinds of services:
    3.55 +-] Visible to application-writer, vs visible to language implementer
    3.56 +-- --] These include services visible to the application programmer such as debugging, verification, and performance tuning
    3.57 +-- --] These also include services for the runtime implementer such as generic hardware information related to performance, generic performance counters in the form relevant to assigner-writers, optimized versions of data structures commonly used inside construct behavior implementations.
    3.58 +
    3.59 +We further claim that these features lead to the following benefits:
    3.60 +-] Good runtime overhead performance
     3.61 +-] Ultra-low time to create a new language runtime
    3.62 +-] Consequent reduced time to port a language to new hardware (assuming the proto-runtime is available for that hardware)
    3.63 +-] Amortized effort of proto-runtime, across many languages 
    3.64 +-] Attractive to reuse language implementation of constructs and assignment (subsection X)
    3.65 +-] Improved overhead performance achieved from a fixed amount of impl effort
    3.66 +-] Improved application visible features achieved from a given effort
    3.67 +-] Enables high application performance
    3.68 +-] Reduces application-effort to achieve high app-perf (due to domain-constructs pulled out of app and into lang, where integrated w/resource assignment)
    3.69 +
    3.70 + (due to lang trusted and controlling resources for low-comm placement) 
    3.71 +
    3.72 +====================================
    3.73 + 
    3.74 +What have to show to support Features Claims: 
    3.75 +-] details of *things in action* that contribute to simplification
    3.76 +-- --] interface details.. what's involved with creating a plugin.. example (modular reduces effort of learning and effort of creating..  freedom from details of internals reduces effort)
    3.77 +-- --] services avail to plugin writer, as helpers (helpers reduce effort)
    3.78 +-- --] example of reuse of assigner code (reuse reduces effort)
    3.79 +-- --] example of reuse of construct code (singleton, atomic, trans, SSR into VSs) (reuse reduces effort)
    3.80 +
    3.81 +-] details of modularizing
    3.82 +-- --] interface details.. point out, in example of impl plugin, how the construct behavior is cleanly collected inside the handler, and the assignment behavior is cleanly collected inside the assigner.. more detail on assigner services avail to get hardware info
    3.83 +-- --] example of reusing SSR constructs inside VSs.. show how dispatch approach and separate handlers modularizes (also point out reuse here)
    3.84 +
    3.85 +-] details of centralizing runtime perf tuning
    3.86 +-- --] In example, when going through code, point out that internal runtime communications are inside proto-runtime, and that these are what determine the overhead of runtime.
    3.87 +
    3.88 +-] details of central services available.
    3.89 +-- --] app-services.. debugging phases, probes, perf tuning (companion paper), (planned) replay, (planned) verification (because interface provides simplifications and opportunities) 
    3.90 +-- --] plugin services.. send request to runtime, suspend VP, create VP, perf-counters for assigner use, migration of VP between cores
    3.91 +
    3.92 +-] Details of lang inside resource control.
    3.93 +-- --] when showing the assigner example, point out how the lang implements it, and give an example of constructs providing info to the assigner
    3.94 +
    3.95 +Measurements to support Benefits Claims: time-to-create for a variety of languages, including at least one DSL from scratch.  Overhead in head-to-head comparisons.
    3.96 +
    3.97 +Done.
    3.98 +
    3.99 +Creation simplification from: sequential plugin code -- show impl of at least two constructs (mutex and send)..  show equiv done with locks (?)
   3.100 +Simpl from: standard pattern
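A minimal sketch of what "sequential plugin code" for a rendezvous send construct could look like. All type and function names here (SlaveVP, Chan, handle_send, handle_receive) are hypothetical illustrations, not the actual proto-runtime API; the point is only that the handlers read as plain sequential code because the proto-runtime serializes them.

```c
#include <stddef.h>

/* Toy rendezvous send/receive handlers -- names hypothetical. */
typedef struct { int id; int mailbox; int suspended; } SlaveVP; /* a bit of work */
typedef struct { SlaveVP *sender; int pending_msg; SlaveVP *receiver; } Chan;

/* Sender side: if a receiver is already waiting, deliver and free it;
   otherwise park the message and suspend the sender (rendezvous). */
SlaveVP *handle_send(SlaveVP *vp, Chan *c, int msg) {
    if (c->receiver) {
        c->receiver->mailbox = msg;
        c->receiver->suspended = 0;   /* receiver re-enters the ready pool */
        c->receiver = NULL;
        return vp;                    /* sender continues immediately */
    }
    c->sender = vp;
    c->pending_msg = msg;
    vp->suspended = 1;                /* sender waits for a receiver */
    return NULL;
}

/* Receiver side: take a pending message and free the sender, or wait. */
SlaveVP *handle_receive(SlaveVP *vp, Chan *c) {
    if (c->sender) {
        vp->mailbox = c->pending_msg;
        c->sender->suspended = 0;     /* sender re-enters the ready pool */
        c->sender = NULL;
        return vp;
    }
    c->receiver = vp;
    vp->suspended = 1;
    return NULL;
}
```

Note there is no lock or CAS anywhere in the handler bodies; exclusion is assumed to be provided underneath, which is exactly what the equivalent done with locks would have to supply by hand.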
   3.101 +
    3.102 +Hmmm.. actually have interactions, and VMS is a proposed compromise.. one can do equally simple construct implementations using a global lock.. but the interface hides things like the random backoff we had to include for larger machines; in contrast, the simple CAS method grinds to a halt on the larger machine.
   3.103 +
   3.104 +Okay -- claim THAT: VMS is a balance point btwn 
   3.105 +
   3.106 +
   3.107  \section{Background and Related Work}
   3.108  
   3.109 +?
   3.110 +Prob description.. part of why domain-specific good:
    3.111 +  Performance of application code is an even higher priority for language designers; the language implementor must account for it throughout the implementation.
    3.113 +  In parallel execution, the application's execution is broken into bits of work that are each scheduled onto particular hardware. The position at which a particular bit executes determines which data has to be brought to that position, to be consumed by the work.  A poor choice causes unneeded movement of data, which reduces performance, in many cases drastically.
   3.114 +
    3.115 +  Any approach for implementing high-performance languages must therefore give the language control over placement of work.  To be performance-portable as well, the language must also prevent hardware-related choices from entering application code. Such prevention means that domain patterns that coordinate multiple bits of work must either be directly encoded as domain-constructs, or else the coordination must be exposed via generic language constructs.
   3.116 +
    3.117 +  For example, performing a matrix multiply involves dividing the work into pieces and coordinating those pieces. As such, a language for a domain where matrix multiply is common can either make matrix multiply a domain-construct in the language, or else supply a construct to put work division under language control, along with constructs that explicitly state the constraints among the pieces, plus a way to expose what data is in each piece to the resource assigner. Bringing the whole matrix multiply in as a domain-construct allows the language implementor to use their knowledge of the pattern to create a work-divider tuned to the hardware, and to encode inside the language their knowledge of what data is common to different pieces of work.  It also allows automated use of a tool like Spiral [] for hardware-optimal implementations.  Hence the attractiveness of pulling domain patterns into the language as domain-constructs.
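As a toy illustration of the work division just described, a block-divided matrix multiply might look like the following. The tile size and matrix size are arbitrary illustration values; each (it, jt) tile of C is one independent piece of work whose input data (a row band of A, a column band of B) a domain-construct could expose to the resource assigner.

```c
/* Toy tiled matrix multiply: the kind of hardware-tunable work division
   a matrix-multiply domain-construct would encapsulate.
   N and TILE are illustration values only. */
#define N 4
#define TILE 2

void matmul_tiled(const double A[N][N], const double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0;
    for (int it = 0; it < N; it += TILE)       /* one piece of work ...    */
        for (int jt = 0; jt < N; jt += TILE)   /* ... per (it, jt) tile    */
            for (int i = it; i < it + TILE; i++)
                for (int j = jt; j < jt + TILE; j++)
                    for (int k = 0; k < N; k++)
                        C[i][j] += A[i][k] * B[k][j];
}
```

Choosing TILE to match the cache hierarchy, and placing tiles that share a row band of A on the same core, is exactly the hardware knowledge the language implementor can bake in once, instead of every application programmer redoing it.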
   3.118 +
    3.119 +\subsection{If one chooses the domain-specific route, then what?}
    3.120 +  So, the choice has been made to implement a domain-specific language, in the embedded style; now, how to do it?  The effort comes down to the runtime, so one either implements the runtime from scratch, say on top of the OS's threads, or grabs an existing one and modifies it.  In practice, not much difference exists between the two, because the code of existing runtimes is difficult to modify, due to its low-level nature and concurrency, unless the runtime is very simple, in which case it has low performance and lacks features.  For both approaches, the limited budget of effort available forces a choice. Either simple runtimes are made for several hardware targets, but they are low performance and without services like debugging helpers, or else the effort is invested in one target, giving it rich services and tuning it for performance, but nothing is left in the budget for other hardware.
   3.121 +
    3.122 +  In this paper, we offer an alternative way to create languages that exploits an apparently universal pattern within runtimes, and so modularizes them, separating the language-specific parts from the low-level parts that determine runtime performance. This decomposition of the runtime represents a balance point: given a limited effort budget, the approach requires ultra-low effort while delivering good runtime overhead and excellent opportunity for high application performance.  Future work will investigate other design points that promote even higher runtime performance, at the cost of slightly more implementation effort.
   3.123 +
   3.124 +  The end result is what we call a proto-runtime, which is a full runtime with two key pieces replaced by an interface.  To implement a new language, one provides implementations of these two pieces:
   3.125 +   1) The behavior of the new language constructs, which decide *when* a bit of work is free to execute
   3.126 +   2) An assigner that chooses *which* hardware executes a free bit of work.
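To make the two pieces concrete, here is a minimal sketch in C of what a language plugin might supply. All names (SlaveVP, ToyMutex, the handler and assigner signatures) are illustrative assumptions, not the actual proto-runtime API; the sketch only shows that the construct handler decides *when* work is free, while the assigner decides *which* core runs it.

```c
#include <stddef.h>

/* Hypothetical sketch of the two plugin pieces a language supplies. */

typedef struct { int id; int suspended; } SlaveVP;       /* one bit of work  */
typedef struct { int locked; SlaveVP *waiter; } ToyMutex;

/* Piece 1: construct behavior -- decides WHEN a bit of work is free.
   Plain sequential code; the proto-runtime serializes handler calls. */
SlaveVP *handle_acquire(SlaveVP *vp, ToyMutex *m) {
    if (!m->locked) { m->locked = 1; return vp; }  /* free immediately */
    vp->suspended = 1;
    m->waiter = vp;                                /* must wait */
    return NULL;
}

SlaveVP *handle_release(ToyMutex *m) {
    SlaveVP *next = m->waiter;
    m->waiter = NULL;
    m->locked = (next != NULL);                    /* hand lock to waiter */
    if (next) next->suspended = 0;
    return next;                                   /* freed work, if any */
}

/* Piece 2: assigner -- decides WHICH core executes a free bit of work.
   Trivial placement policy here: round-robin by VP id. */
int assign_to_core(const SlaveVP *vp, int num_cores) {
    return vp->id % num_cores;
}
```

A real assigner would consult hardware information and data-locality hints supplied by the constructs, but the division of responsibility stays the same.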
   3.127 +
   3.128 +==============?
   3.129 +
    3.130 +?Three goals, as per the future-arch paper intro?
   3.131 +
   3.132 +For productivity, Domain-specific langs.. 
   3.133 +
   3.134 +Point 1 is lots of languages onto lots of hardware. 
   3.135 +
   3.136 +Point 2 is new hardware runs software right away, from all those langs (domain-specific)
   3.137 +
    3.138 +Serious logistical, real-world issue.. if no solution, then we don't get domain-specific -- it has to be low-labor to add domain-specific support for all popular hardware, or else domain-specific is not viable.  If domain-specific succeeds, then we need low labor to enable all those languages on new hardware, else we have a software deficit for new hardware -- which retards hardware advancement..
   3.139 +
    3.140 +No matter what, for domain-specific to succeed and allow hardware to advance freely, we need a solution that gets many languages onto many hardware targets with low labor.
   3.141 +
   3.142 +
   3.143 +=================
   3.144 +
   3.145  More on domain-specific
   3.146  
   3.147  HWSim as a domain-specific
   3.148 @@ -195,26 +300,16 @@
   3.149  
   3.150  SEJITs -- limited to just operations -- can't encode patterns like HWSim, can't supply new sync constructs.
   3.151  
   3.152 +\subsection{Tie Points}
   3.153 +This is about 
   3.154 +
   3.155  
   3.156  \section{Paper Design}
   3.157  
   3.158  Starting-point:  
   3.159  
   3.160 -What do people know, buy right away as "yes, this is problem, need solution"
   3.161  
   3.162 -The line about: software lags behind hardware, and line about: need to be easier to introduce new hardware. 
   3.163  
   3.164 -?Three goals, as per future arch paper into ?
   3.165 -
   3.166 -For productivity, Domain-specific langs.. 
   3.167 -
   3.168 -Point 1 is lots of languages onto lots of hardware. 
   3.169 -
   3.170 -Point 2 is new hardware runs software right away, from all those langs (domain-specific)
   3.171 -
   3.172 -Serious logistical, real-world issue.. if no solution, then don't get domain-specific -- has to be low-labor to add domain-specific for all popular hardware, or else domain-specific not viable.  If domain-specific succeeds, then need low-labor to enable all on new hardware, else have software-deficit for new hardware -- retards hardware advancement..
   3.173 -
   3.174 -No matter what, for domain-specific to succeed and allow hardware to advance freely, need a solution for many languages low-labor onto many hardware.
   3.175  
   3.176  
   3.177  Golden bridge: