# HG changeset patch # User Some Random Person # Date 1334245749 25200 # Node ID dd038db1f19178a00ad3f286fe0bfa3ac31d87df # Parent 1de9173d4226bedabc774d190f45f5f7910f3ba1 Future arch -- not sure diff -r 1de9173d4226 -r dd038db1f191 0__Papers/Future_Architecture/latex/Future_Architecture.tex --- a/0__Papers/Future_Architecture/latex/Future_Architecture.tex Thu Apr 12 08:25:57 2012 -0700 +++ b/0__Papers/Future_Architecture/latex/Future_Architecture.tex Thu Apr 12 08:49:09 2012 -0700 @@ -286,9 +286,10 @@ \end{abstract} \section{Introduction} - Current parallel programming is blocked from hitting main-stream industry because it has lower productivity than sequential, requires to re-write source for each new target to get good performance, and disrupts ways programmers think and their work-flow. All of which makes it too expensive. +\label{secIntro} +Current parallel programming is blocked from main-stream industry because it has lower productivity than sequential programming, forces a re-write of the source for each new target to get good performance, and disrupts the ways programmers think and their work-flow. All of this makes it too expensive. -Many believe a solution to productivity is domain-specific languages. To be a solution, a large number of such domain-specific languages has to be created and ported to each hardware target. Such creation and porting has to be done inexpensively because each language has a small user-base. +Many believe a solution to productivity is domain-specific languages. However, to be a real solution, a large number of such domain-specific languages have to be created and ported to each hardware target. Such creation and porting has to be done inexpensively because each language has a small user-base. Solving performant-portability is more difficult. Such portability means source is written once, then automatically specialized to all hardware targets, so that it runs with high performance on each.
To achieve this, the one source has to capture all information needed by all specialization techniques for all hardware, current and future. @@ -297,22 +298,32 @@ We call this the triple-goal of Productivity, Performant-Portability and Adoptability for parallel software. Throughout the paper, we tie specific details of our proposed approach to these three goals. -A previously suggested solution to the triple-goal is a software stack that is based around specialization, and is oriented towards independent, small, contributions to the stack, which collectively improve the specialization process. Productivity is solved by efficient and practical support of domain-specific languages. Performant-portability is solved by conveniently supporting the full range of specialization techniques. Adoptability is solved by flexibility to adapt to current and future hardware, with gentle transition that is practical, cost-sensitive, and effort-reducing. +One suggested solution to the triple-goal is a software stack that is based around specialization, and is oriented towards independent, small contributions to the stack, which collectively perform the specialization process. Productivity is solved by efficient and practical support of domain-specific languages. Performant-portability is solved by conveniently supporting the full range of specialization techniques, and accumulating them from many sources. Adoptability is solved by flexibility to adapt to current and future hardware, with a gentle transition that is practical, cost-sensitive, and effort-reducing. In this paper, if the premise of such a software stack is accepted, and the premise that domain-specific languages solve the productivity problem, then we propose that supporting runtimes in hardware is better than supporting any particular set of parallelism constructs, even ones as basic as the Compare And Swap instruction or Thread constructs. -In Section X we give details of the hardware we propose to support the runtimes.
In Section X we expand on the software stack and how it fits with the runtime hardware and how the two together support the three goals. In Section X we apply the proposal to the topics of interest of this workshop to see if they are consistent and address the concerns. And we conclude in Section X with a summary. -(In stack section, be sure to mention that to achieve portable, have to get to point that no software uses shared variables without protecting via a language construct. Also, software does all sync via language constructs, doesn't roll its own via flags on shared vars or something like shared-mem sync impl -- end result is no comm of any kind outside of language construct "protection". +One reason is fit: specific constructs are better for specific programming models but worse for others, and because domain-specific languages span a very wide variety of models, most will not fit direct hardware support of particular constructs well. + +The other reason is adoption: specific constructs in HW are only advantageous for the few programming models they directly support, which means that, to be economically viable, those programming models have to be dominant. But this is a chicken-and-egg situation: without the specific support, those models have no special advantage, so little drives them to become the dominant model, and hence there is no motivation for HW to go to the expense of supporting them. + + +In Section \ref{secWhatHW} we give details of the hardware we propose to support the runtimes. In Section \ref{secResponsibility} we expand on the software stack and how it fits with the runtime hardware and how the two together support the three goals. In Section \ref{secTopics} we apply the proposal to the topics of interest of this workshop to see if they are consistent and address the concerns. We conclude in Section \ref{secConclusion} with a summary. + \section{What parallel abstractions should the hardware provide?} - +\label{secWhatHW} Our position is that the hardware should not directly supply any parallel abstractions.
Instead, it should supply a mechanism that elevates the language runtime to the status of a Hardware Abstraction Layer, which is separate from the executable and separate from the OS. Thus, parallel abstractions are implemented as soft-extensions to the hardware. With suitable support, many firmware-implemented parallel abstractions would require only a handful of instructions with a similarly low number of cycles of overhead. -This arrangement solves a number of problems currently facing language designers and runtime implementers, as shall be seen throughout the rest of the paper. First, it makes all application-resident information available to control the innermost level of hardware, right down to swapping contexts in and out of registers. Second it increases practicality of domain-specific languages, which is one main path to high programmer productivity. Third it improves portability directly and supports a software stack arrangement that may be a viable long-term solution to portability. -These claims may not be apparent at this point, but support for them will be pointed out throughout the paper. +This arrangement solves a number of problems currently facing language designers and runtime implementers. First, it makes all application-resident information available to the runtime, and gives it control over the innermost level of hardware, right down to swapping contexts in and out of registers. Second, it increases the practicality of domain-specific languages, which is one main path to high programmer productivity. Third, it improves portability directly, and fits the proposed software stack arrangement, providing a natural and smooth transition from existing hardware to hardware with such firm-ware runtime support. -Expanding on the first claim, the semantics of constructs, and information extracted by the toolchain can both be used by the runtime in decisions about task contents, which task to which core, and order of task assignment.
The communication pattern that results determines how much communication is overlapped with useful work, the energy of the computation, throughput, and average utilization. + + + +\subsection{Soft-extension of instruction-set} + +Precedent for soft-extensions to instruction sets exists. The Alpha chips from DEC executed complex VAX instructions by switching fetch over to a special memory containing normal Alpha instructions, which implemented the functionality. + \begin{figure}[ht] \center{ @@ -324,19 +335,9 @@ \label{figTimeMapping} \end{figure} - -Expanding on the second claim, domain-specific languages are expected by many to be the path to high productivity, but currently, each language requires significant effort to create, and more importantly to port to each hardware target. The small user-base of each language cannot support such cost, making domain-specific languages impractical. We will show how such a firm-ware runtime supports a style of software stack that minimizes the creation and porting effort for domain-specific languages [HWSim and codec lang]. - -The third claim, portability, occupies most of section \ref{secResponsibility}. - - -\subsection{Soft-extension of instruction-set} - -Precedence for soft-extensions to instruction sets exists. The Alpha chips from DEC executed complex VAX instructions by switching fetch over to a special memory containing normal Alpha instructions, which implemented the functionality. - An analogous approach is illustrated in Figure \ref{figTimeMapping}. Here, one op-code is set aside as the ``switch to runtime'' operation. Its execution causes instructions to switch to fetching from the firm-ware. Information is communicated via register contents, which point to data-structures that include a hardware defined portion and a language defined portion. -This firmware was written by the language-provider, so it is separate from the executable. It implements the behavior of parallelism constructs of the language.
+This firm-ware was written by the language-provider, so it is separate from the executable. It implements the behavior of parallelism constructs of the language. Such an approach addresses security, portability, and efficiency. It is secure because the OS controls the firm-ware. It is portable because the executable only contains the \emph{interface} to the constructs (implementation is separate). It is efficient because the firm-ware runs in user-space, and switching to it costs the same as a \texttt{call}. This also improves application performance, because firm-ware has control over low-level behaviors such as hardware-supported swapping of contexts and control of hybrid cache/scratchpad memory. @@ -376,7 +377,7 @@ The language used provides constructs for rendez-vous style send and receive, plus constructs that identify the bundle-data and unbundle-data code. Send and receive are implemented as part of the language, as runtime firm-ware. In contrast, the bundle and unbundle code is extracted from the application by the toolchain and packaged into the executable. During the run, an OS call causes that bundle and unbundle \emph{communication} firm-ware to be linked into the communication processors. -When a task executes send or receive, the firm-ware swaps the context out, suspending the task, and replaces it with a non-blocked task. Simultaneously, the firm-ware causes the communication processor to execute bundle or unbundle code. When communication completes, the task is unblocked. +When a task executes send or receive, the runtime firm-ware swaps the context out, suspending the task, and replaces it with a non-blocked task. Simultaneously, the runtime causes the communication processor to execute bundle or unbundle code. When communication completes, the task is unblocked. 
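The send/receive behavior just described can be sketched as a small piece of runtime firm-ware written in C. This is an illustrative sketch only: every name here (vp_t, firmware_send, the comm_busy flag standing in for the communication processor) is invented for exposition, since the paper specifies behavior rather than an API.

```c
#include <stddef.h>

/* Illustrative sketch only: structure and function names are invented;
 * the paper describes the behavior, not this exact interface. */

typedef enum { VP_READY, VP_BLOCKED } vp_state_t;

typedef struct vp {
    vp_state_t state;
    struct vp *next;            /* link for the ready queue */
} vp_t;

typedef struct { vp_t *head, *tail; } vp_queue_t;

static void enqueue(vp_queue_t *q, vp_t *vp) {
    vp->next = NULL;
    if (q->tail) q->tail->next = vp; else q->head = vp;
    q->tail = vp;
}

static vp_t *dequeue(vp_queue_t *q) {
    vp_t *vp = q->head;
    if (vp) {
        q->head = vp->next;
        if (!q->head) q->tail = NULL;
    }
    return vp;
}

static vp_queue_t ready_q;      /* non-blocked virtual processors */
static int comm_busy;           /* stands in for the comm processor */

/* Runtime firm-ware entry for a rendez-vous send: swap the sender's
 * context out (suspend it), hand the bundle work to the communication
 * processor, and return a non-blocked VP to swap in. */
vp_t *firmware_send(vp_t *sender) {
    sender->state = VP_BLOCKED;
    comm_busy = 1;              /* comm processor runs the bundle code */
    return dequeue(&ready_q);
}

/* Called when the communication processor signals completion:
 * the suspended task is unblocked and made ready again. */
void firmware_comm_done(vp_t *sender) {
    comm_busy = 0;
    sender->state = VP_READY;
    enqueue(&ready_q, sender);
}
```

The point of the sketch is how little the firm-ware has to do: suspend, dispatch, resume another context, then unblock on completion.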
@@ -394,7 +395,7 @@ \subsection{Speculation and Fast Control Message Support} -Hardware support for speculation will work especially well with a firm-ware runtime coupled to a communication processor. Transactional memory[], thread-level speculation[], and higher-level speculative constructs[] could each be supported by generic lower-level mechanisms, which are in turn invoked by the firm-ware runtime. +Hardware support for speculation will work especially well with a firm-ware runtime coupled to a communication processor. Transactional memory[], thread-level speculation[], and higher-level speculative constructs[] could each be supported by generic lower-level mechanisms, which are in turn invoked by the communication firm-ware. This arrangement isolates hardware from the language consistency-model and execution-model. There is no longer a large penalty for mis-match. To get this decoupling, hardware is simplified, by factoring the semantics out, leaving only generic ``ordering'' primitives. @@ -420,7 +421,7 @@ Checkpoints may also be used to support shared-memory style consistency models, but speculatively. New check-points are periodically generated, while previous ones are examined for conflicts. Examination takes place in the communication processors, supported by hardware for comparing lists of tags. Conflicts cause roll-back, and restart with updated state from one of the conflicting local memories. - Such hardware can also be used to turn off the tight consistency of current snooping-based protocols for the bulk of computation, saving time and energy for the code that doesn't need it. Such consistency is only enabled for the few specialized portions of code, those that use shared variables as control-messages, such as in software-based mutex algorithms. + Such hardware can also be used to turn off the tight consistency of current snooping-based protocols for the bulk of computation, saving time and energy for the code that doesn't need it. 
Such consistency is only enabled for a few specialized portions of code, those that use shared variables as control-messages, such as in software-based mutex algorithms. Another alternative is to only update shared memory when synchronization constructs imply handoff of ownership. This uses the tag hardware to track individual objects or data structures. The synchronization construct in the runtime firm-ware triggers the communication firm-ware to update all objects on the core that gains ownership, from modifications made on the core giving up ownership. The tag-processing comparison functions make this fast and efficient. This not only eliminates the time and energy lost to snooping and directory protocols, but also simplifies the programming model and removes non-portable shared-memory code from executables. @@ -434,7 +435,7 @@ \paragraph{setup and switch} At the appropriate place in the binary, instructions load one register with the pointer to a mutex structure, and another register with the pointer to the virtual-processor (VP) requesting the mutex-lock. Next, the \texttt{switch} instruction executes, which switches fetch over to the firm-ware of the runtime, while saving the stack and frame pointers into the data-struct of the requesting VP. -In this example, the hardware specifies a ``virtual processor'' (VP) data structure. It begins with a hardware defined portion that the \texttt{switch} instr automatically manages. +In this example, the hardware specifies a ``virtual processor'' (VP) data structure. Its first locations make up a hardware-defined portion that the \texttt{switch} instruction automatically manages. \paragraph{runtime internals} After \texttt{switch}, runtime code executes from the protected firm-ware. The code for mutex-acquire expects a pointer to a mutex struct to be in a particular register, checks the ``current owner'' field, and if empty writes the pointer to the VP (held in another register) into it. It then marks the VP as unblocked.
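This mutex-acquire walkthrough, including the already-owned case, can be sketched in C. All names here are invented for illustration (the pointers arrive in registers in the paper's scheme, modeled as arguments below); note that no atomic instructions appear, because the firm-ware has exclusive, local access to the mutex structure.

```c
#include <stddef.h>

/* Illustrative sketch: names are invented; the paper describes the
 * firm-ware's behavior, not this exact API. */

typedef enum { VP_READY, VP_BLOCKED } vp_state_t;

typedef struct vp {
    vp_state_t state;
    struct vp *next;               /* link in a mutex's wait queue */
} vp_t;

typedef struct {
    vp_t *owner;                   /* the ``current owner'' field */
    vp_t *queue_head, *queue_tail; /* blocked VPs waiting for the lock */
} mutex_t;

/* Firm-ware body reached via the switch instruction.  The mutex and
 * VP pointers are register contents, modeled here as arguments. */
void firmware_mutex_acquire(mutex_t *m, vp_t *vp) {
    if (m->owner == NULL) {
        m->owner = vp;             /* owner empty: take ownership */
        vp->state = VP_READY;      /* mark the VP unblocked */
    } else {
        vp->state = VP_BLOCKED;    /* already owned: queue the VP, */
        vp->next = NULL;           /* where it remains blocked     */
        if (m->queue_tail) m->queue_tail->next = vp;
        else m->queue_head = vp;
        m->queue_tail = vp;
    }
}
```

Because all accesses are local to the firm-ware, both paths are a handful of plain loads and stores, consistent with the cycle counts claimed later.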
Similarly, if the mutex is already owned, it places the VP into the mutex struct's queue, where it remains blocked. @@ -443,7 +444,7 @@ The execution time of this can be on the order of 10 cycles. Such speed requires hardware support for swapping VPs in and out, such as set-aside cache or scratch-pad memory with a wide port to registers, and speculative access to the mutex data-structure. This makes all memory access local and fast. -The speculative access would be verified while computation continues. If memory consistency is performed only upon command of the runtime, and hardware supports check-point and rollback, such as in Lujan's work[] then computation can continue without speed penalty. +Speculative accesses would be verified while computation continues. If memory consistency is performed only upon command of the runtime, and communication hardware supports check-point and rollback, then computation can continue without speed penalty. Notice that no atomic memory instructions have been used. Further, the application binary contains only \emph{interfaces} to high-level constructs. All operations have been local and fast, despite maintaining global consistency of global address space. @@ -452,16 +453,31 @@ \section{Which should be the responsibility / functionality of the programmer, the runtime software, and the hardware?} \label{secResponsibility} -With such a hardware arrangement, the responsibilities naturally break down along the lines of a software stack []. The goal of it is to support specialization, which is the process of transforming the original source into a form that is highly efficient on the target hardware. This is the heart of portability. + +According to the cited work on portability, responsibilities naturally break down along the lines of a software stack []. The goal of it is to support specialization, which is the process of transforming the original source into a form that is highly efficient on the target hardware. 
This is the heart of portability. Each layer of the stack has some role in the specialization process, while the application, on top, provides the information that the rest of the stack needs while performing the specialization. Ideally, the application must not expose hardware assumptions nor hinder specializations for particular targets. -The proposed hardware naturally supports such a stack. The bottom layer is an interface to simplify creation of the firm-ware runtime implementations. The set of runtimes themselves forms the next layer above that. Above the runtimes is the set of toolchains that generate the executables that talk to the runtimes. Above the toolchains is the set of language-interfaces, and above that, at the top, is the set of applications. + + + + The layers and interfaces are seen in Fig X. Starting at the bottom is the set of interfaces to the processors, currently the Instruction Sets of multi-core chips. Above that is a layer consisting of the implementations of a hardware abstraction. The abstraction exports an interface that simplifies creation of runtimes. The set of runtime implementations forms the next layer, and above them is the layer consisting of toolchains that generate the executables that talk to the runtimes. The languages comprise the interface between applications and toolchains, and on the top, is the set of applications. -The applications only expose constructs, ones designed to avoid hardware implications. Languages with such constructs include CnC[], WorkTable[] and HWSim []. The concurrency constructs are implemented by the runtimes. This alone doesn't ensure portability, but it goes a long way towards that goal, by removing the largest source of hardware-specific information. +The applications only use specially designed constructs, which avoid hardware implications. Languages with such constructs include CnC[], WorkTable[] and HWSim []. 
The constructs are implemented mainly by the runtimes, and occasionally by the toolchain. Using the constructs doesn't by itself ensure portability, but it goes a long way towards that goal, by removing the largest source of hardware-specific information. -Such a stack supports high productivity through domain-specific languages, such as HWSim, making them simple to create, easy to port across hardware, and high performance. The application programmer is responsible only for application-relevant concepts, reducing their learning curve and matching their mental model to the language. They have domain-specific parallelism constructs provided, either embedded-style as library calls, or with compiler support. +Applications on top of such a stack should not use shared variables without protecting access via a language construct. This precludes ``roll your own'' synchronizations or communications implemented using shared variables. + + +The proposed hardware naturally supports such a stack. The abstraction used to simplify runtime creation is currently implemented as a software layer, including assembly primitives for switching between application and runtime. In our proposal, much of the abstraction will be directly implemented as hardware. + +Large portions of current language runtime code that exists for multi-cores should work verbatim with the new hardware support. Only portions that take advantage of acceleration should need modification. + +This helps adoptability of the new hardware, by providing a seamless migration from current hardware to the new, without modification of application code. Being able to re-purpose existing runtime code to such new hardware further eases adoption of the hardware. + +Contrast this with hardware that directly implements specific parallelism constructs. That does not fit such a software stack well. The abstraction can still be supplied for it, but the construct hardware will sit idle for most applications. 
The construct hardware will only be used when running code written in a language that includes those constructs. Hence, domain-specific languages are not supported, and the hardware will only be attractive to the segment of industry that uses those languages. + +Such a stack makes domain-specific languages, such as HWSim, simple to create, easy to port across hardware, and high performance. The application programmer is responsible only for application-relevant concepts, reducing their learning curve and matching their mental model to the language. They have domain-specific parallelism constructs provided, either embedded-style as library calls, or with compiler support. The constructs help specialization by identifying the tasks, the constraints on scheduling the tasks, and the data to be communicated between tasks. @@ -469,50 +485,59 @@ The helpers are either derived by the toolchain, or encoded directly in the application via suitable constructs. Either way, domain-specific constructs must be designed such that the information is captured, and convenient for the tools to extract. -One last concern is the creation of all these firm-ware runtimes. It would be good to uniform-ize them as much as possible. That reduces the work of creating one for a particular language, by reusing the interface over many languages. An example is the Virtualized Master-Slave interface[]. - - +One last concern is the creation of all these firm-ware runtimes. It would be good to uniformize them as much as possible. That reduces the work of creating one for a particular language, by reusing the interface over many languages. An example is the Virtualized Master-Slave interface[]. \section{Specific Topics of Interest} +\label{secTopics} Now that a position has been stated, let us examine how it applies to the topics of interest, to check its consistency and usefulness. \paragraph{enabling future parallel programming models} - \texttt{switch}-to-runtime supports current and enables foreseeable future parallel programming models.
It maintains very low overhead for them, by embedding the switch mechanism in the pipeline, and by providing hardware support for common runtime constraint-management and assignment operations like hash tables and context swapping. The combination of software flexibility, with efficiency, and the added bonus of bringing application information into the lowest-hardware-level resource management appears strong. + \texttt{switch}-to-runtime supports current and enables foreseeable future parallel programming models. It maintains very low overhead for them, by embedding the switch mechanism in the pipeline, and by providing hardware support for common runtime constraint-management and assignment operations like hash tables and context swapping. The combination of software flexibility, efficiency, and bringing application information into the lowest-hardware-level resource management appears strong. \paragraph{innovative architectural execution models} Innovative architectural execution models are more practical when isolated from the programming model. \texttt{switch}-to-runtime lets widely different hardware all implement the same programming model efficiently. This gives hardware freedom to explore without code legacy constraining it. However, high-speed internal-to-runtime messages, speculation support, and decoupled communication processors may be considered elements of an architectural execution model advocated by our position. -\paragraph{novel memory hierarchies} This would be coupling memories to their own communication processor that performs all movement of data to remote memories. Also that memories be configurable, to have tags that include check-point and sandbox IDs, along with hardware for sending lists of tags that have a given ID, and ability to check tags against such a list. 
-Together, these features should efficiently implement transactional memory, thread-level speculation, acquire-release, and speculative implementation of the tighter variations on sequential consistency. -\paragraph{simplified and scalable memory models} The communication processor plus speculation hardware can support a wide variety of consistency models, including simplified high-level ones implied by domain-specific constructs. The speculation and linkage to context-swapping allows memory consistency and communication to be overlapped by work. Scalability is in the hands of communication firm-ware. +\paragraph{novel memory hierarchies} Coupling memories to their own communication processor that performs all movement of data to remote memories is one memory hierarchy suggestion. Also, make memories configurable, with tags that include check-point and sandbox IDs. Add hardware for sending lists of tags that have a given ID, and the ability to check tags against such a list. +Together, these features should efficiently implement transactional memory, thread-level speculation, acquire-release, and speculative implementation of the variations on sequential consistency. +\paragraph{simplified and scalable memory models} The communication processor plus speculation hardware can support a wide variety of memory models, including simplified high-level ones implied by domain-specific constructs. The speculation and linkage to context-swapping allows memory consistency and communication to be overlapped with work. Scalability is then in the hands of communication firm-ware. -\paragraph{high-level constructs for on-chip communications} Essentially any high-level communication construct can be implemented in firm-ware of the communication processors. Further, linkage between communication processor and runtime in the work processor brings pipeline-level hardware control into the high-level constructs.
High-level constructs can cause virtual-processors to be swapped out of hardware during communication, so that it is overlapped with useful work from a different context. +\paragraph{high-level constructs for on-chip communications} Essentially any high-level communication construct can be implemented in firm-ware of the communication processors. Further, linkage between communication processor and work processor brings pipeline-level hardware control to the high-level communication constructs. They can cause virtual-processors to be swapped out of hardware during communication, so that it is overlapped with useful work from a different context. -\paragraph{future directions in programming massively parallel systems} hierarchy of runtimes, each level tuned to one level in HW hierarchy, algorithms and code that arrange data and perform computation in a ``fractal'' arrangement, with each level of hardware looking the same in terms of communication and computation activity. Thus, communication within the computation scales the same as communication available in the hardware scales, with level in the hierarchy. +\paragraph{future directions in programming massively parallel systems} A hierarchy of runtimes, with each level tuned to one level in the HW hierarchy, will be key. The algorithms and code should be arranged so that data, and the computation on it, are divided into fractal-like patterns. The goal is for each level of hardware to look the same in terms of communication and computation activity. Thus, communication within work-units scales the same as communication available in the hardware, as one moves through the levels of the hierarchy. -Find hierarchical approximations to problems, that accumulate lower-level results, so amount of communication decreases as go up in the HW hierarchy. +This means programmers need to find hierarchical approximations to problems, where they accumulate lower-level results.
This produces an application hierarchy in which the amount of communication between pieces decreases as one goes up. \paragraph{potential bottlenecks for future parallel systems} -communication-to-computation ratio of the hardware is worsening. This drove the previous suggestion of fractal-like communication within application code. In addition, memory size is growing more slowly than computation rate, and more slowly than hardware-supported parallelism. Both of these suggest smaller work-units be found in code, else amount of parallelism will be the bottleneck, leaving processors idle. +The amount of parallelism in code will be the bottleneck. The communication-to-computation ratio of the hardware is worsening. In addition, memory size is growing more slowly than computation rate, and more slowly than hardware parallelism, so weak scaling does not apply. Both of these suggest that smaller work-units be found in code. The code has to change, else the amount of parallelism will be the bottleneck, leaving processors idle.
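The memory-versus-parallelism trend can be made concrete with a toy calculation. The numbers below are hypothetical, purely to illustrate the argument: if hardware parallelism grows faster than memory, the data available to each concurrently-executing work-unit shrinks, so work-units must shrink with it.

```c
/* Hypothetical illustration of the bottleneck argument: data
 * available per concurrent work-unit is memory divided by the
 * number of hardware contexts that must be kept busy. */
double data_per_unit(double mem_gib, double parallel_contexts) {
    return mem_gib / parallel_contexts;
}
```

With invented generation-over-generation growth rates (memory x2, parallelism x4), per-unit data halves each generation, which is exactly the pressure toward smaller work-units that the paragraph above describes.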
+ +\section{Conclusion}\label{secConclusion} + + + + +\bibliography{Bib_for_papers} + ================================== --] Main programmer visible elements: causal ordering, names of data (pointers inside data-structs), communication of data, operations applied to data, units of work, scheduling events, resulting concrete sequences of work-unit instances, tied together at certain points (for dataflow, is the firing of operations on data-sets, for functional is the application of lambda to data-instances -- the tie is where data instance output flows to multiple inputs) +-] Main programmer visible elements: +units of work, +level of units (key-word vs line vs function -- for parallel, is construct-delimited unit), +units of data (data structs, linked collections of them), +names of data (pointers inside data-structs), +communication of data, +operations applied to data, +scheduling events, +ordering among units (constraints on scheduling units), +resulting concrete sequences of work-unit instances, +tied together at certain points (for dataflow, is the firing of operations on data-sets, for functional is the application of lambda to data-instances -- the tie is where data instance output flows to multiple inputs) --] Runtime support includes: - --] "speculative exclusive access to local memory-line" - - - --] HW to create a "soft" ctxt (a virtual processor with stack), checkpoint it and restore a checkpoint.. - --] HW to accelerate common parallelism-construct ops, like hash-table, queue, search-for-match (ex is runtime impl of mutex and cond vars via queues and dataflow via hash-table) +-] HW to create a "soft" ctxt, a virtual processor struct with stack -] HW for multi-context stack (stuff talked about with Albert) @@ -520,24 +545,17 @@ -] HW to support "namespace", which is a chunk of allocated memory that a virtual-processor sees.. all pointers within a namespace are offsets from the start of the namespace.. 
so there is a reserved register that holds the namespace base addr, and pointers are added to it to get the final addr. This makes pointers equivalent to global pointers, but relocatable. A namespace is essentially a stack with only one frame. When an out-of-namespace pointer is accessed, the target namespace is accessed, the data is brought over and added to the end (or malloc'd into the namespace), and pointers within the data are translated to new offsets. This provides automated HW management of distributed memories. If the out-of-namespace pointer is within the same addr-space, then it is accessed directly -- the HW has a number of base-addr regs, which it can swap in and out.

-] HW support for memory spaces.. all data is viewed as existing within a memory-space, where that memory-space is a HW entity.. it has a start address and a length, so all pointers are offsets from the start addr (this goes back to early main-frame ideas). In code, there is no difference from shared-memory -- all data is within a data-struct or array, and data-structs contain pointers -- the difference is that code is supplied, either by the language-impl or by the programmer, that translates the pointers when data is copied or moved to a different memory-space. Each memory-space exists inside an addr-space, but is fully repositionable just by changing the base pointer.. Thinking one memory-space per virtual processor (SW ctxt)?
-] HW support for translation of pointers from the previous memory-space to the new memory-space.. so pointers become base plus offset, where base is the start of the memory-space

=======================================

Application code sits at the top, held within development tools; the runtime is separate from the executable in this stack. The separation allows a single executable to run without modification on several versions of hardware, even though the runtime uses specialized hardware instructions.

-] Performant-Portability is the most difficult technically, and boils down to the process of specializing code to the hardware. This process can span multiple points in an application's life-time, which correspond to multiple levels of the software stack. For example, compiler transforms, then runtime choices (auto-tuners), and even the swapping of particular HW abstractions are all part of specializing code to the end-hardware.
Application Layer:
-] state features of the application, in terms of constructs provided by the language interface (constructs can be "embedded" into a base sequential language, or into a base parallel lang being enhanced -- for continuity with current code bases)

?

========================================================

Outline:

\section{Introduction and Motivation}
Problem: parallel programming is not productive, is not performantly portable, and has blocks to adoption into industry.

Solution: the software stack suggested in previous publications [hotpar], plus domain-specific langs. If one buys the premise of the stack and of domain-specific languages, then a firmware runtime is better than specific constructs in HW.

One reason: specific constructs are better for the specific programming models they fit, but worse for others.. and domain-specific means a very wide variety of models, so most will not fit the direct hardware well. Plus, a firmware runtime has only slightly more overhead than specific constructs, especially if common runtime operations have acceleration hardware.
The other reason is adoption: specific constructs in HW are only advantageous for the few programming models they directly support -- meaning that, to be economically viable, those programming models have to be dominant.. but this is a chicken-and-egg problem, because without the specific support those models have no special advantage, so little drives them to become the dominant model, and so there is no motivation for HW to go to the expense of supporting them.

productivity (from domain-specific langs, and a wide variety of constructs),
portability (from unit-define, plus constraints, plus helpers, and no HW implications in the progr model.. CnC is an example),
adoptability (current tools work with the new ones, debugging, a similar work-flow, separation of concerns ((perf tuning separated from app dev -- enough info in the constructs that a separate understanding of the app isn't needed)))

-] Here is how the current approach works: libGomp, pthreads based, message-lib based, runtime as part of the app (MPI, threads, TBB -- contrast to CnC)

-] Problems with the current approach -- it isolates HW management from app info, so there is a penalty in utilization, throughput, and energy -- problems with constructs in HW, problems with atomics in HW, problems with current SW threads -- can't get wants 1, 2, and 3 because X

-] How the new approach addresses the problems -- clears the problems by X and Y -- achieves wants 1, 2, and 3 by X -- the runtime is hierarchical, matching the hardware -- at the bottom, every processor spends part of its time on the runtime, part on work -- above that, the runtime makes inter-node decisions.. this is treated as work by the lower-level runtimes -- above that, the runtime makes inter-rack decisions.. the decision is parallel work divided among the nodes, whose runtimes schedule it as work -- shoot for 10% runtime overhead or so -- runtime complexity and sophistication increase higher in the HW hierarchy (work sizes get bigger, so 10% allows much more optimization work as part of decision making)..
====================

\end{document}

Expanding on the first claim: the semantics of constructs, and information extracted from the application code by the toolchain, can both be used by the runtime in decisions about task contents, which task to run on which core, and the order of task execution. The resulting communication pattern determines how much communication is overlapped with useful work, the energy of the computation, throughput, and average utilization.

Expanding on the second claim: currently, each domain-specific language requires significant effort to create, and more importantly to port to each hardware target. The small user-base of each language cannot support such cost, making domain-specific languages impractical. The suggested software stack minimizes the creation and porting effort for domain-specific languages, and firm-ware runtime support fits well within such a stack [HWSim and codec lang].

The third claim, portability, occupies most of Section \ref{secResponsibility}.