changeset 12:b815a830d75a
Future Architecture update -- complete through section on topics of interest
| author | Some Random Person <seanhalle@yahoo.com> |
|---|---|
| date | Sun, 08 Apr 2012 13:11:55 -0700 |
| parents | 254d86cf269d |
| children | 83b3b9e15fb2 |
| files | 0__Papers/Future_Architecture/latex/Future_Architecture.tex |
| diffstat | 1 files changed, 26 insertions(+), 25 deletions(-) |
line diff
1.1 --- a/0__Papers/Future_Architecture/latex/Future_Architecture.tex Sun Apr 08 09:00:10 2012 -0700
1.2 +++ b/0__Papers/Future_Architecture/latex/Future_Architecture.tex Sun Apr 08 13:11:55 2012 -0700
1.3 @@ -295,7 +295,7 @@
1.4 \includegraphics[width=3in, height=2in]{../figures/Substitute_instr_with_firm-ware.eps}
1.5 }
1.6 \caption
1.7 - {A special op-code is recognized by the decode stage, and triggers fetch of instructions from firm-ware. The firm-ware instrs are provided to the OS as a ``hardware driver", and implement the runtime behavior of a language. The application communicates to the runtime by placing pointers to data-structures into registers just before executing the ``switch to runtime" instruction, which starts the fetch from firm-ware. Helper instructions accelerate common runtime operations, such as hash-table lookups, communication, search-for-optimum, and so on.
1.8 + {A special \texttt{switch} op-code is recognized by the decode stage, and triggers fetch of instructions from firm-ware. The firm-ware instructions are provided to the OS as a ``hardware driver", and implement the runtime behavior of a language. The application communicates to the runtime by placing pointers to data-structures into registers just before executing the ``switch to runtime" instruction, which starts the fetch from firm-ware. Helper instructions accelerate common runtime operations, such as hash-table lookups, communication, search-for-optimum, and so on.
1.9 }
1.10 \label{figTimeMapping}
1.11 \end{figure}
1.12 @@ -422,44 +422,51 @@
1.13 \section{Which should be the responsibility / functionality of the programmer, the runtime software, and the hardware?}
1.14
1.15
1.16 -With such a hardware arrangement, the responsibilities naturally break down along the lines of a software stack. The goal of it is to support specialization, which is the process of transforming the original source into a form that is highly efficient on the target hardware.
1.17 +With such a hardware arrangement, the responsibilities naturally break down along the lines of a software stack []. Its goal is to support specialization, which is the process of transforming the original source into a form that is highly efficient on the target hardware.
1.18
1.19 Each layer of the stack has some role in the specialization process, while the application, on top, provides the information that the rest of the stack needs while performing the specialization. Ideally, the application must not expose hardware assumptions nor hinder specializations for particular targets.
1.20
1.21 -The proposed hardware naturally supports such a stack. The bottom layer is the set of firm-ware runtime implementations.
1.22 -
1.23 -The application only exposes the interface to such runtimes. This alone doesn't ensure portability, but it goes a long way towards that goal, by removing the largest source of hardware-specific information.
1.24 +The proposed hardware naturally supports such a stack. The bottom layer is an interface to simplify creation of the firm-ware runtime implementations. The set of runtimes themselves forms the next layer above that. Above the runtimes is the set of toolchains that generate the executables that talk to the runtimes. Above the toolchains is the set of language-interfaces, and above that, at the top, is the set of applications.
1.25 +
1.26 +The applications expose only constructs, ones designed to avoid hardware implications. Languages with such constructs include CnC[], WorkTable[], and HWSim[]. The concurrency constructs are implemented by the runtimes. This alone doesn't ensure portability, but it goes a long way towards that goal, by removing the largest source of hardware-specific information.
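For concreteness, the hand-off described in the caption can be sketched from the application's side. The following is a minimal C sketch, assuming a GCC-style inline-assembly binding, a RISC-V-style register convention, and an invented encoding (0x0000700B) for the ``switch to runtime" op-code; the descriptor layout is likewise made up for illustration, since the real one would be defined by each firm-ware runtime:

    #include <stdint.h>

    /* Hypothetical request descriptor; the real layout would be defined
       by the firm-ware runtime of the language in use. */
    typedef struct {
        uint32_t construct_id;   /* which concurrency construct is invoked */
        void    *args;           /* construct-specific argument block */
    } rt_request_t;

    /* Place a pointer to the descriptor in an agreed register, then issue
       the "switch to runtime" instruction; decode recognizes the op-code
       and begins fetching from firm-ware. */
    static inline void switch_to_runtime(rt_request_t *req)
    {
        register rt_request_t *r asm("a0") = req;  /* assumed register */
        asm volatile (".word 0x0000700B" : : "r"(r) : "memory");
    }

To the application this looks like an ordinary function call: when the firm-ware sequence finishes, fetch presumably resumes at the following instruction.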
1.27
1.28
1.29 -Such a stack supports high productivity through domain-specific languages, making them simple to create, easy to port across hardware, and high performance. The application programmer is responsible only for application-relevant concepts, reducing their learning curve and matching their mental model to the language. They have domain-specific parallelism constructs provided, either embedded-style as library calls, or with compiler support.
1.30 +Such a stack supports high productivity through domain-specific languages, such as HWSim, making them simple to create, easy to port across hardware, and high performance. The application programmer is responsible only for application-relevant concepts, reducing their learning curve and matching their mental model to the language. They have domain-specific parallelism constructs provided, either embedded-style as library calls, or with compiler support.
1.31
1.32 The constructs help specialization by identifying the tasks, the constraints on scheduling the tasks, and the data to be communicated between tasks.
1.33
1.34 -In addition, high-quality specialization requires certain ``helpers"[]. These enable: 1) modifying the layout and order of access of data, 2) modifying the size of a task, both the data consumed and code executed by it, and 3) predicting both execution-time and data consumed by each task.
1.35 +In addition, high-quality specialization requires certain ``helpers"[]. These enable: 1) modifying the layout and order of access of data, 2) modifying the size of a task, both the data consumed and code executed by it, and 3) predicting both execution-time and data consumed by each task. An example is DKU[], which provides task-size-modification helpers.
1.36
1.37 The helpers are either derived by the toolchain, or encoded directly in the application via suitable constructs. Either way, the domain-specific constructs must be designed such that the information is captured, and convenient for the tools to extract.
1.38
1.39 +One last concern is the creation of all these firm-ware runtimes. It would be good to make them as uniform as possible. That reduces the work of creating one for a particular language, by reusing the same interface across many languages. An example is the Virtualized Master-Slave interface[].
1.40 +
1.41 +
1.42 +
1.43
1.44
1.45 \section{Specific Topics of Interest}
1.46 Now that a position has been stated, let us examine how it applies to the topics of interest, to check its consistency and usefulness.
1.47 \paragraph{enabling future parallel programming models}
1.48 - The concept of switch-to-runtime appears to be fully general, such that it supports all current and any foreseeable parallel programming models. It maintains very low overhead for them, by embedding the switch mechanism in the pipeline and providing hardware support for common runtime constraint-management and assignment operations. The combination of software flexibility, with efficiency, and the added bonus of bringing application information into the lowest-hardware-level resource management appears strong.
1.49 + \texttt{switch}-to-runtime supports current and enables foreseeable future parallel programming models. It maintains very low overhead for them, by embedding the switch mechanism in the pipeline, and by providing hardware support for common runtime constraint-management and assignment operations, like hash tables and context swapping. The combination of software flexibility with efficiency, plus the added bonus of bringing application information into the lowest-hardware-level resource management, appears strong.
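To make the three helper kinds concrete, the following is a minimal C sketch of a helper table a language implementation might hand to the toolchain or runtime; every name here (specialization_helpers_t, rt_register_helpers, and so on) is invented for illustration and is not the actual DKU or runtime API:

    #include <stddef.h>
    #include <stdint.h>

    /* One entry per helper kind named in the text. */
    typedef struct {
        /* 2) task-size modification: split one task into n smaller ones */
        int      (*split)(void *task, void **subtasks, size_t n);
        /* 1) data layout: re-order data for the target memory system */
        void     (*relayout)(void *data, size_t len);
        /* 3) prediction: estimated cycles and bytes consumed by a task */
        uint64_t (*predict_cycles)(const void *task);
        size_t   (*predict_bytes)(const void *task);
    } specialization_helpers_t;

    /* Hypothetical registration call; a toolchain could instead extract
       the same information statically from the language constructs. */
    void rt_register_helpers(const specialization_helpers_t *helpers);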
1.50
1.51 -\paragraph{innovative architectural execution models} Our position doesn't necessarily advocate a particular architectural execution model. However, the concept of switch-to-runtime cleanly separates the hardware execution model from the programming model, which is a benefit because it gives hardware more freedom to explore, without code legacy constraining it. However, high-speed internal-to-runtime messages, speculation support, and decoupled communication processors may be considered elements of an architectural execution model.
1.52 +\paragraph{innovative architectural execution models} Our position advocates isolating the architectural execution model from the programming model. \texttt{switch}-to-runtime lets widely different hardware all implement the same programming model. This gives hardware freedom to explore, without code legacy constraining it.
1.53 +However, high-speed internal-to-runtime messages, speculation support, and decoupled communication processors may be considered elements of an architectural execution model advocated by our position.
1.54
1.55 -\paragraph{novel memory hierarchies} -- helper cores run code for movement
1.56 +\paragraph{novel memory hierarchies} Our position suggests that memories be coupled with their own communication processor, which performs all movement of data to remote memories. It also suggests that memories be configurable, to have tags that include check-point and sandbox IDs, along with hardware for sending lists of tags that have a given ID, and the ability to check tags against such a list.
1.57 +Together, these features should efficiently implement transactional memory, thread-level speculation, acquire-release, and speculative implementations of the tighter variations on sequential consistency.
1.58 +\paragraph{simplified and scalable memory models} The communication processor plus speculation hardware can support a wide variety of consistency models, including simplified high-level ones implied by domain-specific constructs. The speculation, and its linkage to context-swapping, allow memory consistency and communication to overlap with work in the work processor. Scalability is left to the communication firm-ware.
1.59
1.60 -\paragraph{simplified and scalable memory models} -- name-space idea
1.61 +\paragraph{high-level constructs for on-chip communications} Essentially any high-level communication construct can be implemented in firm-ware of the communication processors. Further, linkage between the communication processor and the runtime in the work processor brings pipeline-level hardware control into the high-level constructs. High-level constructs can cause virtual-processors to be swapped out of hardware during communication, so that it is overlapped with useful work from a different context.
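As a rough illustration of the tag scheme, a memory line's tag and a commit-time check over a tag list might look like the following; the field widths and the conflict rule are assumptions of this sketch, not a hardware specification:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Assumed per-line tag: a check-point ID plus a sandbox ID. */
    typedef struct {
        uint16_t checkpoint_id;  /* check-point the line's value belongs to */
        uint16_t sandbox_id;     /* speculative sandbox that wrote it; 0 = none */
    } line_tag_t;

    /* Check a hardware-supplied list of tags before committing a sandbox:
       any line written by a different sandbox is a conflict, so abort. */
    bool can_commit(const line_tag_t *tags, size_t n, uint16_t sandbox)
    {
        for (size_t i = 0; i < n; i++)
            if (tags[i].sandbox_id != 0 && tags[i].sandbox_id != sandbox)
                return false;    /* conflict: speculation must roll back */
        return true;             /* safe: hardware may clear the tags */
    }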
1.62
1.63 -\paragraph{high-level constructs for on-chip communications} -- implemented in plugin
1.64
1.65 -\paragraph{characterization of the runtime overheads of parallel applications}
1.66
1.67 \paragraph{future directions in programming massively parallel systems} hierarchy of runtimes, each level tuned to one level in HW hierarchy, algorithms and code that arrange data and perform computation in a ``fractal'' arrangement, with each level of hardware looking the same in terms of communication and computation activity. Thus, communication within the computation scales the same as communication available in the hardware scales, with level in the hierarchy.
1.68
1.69 +Find hierarchical approximations to problems that accumulate lower-level results, so the amount of communication decreases going up the HW hierarchy.
1.70 +
1.71 \paragraph{potential bottlenecks for future parallel systems}
1.72 -communication-to-comp ratio in hardware is worsening.. must find hierarchical approximations to problems, that accumulate lower-level results, so amount of comm decreases as go up in the HW hierarchy.
1.73 +The communication-to-computation ratio of the hardware is worsening. This drove the previous suggestion of fractal-like communication within application code. In addition, memory size is growing more slowly than computation rate, and more slowly than hardware-supported parallelism. Both of these suggest that smaller work-units be found in code, else the amount of parallelism will be the bottleneck, leaving processors idle.
1.74
1.75
1.76 ==================================
1.77 @@ -471,9 +478,7 @@
1.78
1.79 -] "speculative exclusive access to local memory-line"
1.80
1.81 --] ultra-fast control messages between cores for use *inside runtime* only
1.82
1.83 --] Fast context switching.. either a reserved HW ctxt that is just for the runtime or else HW support for saving a ctxt check-point and later restoring it (in universal runtime, save ctxt checkpoint before using that ctxt to do runtime code)
1.84
1.85 -] HW to create a "soft" ctxt (a virtual processor with stack), checkpoint it and restore a checkpoint..
1.86
1.87 @@ -485,20 +490,14 @@
1.88
1.89 -] HW to support "namespace", which is a chunk of allocated memory that a virtual-processor sees.. all pointers within a namespace are offsets from the start of the namespace.. so have a reserved register that holds namespace base addr, and pointers are added to that to get final addr. Makes pointers equivalent to global, but relocatable. A namespace is essentially a stack with only one frame. When access an out-of-namespace pointer, the target namespace is accessed, the data brought over, added to the end (or malloc'd into the namespace), and pointers within the data are translated to new offsets. This provides automated HW management of distributed memories. If out-of-namespace pointer is within same addr-space, then it is directly accessed -- HW has a number of base-addr regs, which it can swap in and out
1.90
1.91 --] HW for tracking changes in local memories (whether cache or scratch-pad, HW support for "this unit contains changes from previous mark-point" -- for cache, means "line is dirty", for scratch-pad, means "line has been written since previous check-point")
1.92
1.93 --] HW support for speculation.. transactional memory is a speculation mechanism
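The namespace note translates directly into code. Below is a minimal C sketch, with the reserved base-address register modeled as a global variable (an assumption purely for illustration):

    #include <stdint.h>

    /* A namespace pointer is an offset from the namespace base; the base
       would live in a reserved register, modeled here as a global. */
    typedef uint64_t ns_ptr_t;        /* offset within the namespace */
    static uintptr_t ns_base;         /* stand-in for the base-addr register */

    static inline void *ns_deref(ns_ptr_t p)
    {
        return (void *)(ns_base + p); /* final addr = base + offset */
    }

    /* Relocating the namespace is just a base change; every ns_ptr_t
       stays valid -- the "global but relocatable" property. */
    static inline void ns_relocate(uintptr_t new_base) { ns_base = new_base; }

Out-of-namespace access, as described above, would add a slow path: fetch the data into the local namespace, then rewrite the pointers inside it to new offsets.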
1.94 -
1.95 --] HW support for independent code that manages memory transfers, such as scatter-gather, fill namespace from remote memory (runs in separate context, moves data between local memory and remote, including main-memory -- the context has access to the "changed" markers) -- this code performs translation of pointers from previous memory-space to new memory-space.. so pointers become base plus offset where base is start of the memory-space)
1.96 +-] HW support for independent code that performs translation of pointers from the previous memory-space to the new memory-space.. so pointers become base plus offset, where base is the start of the memory-space
1.97
1.98 -] HW support for memory spaces.. all data is viewed as existing within a memory-space, where that memory-space is a HW entity.. it has a start address and a length, so all pointers are offsets from the start addr (goes back to early main-frame ideas) In code, no difference from shared-memory -- all data is within a data-struct or array, and data-structs contain pointers -- difference is that code is supplied either by language-impl or by programmer that translates the pointers when data is copied or moved to a different memory-space. Each memory-space exists inside an addr-space, but is fully repositionable just by changing the base pointer.. Thinking one memory-space per virtual processor (SW ctxt)?
1.99
1.100 -======================
1.101 +=======================================
1.102
1.103 -Vision: app invoking a parallelism construct equals switch over to runtime HW ctxt (or checkpoint current), then perform construct semantics using the high-speed comm.. As an example, to implement the CAS instr, would switch to runtime, perform "exclusive access to memory-line" instr then compare, brch, write, etc
1.104 -
1.105 -===================================
1.106 -Taking a software stack as the organization of parallel software, with application code at the top, held within development tools, and resting upon a language interface.. (PStack picture) note, the runtime is separate from the executable, in this stack. The separation allows a single executable to run without modification on several versions of hardware, even though the runtime uses specialized hardware instructions.
1.107 +Application code sits at the top, held within development tools; the runtime is separate from the executable in this stack. The separation allows a single executable to run without modification on several versions of hardware, even though the runtime uses specialized hardware instructions.
1.108
1.109 The end point is the triple goal: Productivity, Performant-Portability, and Adoptability.
1.110
1.111 @@ -523,7 +522,8 @@
1.112
1.113
1.114 -] provide for toolchain to manipulate data-size of work-unit and code-content of work-unit, provide for data ancestry ("data footprint") to be tracked among work-units, provide for prediction of execution time of a work-unit, for real-time provide stating real-time related constraints on scheduling of units (latency, deadlines, quality relationship)
1.115 --- note, these don't all have to be language constructs, but could be, for example, code-snippets supplied to the language, via a construct. The snippets are then used either in the toolchain or in the runtime. Examples: DKU for task re-sizing, WorkTable for dynamic dependencies (H264 wait-unitl example)
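The deleted ``Vision" note above outlines how even a primitive like CAS could be realized through the runtime switch. A C sketch of the firm-ware sequence follows; the two line-access helpers stand in for the hypothetical ``exclusive access to memory-line" instruction and its release, and are assumptions of this sketch:

    #include <stdbool.h>
    #include <stdint.h>

    /* Stand-ins for hypothetical helper instructions; in firm-ware these
       would be single instructions, not calls. */
    extern void line_acquire_exclusive(volatile void *addr);
    extern void line_release(volatile void *addr);

    /* Firm-ware realization of CAS, reached via the switch-to-runtime
       op-code: acquire the line, compare, conditionally write, release. */
    bool firmware_cas(volatile uint64_t *addr, uint64_t expected,
                      uint64_t desired)
    {
        bool ok;
        line_acquire_exclusive(addr);  /* exclusive access to memory-line */
        ok = (*addr == expected);      /* compare */
        if (ok)
            *addr = desired;           /* write */
        line_release(addr);            /* end exclusive access */
        return ok;                     /* caller branches on the result */
    }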
1.116 +
1.117 +-- note, these don't all have to be language constructs, but could be, for example, code-snippets supplied to the language, via a construct. The snippets are then used either in the toolchain or in the runtime. Examples: DKU for task re-sizing, WorkTable for dynamic dependencies (H264 wait-until example)
1.118 -- purpose of each is in terms of the specialization process. Specialization is the embodiment of performant portability -- the term means any changes to UCC done for purposes of performance (define UCC).
1.119
1.120 Toolchain Layer:
