2__Other/jsps_proposal/research plan @ 8:98e9df819eaf

scholarship applications
author Nina Engelhardt <nengel@mailbox.tu-berlin.de>
date Tue, 14 May 2013 12:03:39 +0200
parents 1d37e9d849e8
15. Research Plan in Japan
a. Present research related to research plan

I am currently working on a PhD in the domain of runtimes for parallel programming models. The object of my research is to improve the VMS[1] runtime framework. VMS addresses the productivity gap between sequential and parallel programming. The difficulty of parallel programming is reduced when the programming model offers high-level concepts that closely match the application's concepts. However, the more specialized a programming model is, the fewer users it has, and so less effort can be spent on developing the model due to diminishing returns (it becomes useless when the time to write the runtime exceeds the extra time it takes to write the applications in a less suitable model).

VMS brings the simplicity of sequential programming to the development of parallel runtimes by providing a base layer that takes charge of the most difficult aspect, synchronization. Building on this, the runtime writer only needs to provide specific methods in the form of a plugin to support a given programming model. These plugin methods are easy to write because they need to be neither thread-safe nor reentrant.
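To picture this division of labor, the following is a minimal illustrative sketch (all names are hypothetical, not the actual VMS API): a core that serializes every call into the plugin, so the plugin's scheduling logic can be written as ordinary sequential code.

```python
import threading
from collections import deque

class Core:
    """Toy stand-in for a VMS-style base layer: it owns the only lock,
    so plugin methods run one at a time and need no synchronization."""
    def __init__(self, plugin):
        self._plugin = plugin
        self._lock = threading.Lock()

    def request(self, event, payload=None):
        # All synchronization lives in the core; the plugin below is
        # plain sequential code (neither thread-safe nor reentrant).
        with self._lock:
            return self._plugin.handle(event, payload)

class FifoTaskPlugin:
    """Plugin for a trivial task-based model: a plain FIFO scheduler,
    written without a single lock."""
    def __init__(self):
        self.ready = deque()

    def handle(self, event, payload):
        if event == "submit":
            self.ready.append(payload)
        elif event == "next":
            return self.ready.popleft() if self.ready else None

core = Core(FifoTaskPlugin())
for i in range(3):
    core.request("submit", f"task-{i}")
print(core.request("next"))   # prints task-0
```

In this sketch, any number of worker threads could call core.request concurrently; only the core's lock placement would ever need to change, never the plugin.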

I have been evaluating the performance of the current VMS implementation. It was designed for cache-coherent shared-memory machines not exceeding a dozen cores and performs well on these machines, offering performance comparable to the default runtime libraries for programming models such as OmpSs[2]. However, some of the assumptions in the VMS abstractions are unsuited to larger and/or distributed systems. I have started development of a distributed version of VMS. The goal is to have a first working implementation ready that will then be improved during the research stay using the combined knowledge of distributed systems at TokyoTech and parallel runtimes at TU Berlin.

Concurrently, additional benchmarks using different combinations of features will be developed to evaluate the runtime's performance and establish a baseline against which to compare improvements.
b. Purpose of proposed research

The embedded systems of tomorrow increasingly resemble the high-performance systems of today, featuring computationally intensive parallel applications and complex arrangements of large numbers of cores. In both contexts, the goal is to extract maximal performance from these systems at minimal energy cost. Domain-Specific Languages are generally recognized as the most promising tool to efficiently achieve this end. However, DSLs are themselves a programming challenge, which we wish to solve. This project aims to harness the combined experience and resources of the TU Berlin Embedded Systems Architecture lab and TokyoTech's Matsuoka lab to solve critical performance and productivity issues that affect parallel and distributed application development.
c. Proposed plan

In this cooperation, we wish to bring the VMS approach from TU Berlin, which has proven successful for consumer applications on commodity hardware, to specialty systems such as the Tsubame 2.0 supercomputer at TokyoTech. Our plan takes extensive advantage of the complementary expertise of both labs, combining TokyoTech's strengths in parallel and distributed application development with TU Berlin's knowledge of low-level hardware and runtime detail.

Main tasks during the research stay:
- Define a set of basic abstractions suited to distributed systems that enable plugins for many different programming models. These abstractions should support several goals:
1. Generality: allow implementation of different types of programming models (thread-based, task-based, dataflow, ...). This way, all programming models built on the common foundation will profit from improvements to the efficiency of the base layer. Sharing a basic structure will also increase compatibility between different models, removing an important barrier to Hybrid Programming, an approach that is rapidly gaining popularity.
2. Simplicity: present only a few basic abstractions that can be quickly understood. This increases the productivity of runtime developers, makes it easier to implement the abstractions for different architectures, and reduces occasions for mistakes.
3. Modularity: encourage separation of the different runtime functionalities. Modularity enables two approaches to boosting performance.
One is to selectively load only the functionalities a specific application needs. Often, a significant part of runtime overhead is spent ensuring compatibility with a wide range of features of the programming model, or even with other programming models, even though most of these features are not used in the particular application. If the runtime knew which features were needed, it could reduce overhead by eliding the unnecessary functionality.
The other is to make it easier to integrate specialized hardware that accelerates runtime functions. As a solution to ever larger transistor counts under ever tighter energy budgets, future architectures will include numerous specialized units that can be selectively activated when an application profits from them. If the runtime is highly modular, only a small part of it needs to be replaced with a variant that takes advantage of the special-purpose hardware.
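Both approaches reduce to the same mechanism: assembling the runtime from interchangeable feature modules. The following sketch illustrates the idea with feature elision (all module and feature names are hypothetical, not actual VMS components): an application that declares no task dependencies gets a null tracker whose checks cost nothing.

```python
class NullDependencyTracker:
    """Loaded when the application declares no task dependencies:
    every task is immediately runnable, so tracking costs nothing."""
    def task_ready(self, task, deps):
        return True

class GraphDependencyTracker:
    """Full tracker, loaded only when the 'dependencies' feature is on."""
    def __init__(self):
        self.completed = set()
    def task_ready(self, task, deps):
        return all(d in self.completed for d in deps)
    def mark_done(self, task):
        self.completed.add(task)

# Map of optional features to their full implementations.
FEATURE_MODULES = {"dependencies": GraphDependencyTracker}

def build_runtime(features):
    """Assemble a runtime from only the requested feature modules;
    everything else is replaced by a zero-overhead null variant."""
    if "dependencies" in features:
        return FEATURE_MODULES["dependencies"]()
    return NullDependencyTracker()

lean = build_runtime(features=set())
full = build_runtime(features={"dependencies"})
print(lean.task_ready("t1", deps={"t0"}))   # True: checks elided
print(full.task_ready("t1", deps={"t0"}))   # False: t0 not yet complete
```

The same mapping is the natural seam for hardware integration: a variant of the tracker backed by a special-purpose unit could be registered in place of the software one without touching the rest of the runtime.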

Possibly offer simplified abstractions for common models (thread- and task-based programming models in particular).

- Develop a modularized plugin for the OmpSs[2] and MPI[3] programming models using the new abstractions. These programming models will serve as case studies to evaluate the suitability of the abstractions from both a productivity and a performance perspective. Simultaneously, this will allow us to investigate whether certain scalability challenges in dynamic dependency tracking are solvable or inherent to the problem.

- Apply the improved runtimes to selected scientific and engineering applications developed at TokyoTech. In particular, we wish to exploit more fine-grained parallelism in the Fast Multipole Method, a calculation used in many physics and chemistry simulations, such as molecular dynamics and plasma physics. Additionally, we aim to port common kernels such as LU/Cholesky decomposition, iterative methods, and structured/unstructured grid computations.

d. Expected results and impacts

The expected immediate benefit is a runtime whose overhead is proportional to the feature set actually used by the application. This makes powerful programming models usable for a larger group of applications.

Through the cooperation, the Matsuoka lab at TokyoTech, and more generally the Japanese HPC community, will obtain more efficient runtimes and hands-on experience in their use, while the AES lab at TU Berlin will gain access to important real-world applications and resources beyond consumer scale.

Beyond the direct impact, this work will also lay the foundation for future collaboration between TokyoTech and TU Berlin. Using the distributed VMS platform, we plan to investigate runtimes that can automatically adjust task size to different systems' capabilities (number of cores, computation vs. communication speed, etc.). It will also serve as the basis for developing a set of hardware accelerators for schedulers.

[1] Halle, S., & Cohen, A. (2011). A mutable hardware abstraction to replace threads. 24th International Workshop on Languages and Compilers for Parallel Computing (LCPC).
[2] Duran, A., Ayguadé, E., Badia, R. M., Labarta, J., Martinell, L., Martorell, X., & Planas, J. (2011). OmpSs: a proposal for programming heterogeneous multi-core architectures. Parallel Processing Letters, 21(02), 173-193.
[3] Gropp, W., Lusk, E. L., & Skjellum, A. (1999). Using MPI: Portable Parallel Programming with the Message Passing Interface (Vol. 1). MIT Press.