changeset 74:a65dcc907150

perf-tuning Small changes -- "levels" subsection is not complete!
author Sean Halle <seanhalle@yahoo.com>
date Thu, 09 Aug 2012 16:15:33 -0700
parents 2c858fb55da0
children f0347dbf1702
files 0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex
diffstat 1 files changed, 269 insertions(+), 14 deletions(-) [+]
line diff
     1.1 --- a/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Tue Aug 07 18:16:29 2012 +0200
     1.2 +++ b/0__Papers/Holistic_Model/Perf_Tune/latex/Holistic_Perf_Tuning.tex	Thu Aug 09 16:15:33 2012 -0700
     1.3 @@ -71,7 +71,8 @@
     1.4  Performance tuning is an important aspect of parallel programming. Yet when trying to track down the causes of performance loss, a great deal of knowledge about the internal structure of both the application and the runtime is often needed to understand how the observed patterns of performance have come to pass.
     1.5  The trend in parallel programming languages has been towards models that capture more structural information about the application, in an effort to increase both performance and ease of programming. This structural information can be used to improve performance tuning tools by making the causes of performance loss more readily apparent.
     1.6  We propose a universal but adaptable way of integrating more application structure into performance visualizations, relying on a model of parallel computation. The visualizations produced clearly identify idle cores, and tie the idleness to causal interactions within the runtime and hardware, and from there to the parallelism constructs that constrained the runtime and hardware behavior, thereby eliminating guesswork.
     1.7 -This is implemented for multi-core hardware, and we walk through a tuning session on a large multi-core machine to illustrate how performance loss is identified and how hypotheses for the cause are generated. We also give a concise description of the implementation and the computation model. \end{abstract}
     1.8 +This is implemented for multi-core hardware, and we walk through a tuning session on a large multi-core machine to illustrate how performance loss is identified and how hypotheses for the cause are generated. We also give a concise description of the implementation and the computation model. 
     1.9 +\end{abstract}
    1.10  
    1.11  
    1.12  
    1.13 @@ -87,13 +88,10 @@
    1.14  
    1.15  Current tools may cover all the parts of the application code, but fail to adequately connect their observations to the runtime behavior, scheduling decisions, and consequent hardware behavior. The decision about which task or virtual processor is assigned to which core at what point in time is at the heart of parallel performance, so these choices need to be highlighted and connected to the application features influencing them.
    1.16  
    1.17 -In this paper we describe a model and framework that provides a more complete picture. It captures the relevant application features (units of work and the constraints governing their execution), exposes the scheduling decisions taken in accordance with the application and other constraints, and provides a structure to which to attach the measurements captured during execution.
    1.18 -The model relates the measurements to each other and to the application.
    1.19 +In this paper we describe a model and framework that provides a more complete picture. It captures the relevant application features (units of work and the constraints governing their execution), exposes the scheduling decisions taken in accordance with the application and other constraints, and provides a structure to which to attach the measurements captured during execution. The model relates the measurements to each other and to the application.
    1.20   
    1.21  
    1.22 -Using the model, we generate views that show the schedulable units in the code, runtime overhead assigned to each, the scheduling decisions made for them by the runtime, and consequent usage of the cores. Identifying idle cores is immediate, and the connections indicated by visual features  enable more quickly generating the correct hypotheses for the causes of the performance losses. The visual features connect each unit to the  segment of code executed, and 
    1.23 -%% hard to understand
    1.24 -to the constraint in the code that combined with the runtime to cause its placement in time and location. The pattern of placements, combined with contents of the code, leads to the hypothesis. 
    1.25 +Views generated by using the model identify the schedulable units in the application code, and indicate how causal activities related to them result in the observed performance.  The result is like having a map showing idle cores linked back to the spots in the application code responsible. The information visualized includes the units, the scheduling decisions made for each by the runtime, runtime overhead assigned to each, and consequent usage of the cores. Identifying idle cores is immediate, and the visual features  enable more quickly generating the correct hypotheses for the causes of the performance losses, and pinpointing where in the application code changes can be made to improve performance.
    1.26  
    1.27  The model of computation and its usage is illustrated with a story line of performance tuning a standard parallel application on a large multi-core system.
    1.28  
    1.29 @@ -103,9 +101,10 @@
    1.30  \section{Background and Related Work}
    1.31  \label{sec:related}
    1.32  
    1.33 -Performance tuning is an iterative process that involves a mental model. The programmer takes measurements during execution that are then compared to the desired outcome. A mental model, constructed through experience and knowledge of the mechanics of execution, is used to generate a hypothesis explaining any discrepancies between the measurement and expectations. This hypothesis is then linked, again through a mental model, to things within the programmer's control, to suggest a change to make to the code. The modified code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central to performance tuning.
    1.34 + Performance tuning is an iterative process that involves a mental model. The programmer takes measurements during execution and compares them to the desired outcome. A mental model, constructed through experience and knowledge of the mechanics of execution, is used to generate a hypothesis explaining any discrepancies between the measurement and expectations. This hypothesis is then linked, again through a mental model, to things within the programmer's control, to suggest a change to make to the code. The modified code is run again, and these steps are repeated until the programmer is satisfied with the performance of the program. Thus, the mental model is central to performance tuning.
    1.35  
    1.36  The pthreads abstraction is very close to the hardware. There are almost no applications whose structure maps gracefully onto pthreads, which means that the user has to simultaneously keep in mind and connect two very different mental models. This accounts for much of the difficulty of programming with threads, and remains a problem when analyzing performance. Yet as the earliest and most widespread parallel programming model, and one which serves as the basis on top of which many other models are implemented, it must be supported.
    1.37 +
    1.38  Pthreads intentionally introduces randomness, and its synchronization constructs imply only indirect connections. For instance, very little conclusion can be drawn about the relationship between the computations in two separate threads from their accessing the same lock consecutively. Tools such as Paradyn \cite{PerfToolParadyn} or VTune \cite{PerfToolVTune} that rely on this model consequently have a hard time connecting measurements to the application. They collect a wealth of statistics, but the application is to the tool a foreign process, where things ``just happen'' for no reason. The anticipated problems are things like ``application bottleneck is synchronisation'', and the detailed problems ``too much time is spent in spinlocks''. If there is more than one use of spinlocks in the application, it is not even obvious where the problem actually is.
    1.39  
    1.40  One fix to these problems is to allow the users to introduce measuring points into their own code. While this allows for a great deal of flexibility, it requires a lot more effort. One major advantage of this approach is that instrumentation code is written in the source language, so it has access to application concepts. This advantage can be kept with automated instrumentation, by providing an instrumenting compiler, like the Tau \cite{PerfToolTau} project does.
    1.41 @@ -367,15 +366,15 @@
    1.42  
    1.43  There is one kind of region for each reason that the core is being used (or being forced idle), and several kinds of arcs, one for each type of causality between regions.
    1.44  
    1.45 -The core activities associated with region types are: application work, waiting for communication of work data, managing constraints, choosing assignment of work onto cores, and runtime internals. The runtime internals in turn have sub-categories, which we will not detail here, but some of which will be mentioned in section \ref{sec:Implementation}.
    1.46 +The core activities of the region types are: application work, idling while waiting for communication of work data, managing constraints, choosing assignment of work onto cores, and runtime internals. The runtime internals in turn have sub-categories, which we will not detail here, but some of which will be mentioned in section \ref{sec:Implementation}.
    1.47  
    1.48 -The arc types, representing the type of causal relationship, are: control dependency in the base language, parallel constraint that had to be satisfied (i.e., one unit did something to satisfy a constraint on the other, causing it to be free to be scheduled), runtime internal causality such as a global lock (runtime on one core releases the lock, causing the other to acquire it), and arcs that represent hardware causal relationships (one work-unit finishing on a core causes another work-unit to start there, given the choice by the runtime). The formal details are given in the longer format companion paper.
    1.49 +There is an arc-type for each category of causality. The arc-types (causality categories) represent: sequential dependency in the base language being satisfied, parallel constraint being satisfied (i.e., one unit did something to satisfy a constraint on the other, causing it to be free to be scheduled), runtime internal causality such as a global lock (runtime on one core releases the lock, causing the other to acquire it), and arcs that represent hardware causal relationships (one work-unit finishing on a core causes another work-unit to start there, given the choice by the runtime). The formal details are given in the longer format companion paper.
    1.50  
    1.51  We will now look at each source of causal relationship.
    1.52  
    1.53 -\paragraph{Constraint  causality} There is a constraint causality when two units are involved in a constraint, where action by one unit causes (or contributes to) satisfaction of the constraint blocking the other unit. This includes control dependencies from the base language.
    1.54 +\paragraph{Constraint  causality} There is a constraint causality when  action by one unit causes (or contributes to) satisfaction of the constraint blocking the other unit. This includes sequential dependencies from the base language.
    1.55  
    1.56 -Control dependencies may add superfluous constraints that  eliminate some otherwise allowed choices in the UCC. An example would be a \texttt{for} loop that creates work-units -- no parallelism constructs cause the creations to be done in sequence, but the base C language sequentializes it nonetheless. 
    1.57 +Sequential dependencies may add superfluous constraints that  eliminate some otherwise allowed choices in the UCC. An example would be a \texttt{for} loop that creates work-units -- no parallelism constructs cause the creations to be done in sequence, but the base C language sequentializes it nonetheless. 
    1.58  
    1.59  \paragraph{Runtime internal causality} Runtime implementation details may introduce ``extra" causalities between units. For example, the version of VMS we instrumented for this paper runs separately on each core and relies upon a global lock for accessing shared runtime information. This lock introduces a causal relationship  when the runtime on one core is attempting to process one unit, but must wait for the runtime on a different core to finish with its unit.
    1.60  
    1.61 @@ -419,11 +418,11 @@
    1.62  
    1.63  With multiple levels of scheduling activities, a scheduling consequence graph can be constructed for each level. How, then, do these tie together?
    1.64  
    1.65 -For a given level, a lower-level runtime is seen as a single processing unit. What is the consequence of deciding to schedule a given work unit on a given processing unit? When trying to improve performance, it is the time taken until completion of the work unit that is of interest.
    1.66 +For a given level, a lower-level runtime is seen as a single processing unit. So, what is the consequence of deciding to schedule a given work unit onto one of those  processing units? When the goal is  improving performance, the consequence of interest is the time taken until completion of the work unit.
    1.67  
    1.68 -If we now look at the lower level, the processing unit is often not in fact a single unit but itself a parallel system comprised of several processing units. When can the upper-level work unit be considered completed? When all the lower-level units of which it is comprised are completed. Once the lower-level consequence graph is established, this time can be easily determined: it is the critical path across this graph, the longest path between all units having no propendents and the units with no dependents.
    1.69 +Hence, the question becomes,  when can the upper-level work unit be considered completed? The answer is, when all the lower-level units of which it is comprised are completed, plus any additional work spent dividing up the work and combining results. Once the lower-level consequence graph is established, this  time can be easily determined: it is the critical path across the lower-level consequence graph plus the additional time spent creating the lower-level units and combining results into the format of the higher-level unit.
    1.70  
    1.71 -Consider the concrete example of the SSR matrix multiply application, from the story in section \ref{sec:casestudy}. Going up,  the level above the execution under study involves the invocation of entire applications, via OS commands. At that level, a unit is an entire process, and the critical path of the SCG in section \ref{sec:casestudy} is the work-time of that unit.
    1.72 +Consider the concrete example of the SSR matrix multiply application, from the story in section \ref{sec:casestudy}. Going up,  the level above the execution under study involves the invocations of entire applications, via OS commands. At that level, a unit is an entire process, and the  work-time of that unit is the execution time of the application. The SSR matrix multiply execution time includes the critical path through the matrix multiply, plus creation of the various VPs, and collection of the results by the results VP. If the story in section \ref{sec:casestudy} were done in hierarchical fashion, the creation
    1.73  
    1.74  One thing that is specific to the consequence graph and does not appear in the UCC is overhead. How is overhead dealt with across levels? Looking again at our SSR matrix-multiply, it is visible that the overhead recorded in the SSR-level consequence graph, when it contributes to the critical path, is counted as part of the work-time of the application unit. Only the overhead of the runtime level under investigation is distinguished in the SCG.
    1.75  
    1.76 @@ -996,3 +995,259 @@
    1.77  
    1.78  
    1.79  
    1.80 + all, why go parallel, if not for higher performance? Yet when trying to track 
    1.81 +
    1.82 +
    1.83 +
    1.84 +
    1.85 +
    1.86 +
    1.87 +
    1.88 +
    1.89 +
    1.90 +
    1.91 +
    1.92 +
    1.93 +
    1.94 +
    1.95 +
    1.96 +
    1.97 +
    1.98 +
    1.99 +
   1.100 +
   1.101 +
   1.102 +
   1.103 +
   1.104 +
   1.105 +
   1.106 +
   1.107 +
   1.108 +
   1.109 +
   1.110 +
   1.111 +
   1.112 +
   1.113 +
   1.114 +
   1.115 +
   1.116 +
   1.117 +
   1.118 +
   1.119 +
   1.120 +
   1.121 +
   1.122 +
   1.123 +
   1.124 +
   1.125 +
   1.126 +
   1.127 +
   1.128 +
   1.129 +
   1.130 +
   1.131 +
   1.132 +
   1.133 +
   1.134 +
   1.135 +
   1.136 +
   1.137 +
   1.138 +
   1.139 +
   1.140 +
   1.141 +
   1.142 +
   1.143 +
   1.144 +
   1.145 +
   1.146 +
   1.147 +
   1.148 +
   1.149 +
   1.150 +
   1.151 +
   1.152 +
   1.153 +
   1.154 +
   1.155 +
   1.156 +
   1.157 +
   1.158 +
   1.159 +
   1.160 +
   1.161 +
   1.162 +
   1.163 +
   1.164 +
   1.165 +
   1.166 +
   1.167 +
   1.168 +
   1.169 +
   1.170 +
   1.171 +
   1.172 +
   1.173 +
   1.174 +
   1.175 +
   1.176 +
   1.177 +
   1.178 +
   1.179 +
   1.180 +
   1.181 +
   1.182 +
   1.183 +
   1.184 +
   1.185 +
   1.186 +
   1.187 +
   1.188 +
   1.189 +
   1.190 +
   1.191 +
   1.192 +
   1.193 +
   1.194 +
   1.195 +
   1.196 +
   1.197 +
   1.198 +
   1.199 +
   1.200 +
   1.201 +
   1.202 +
   1.203 +
   1.204 +
   1.205 +
   1.206 +
   1.207 +
   1.208 +
   1.209 +
   1.210 +
   1.211 +
   1.212 +
   1.213 +
   1.214 +
   1.215 +
   1.216 +
   1.217 +
   1.218 +
   1.219 +
   1.220 +
   1.221 +
   1.222 +
   1.223 +
   1.224 +
   1.225 +
   1.226 +
   1.227 +
   1.228 +
   1.229 +
   1.230 +
   1.231 +
   1.232 +
   1.233 +
   1.234 +
   1.235 +
   1.236 +
   1.237 +
   1.238 +
   1.239 +
   1.240 +
   1.241 +
   1.242 +
   1.243 +
   1.244 +
   1.245 +
   1.246 +
   1.247 +
   1.248 +
   1.249 +
   1.250 +
   1.251 +
   1.252 +
   1.253 +
   1.254 +
   1.255 +
   1.256 +
   1.257 +
   1.258 +
   1.259 +
   1.260 +
   1.261 +
   1.262 +
   1.263 +
   1.264 +
   1.265 +
   1.266 +
   1.267 +
   1.268 +
   1.269 +
   1.270 +
   1.271 +
   1.272 +
   1.273 +
   1.274 +
   1.275 +
   1.276 +
   1.277 +
   1.278 +
   1.279 +
   1.280 +
   1.281 +
   1.282 +
   1.283 +
   1.284 +
   1.285 +
   1.286 +
   1.287 +
   1.288 +
   1.289 +
   1.290 +
   1.291 +
   1.292 +
   1.293 +
   1.294 +
   1.295 +
   1.296 +
   1.297 +
   1.298 +
   1.299 +
   1.300 +
   1.301 +
   1.302 +
   1.303 +
   1.304 +
   1.305 +
   1.306 +
   1.307 +
   1.308 +
   1.309 +
   1.310 +
   1.311 +
   1.312 +
   1.313 +
   1.314 +
   1.315 +
   1.316 +
   1.317 +
   1.318 +
   1.319 +
   1.320 +
   1.321 +
   1.322 +
   1.323 +
   1.324 +
   1.325 +
   1.326 +
   1.327 +
   1.328 +
   1.329 +
   1.330 +
   1.331 +
   1.332 +
   1.333 +
   1.334 +
   1.335 +