VMS/0__Writings/kshalle: 7b45d1764e2b 0__Papers/Future_Architecture/Future

283 Process technology is rapidly approaching multiple fundamental physical limits that combine to increase the cost of producing chips faster than the increase in density. Whether or not process technology continues to shrink, the economics will slow the demand for chips from future processes. This will, for the first time, shift the design decisions from being shortest time-to-market towards being the greatest efficiency per transistor -- either in computation delivered towards the goal end-user measurement such as throughput, or in energy per operation.

284

285 The implication for processor architecture is to accelerate the trend away from "easy to program" structures such as out-of-order pipelines and coherent shared memory, and towards hierarchical designs that force application information to be exploited in explicitly managing the placement and movement of data, in order to keep it as local for as much of the computation as possible. The cost of communication will be the dominant factor driving the shape of future architectures, both from a computation-bottleneck cost and an energy per operation cost.

286

287 Integration will be pushed towards the third dimension in order to mitigate the communication costs -- indeed main memories have long been on that path. This approach will hasten the development of 3 dimensional integration processes. The architecture implication is that

288 \end{abstract}

289

290 \section{Motivation}

291

292 Threads were invented to be virtual sequential processors, but are

293 problematic when one has multiple physical cores. Not only

294 are they difficult to use for parallel code, but their parallel performance is poor

295 because they block the language from controlling

296 task placement. The language often knows which cores are likely to already

297 have data in the cache which a new task requires as input. If it had

298 control, the language could place tasks where the data is likely to already

299 reside, reducing communication and increasing performance and power

300 efficiency. A thread replacement is needed that is both easy to use in

301 applications and gives the language control over task placement.

302

303 The first question is, what is the easiest to use parallelism construct? There

304 exist many to choose from: Threads (for legacy reasons); Actors

305 {\cite{Hewitt10}}{\cite{Actors97}}; Components {\cite{ComponentModel00}};

306 process calculi {\cite{hoare78}} {\cite{milner99}};

307 coordination languages {\cite{Gelernter85Linda}}; and new ones

308 continually invented.

309

310 We demonstrate in this paper a thread replacement that lets any such parallelism

311 construct be plugged in. It is the first extensible hardware abstraction,

312 allowing to plug-in both preferred parallelism constructs and preferred

313 runtime scheduler. We focus in this introductory paper on

314 the parallelism construct support, illustrating how to implement synchronous

315 \texttt{send}-\texttt{receive} constructs motivated by process calculi,

316 the \texttt{spawn} and \texttt{sync} constructs from Cilk, and even

317 \texttt{mutex} and \texttt{condition variable} constructs from pthreads.

318 However, the schedulers we implemented have no

319 interesting performance optimizations.

320

321 The next question is, how much effort is required to implement a plugin? Our

322 mechanism protects the plugin from exposure to concurrency issues, allowing

323 the parallelism constructs to be implemented with sequential algorithms. It

324 accomplishes this by introducing a ``virtual'' time that sequentializes events

325 that appear, to the program, to be simultaneous.

326

327 We call our abstraction Virtualized Master-Slave, or VMS. We demonstrate a user-level

328 implementation in this paper, although it is ideally implemented as the base

329 hardware abstraction of the OS.

330

331 It involves subtle concepts related to time in the program vs time in the

332 plugin vs physical time. We explain it in four levels, starting with an abstract definition

333 and moving down to implementation.

334

335 The first level is the theoretical definition, given in Section \ref{secAbsModel}. Second is the elements

336 of our implementation and how they relate to the theoretical definition, in Section \ref{secInternal}.

337 Third is the application code point of view, in Section \ref{secApp}, which is tied back to both

338 the abstract model and the internal elements. The fourth view,

339 also in Section \ref{secApp}, shows the implementation of the plugin for a parallel

340 construct. Additionally,

341 measurements of effectiveness appear

342 in Section \ref{secResults} and conclusion is in \ref{secConclusion}.

343

344 \section{Background and Related Work}

345

346 {\bf Side-stepping OS Threads:\ }User-level thread packages and most parallel

347 language runtimes have to side-step OS threads, by pinning one to each core,

348 which effectively gives the user-level package control over the core. Our VMS

349 implementation also does this. We are not claiming in this paper to have the

350 OS level implementation of VMS that the model is capable of -- just the

351 user-space version.

352

353 Related work either provides a variation on the Thread model, or is a full

354 language with included parallelism constructs. For example,

355 TBB {\cite{TBBHome}} is a user-space thread package with added features.

356 It allows

357 tasks to create other tasks and control when to start their

358 execution. However, one cannot plug in alternate parallelism-control constructs, such as

359 rendez-vous style send and receive, which would have to be made using locks.

360 Likewise, Cilk {\cite{CILKHome}}

361 provides a simpler and easier to use version of TBB's task scheduler, but is

362 limited to only tree-shaped task graphs. OpenMP {\cite{OpenMPHome}} allows

363 creating tasks and controlling their execution order,

364 but is a language, not a hardware abstraction. All of these may

365 alternatively be implemented via plugins to VMS, with similar performance.

366

367 In contrast, VMS is the first hardware abstraction that doesn't impose its own concurrency

368 semantics, but rather takes preferred ones as plugins. This makes it not a

369 language itself, but a {\em{support}} mechanism for language level

370 parallelism constructs. The parallelism constructs in Actors, Components,

371 pthreads, and so on may all be implemented as VMS plugins.

372

373 {\bf Virtual Processor (VP) Definition:\ }We want to avoid the confusion

374 associated with the terms ``thread'' and ``task'' so will use the term

375 {\em{virtual processor}} (VP), which we define as state in combination with

376 the ability to animate code or {\em{an additional level of virtual

377 processors}}. The state consists of a stack with its contents, a program

378 counter, a pointer to top of stack, and a pointer to the current stack frame.

379

380 \section{Abstract Definition of VMS}

381 \label{secAbsModel}

382

383 We give an intuitive overview, then add details in the following sub-sections.

384

385 {\bf Intuitive Overview:\ }VMS is concerned primarily with time and

386 guarantees about it. This is because parallelism constructs control how the

387 time-lines of different virtual processors intersect. They also guarantee

388 relations of time lines to hardware events.

389

390 As an example, consider a program that writes into a data structure in one

391 time-line, then calls a \texttt{send} construct, meanwhile in a different

392 time-line it calls the \texttt{receive} construct then reads the data

393 structure. The constructs should guarantee that all data written before the

394 \texttt{send} is readable in the other time-line after the

395 \texttt{receive}. VMS provides primitive guarantees, which plugged-in code

396 builds upon to provide such higher-level guarantees.

397

398 To support parallelism constructs, VMS provides: primitive operations to create and suspend

399 VPs; a way for plugged-in code to control when each VP is (re)started; and

400 time-related guarantees. These are enforced on all hardware, be it shared

401 memory or distributed, with strong memory consistency or weak.

402

403 {\bf Definition in Three Parts:\ }We give the abstract definition in three

404 parts: a definition of the elements of a VMS computation system; a definition

405 of time and the key VMS guarantee; and a definition of virtual processor

406 scheduling states and transitions between them.

407

408 The definition we give is for VMS {\em{with plugins present}}. \ Hence, it

409 covers the behavior of all possible parallelism constructs implementable with

410 VMS. The Master mentioned in the definition is an abstract

411 entity, with a plugin present. In practice, this Master entity is implemented

412 as part of a core VMS, and plugins later added. This VMS-core is the

413 hardware abstraction. It hides the physical hardware behind an interface that

414 creates virtual processors and enforces well-defined time-behavior.

415

416

417 \subsection{The Elements of a VMS Computation System}

418

419 \texttt{- }A VMS program has multiple VPs, which are Slaves,

420 each with an independent time-line.

421

422 \texttt{- }A schedule of Slaves is generated by a Master

423 entity, from within separate time-line(s).

424

425 \texttt{- }A schedule is defined as the set of points at which

426 VPs are (re)animated.

427

428 \texttt{- }All semantic parallelism behavior is invoked via

429 communication with the Master.

430

431 \texttt{- }Communication with the Master happens by using a

432 model-provided primitive, which causes {\em{voluntary}} suspension of the

433 program's VP.

434

435

436 What is important here is: that the choice of which VP is animated, at which

437 point, happens in a separate time-line; and that the VPs suspend voluntarily

438 for each parallelism construct. This means that {\em{scheduling is separated

439 from the application code}}, the key point.

440

441 The Master entity appears to be a single entity to the slaves, but may be

442 implemented by multiple Master VPs hidden inside the VMS implementation.

443

444 VPs use the Master as an intermediary to: semantically communicate with each

445 other; cause creation of new program VPs; and to influence re-animation of

446 VPs. As a subtlety, notice that hardware mechanisms, such as coherent shared

447 memory, allow communication to take place that is not visible to the

448 parallelism constructs. Parallelism constructs must be separately called in

449 order to make use of shared variable communication safe.

450

451 {\bf Definitions:\ }VMS is intended only for hardware systems that consist of processing elements

452 connected by communication. We define a memory-space to be a processing

453 element, albeit without the ability to transform data. We define a

454 {\em{physical core}} to be a processing element that {\em{does}} transform

455 data, and require that it execute a sequential stream of instructions. We

456 define a program-time as the sequence of instructions animated by a Slave VP

457 (which is eventually animated by a physical core). A Slave VP has associated

458 {\em{scheduling state}} that, among other things, relates to how its

459 program-time progresses relative to physical time on the cores.

460

461 \subsection{Time in VMS}

462

463 \texttt{- }VMS has three levels of time: {\em{Program time}},

464 {\em{ Master time}}, and {\em{Virtual time}}.

465

466 \texttt{- }Program time is local to a Slave VP, measured in

467 instruction executions.

468

469 \texttt{- }Master time is hidden from the program and is

470 independent from all Program times.

471

472 \texttt{- }Virtual time is the ordered set of changes in

473 scheduling state of Slave VPs.

474

475

476

477 What is most important here is that Virtual time defines a global sequential

478 ordering. This ordering is consistent with the key VMS guarantee (given

479 below), and each point in it is computed within Master time.

480

481 Also, the independence between program times and master time has subtle

482 advantages. It enables elegant enforcement of the VMS guarantee, and

483 implementation simplifications that become clear after gaining deep

484 implementation knowledge.

485

486 In VMS, each event relevant to parallel semantics is tied to a transition of

487 the state of a Slave VP. This means that implementing the behavior of parallel

488 semantics is equivalent to controlling the order of transitions of state of

489 virtual processors.

490

491 {\bf Definitions:\ }The stream of instructions in a given program-time is

492 broken into a number of {\em{trace-segments}}, separated by suspension

493 points. Each trace-segment is animated by a single physical core, but not

494 necessarily the same core as animated the other trace segments. A suspend

495 point is created by a Slave VP executing the ``suspend'' primitive provided by

496 VMS. A suspend point has no duration in program time, but has distinct start

497 and end points in virtual time. The end-suspension points of two different

498 program times can be tied together within virtual time, which is called a

499 {\em{tie point}} and has special significance to parallel constructs. The

500 physical-time of a core has no relationship to any program time, except for

501 the various time-guarantees in this definition of VMS.

502

503 \begin{figure}[ht]

504 \includegraphics[width=2.9in]{../figures/Time_in_VMS_1.png}

505 \caption

506 {Mapping program time onto Virtual time. \ The

507 Master controls creation of new program time lines, and ending suspend

508 points. Here, it has ended two suspend points at a common tie-point.

509 }

510 \label{figTimeMapping}

511 \end{figure}

512

513 {\bf Relating time-lines to each other:\ }Figure \ref{figTimeMapping}

514 illustrates how trace-segments relate to suspend points, and map onto

515 virtual time. A trace segment starts in virtual time where suspend is ended, as seen.

516 In fact, the two trace segments shown have a common

517 start-point within virtual time. This is because the parallelism semantics

518 chose to start them at the same point -- this is what a tie

519 point is. A key note is that the lengths in virtual time have no relation to

520 the lengths in program-time. The only defined feature is that those two

521 trace-segments have a common start-point in virtual time. This means that the

522 two suspend points are considered to be tied together.

523

524 {\bf The Key VMS guarantee:\ }\label{VMSguarantee}Being tied together

525 means that all physical events that can be observed by both program-times are

526 covered by the key VMS guarantee: any events triggered before the common

527 suspend point in one program time are guaranteed visible in the other program

528 time after the common suspend point. They {\em{might}} be visible before,

529 but it's not guaranteed. In addition, events triggered after the common

530 suspend point in one are guaranteed not visible before the common suspend

531 point in the other. {\em{This two-part guarantee can be considered the heart

532 of VMS.}}

533

534 Intuitively, a tie-point separates before it from after such that tied program

535 times agree (illustrated with code in Section \ref{secApp} Figure \ref{figAnimVP}). But the subtlety is

536 that events triggered before the tie-point, {\em{might}} be visible to the

537 other before, and ones triggered after {\em{might not}} be visible to the

538 other after -- physical events triggered before are only guaranteed visible

539 {\em{after}} the tie point, and events after are only guaranteed

540 {\em{not}} visible {\em{before}} the tie point.

541

542 This is a form of bounded non-determinism. The pattern of suspension

543 end-points determines which trace-segments overlap in Virtual time, and events

544 triggered in one might be visible in overlapped ones. But no guarantees cover

545 these. If one segment tries to observe, it will see events triggered by

546 overlapped segments in non-deterministic order.

547

548 The VMS implementation defines which physical events are covered by the key

549 VMS guarantee (reads/writes, network communication, DMA, I/O).

550

551 {\bf Globally consistent sequential order:\ }VMS maps suspend-start,

552 suspend-end, and hence tie-points, to a globally-consistent sequential order

553 in Virtual time. This enables one of VMS's key benefits: sequential

554 algorithms for parallel constructs.

555

556 Tie points define parallel behavior, so an implementation of how to choose tie

557 points equals an implementation of parallel constructs. The Master chooses

558 tie-points, so plugging code to choose tie-points into the Master equals

559 plugging in parallel constructs.

560

561 \subsection{Scheduling State}

562

563 Scheduling state is used in VMS to organize internal activity, for enforcing

564 the guarantees.

565

566 \texttt{- }VPs have three scheduling states: {\em{Animated}},

567 {\em{Blocked}}, {\em{Ready}} (Figure \ref{figStates}).

568

569 \texttt{- }VPs in Animated are {\em{allowed}} to advance

570 program-time with {\em{local}} physical time.

571

572 \texttt{- }VPs in Blocked and Ready do not advance their

573 program-time.

574

575 \texttt{- }Animated has two physical states:

576 {\em{Progressing}} and {\em{Stalled}}.

577

578 \texttt{- }VPs in Progressing advance program-time with local

579 physical time, those in Stalled do not (allowing non-semantic suspend).

580

581 \texttt{- }Scheduling states are defined in Virtual time only.

582

583 \texttt{- }Progressing and Stalled are defined in local

584 physical time only (invisible in Virtual).

585

586

587 \begin{figure}[h]

588 \includegraphics{../figures/Scheduling_states_2.png}

589 \caption{Scheduling states of a slave VP in the VMS model.}

590 \label{figStates}

591 \end{figure}

592

593 Some important points: 1) only VPs Animated can trigger physical events that

594 are seen in other program time-lines; 2) the distinction between Blocked vs

595 Stalled is that a VP has to explicitly execute a VMS primitive operation to

596 enter Blocked, making it part of the semantics of parallelism constructs. In

597 contrast, Stalled happens invisibly, with no effect on semantic behavior. It

598 is due to hardware events hidden inside VMS, such as interrupts.

599

600 The Ready state is used to separate the parallelism-construct behavior from

601 the scheduling behavior. It acts as a ``staging area'' for scheduling. VPs

602 placed into this state are {\em{allowed}} to be animated, then the scheduler

603 decides when and where.

604

605 A subtle but illustrative point is that actions {\em{outside}} a given

606 program time cause the VP to transition Blocked$\rightarrow$Ready, which

607 contrasts to lock algorithms like spin-locks or Dijkstra's, where the

608 concurrency-related behavior takes place {\em{inside}} program time.

609

610 {\bf Transition Between Slave Scheduling States:\ }

611

612 \texttt{- }VPs transition states as shown in Figure \ref{figStates}.

613

614 \texttt{- }Animated$\rightarrow$Blocked is caused by a Slave VP

615 executing the Suspend VMS primitive.

616

617 \texttt{- }Blocked$\rightarrow$Ready is determined by the

618 semantics implemented in the plugin.

619

620 \texttt{- }Ready$\rightarrow$Animated is determined by the

621 scheduler in the plugin.

622

623 \texttt{- }Transitions in scheduling state have a globally

624 consistent order in Virtual time.

625

626

627 The parallelism primitives executed by a program do not control change in

628 scheduling states. They merely communicate messages to the Master, via a VMS

629 supplied primitive. Inside the Master, the plugin's parallelism construct

630 implementation processes the message. Based on that, it performs changes in

631 state from Blocked$\rightarrow$Ready, creates new VPs, and dissipates existing

632 VPs. Most communication from Slave to Master requires the VP to suspend when

633 it sends the message. A few messages, like creating new Slave may be sent

634 without suspending.

635

636 The suspend primitive decouples local physical time from Virtual time.

637 Execution causes immediate transition to Stalled in physical time, then the

638 Master performs Animated$\rightarrow$Blocked, fixing that transition in

639 Virtual time. The only relationship is causality. This weak relation is what

640 allows suspension-points to be serialized in Virtual time, which in turn is

641 what allows using sequential algorithms to implement parallelism constructs.

642

643

644 \subsection{Plugins}

645

646 \begin{figure}[ht]

647 \includegraphics{../figures/VMS-core__plugins.png}

648 \caption

649 {

650 The Master has been split into a generic core and a language-specific plug-in.

651 The core encapsulates the hardware and remains the same across applications.

652 The plug-in is part of the parallelism-construct implementation. It is

653 loaded separately onto the hardware and linked to the application when run.

654 }

655 \label{figMasterSplit}

656 \end{figure}

657

658

659 The Master entity has two parts, a generic core part and a plugin (Figure \ref{figMasterSplit}).

660 The core part of the Master is implemented as part of

661 VMS-core. The plug-in supplies two functions: the communication-handler and

662 the scheduler, both having a standard prototype. The communication-handler

663 implements the parallelism constructs, while scheduler assigns VPs to cores.

664

665 An {\em{instance}} of a plugin is created as part of initializing an

666 application, and the instance holds the semantic and scheduling state for that

667 run of the application. This state, combined with the virtual processor states

668 of the slaves created during that application run, represents progress of the

669 work of the application. \ For example, multi-tasking is performed simply by

670 the Master switching among plug-in instances when it has a resource to offer

671 to a scheduler. The parallelism-semantic state holds all information needed to

672 resume (hardware state, such as TLB and cache-tags is inside VMS-core).

673

674

675

676 \section{Internal Workings of Our Implementation}

677 \label{secInternal}

678

679 We name the elements of our example implementation and describe their logical

680 function, then relate them to the abstract model. We then step through the

681 operation of the elements.

682

683 {\bf Elements and Their Logical Function:\ }As illustrated in Figure \ref{figInternals},

684 our VMS implementation is organized around physical cores.

685 Each core has its own {\em{master

686 virtual-processor}}, \texttt{masterVP}, and a {\em{physical-core controller}}, which communicate via a set of

687 scheduling slots, \texttt{schedSlot}. The Master in the abstract definition

688 is implemented by the multiple \texttt{masterVP}s plus a particular plugin

689 instance with its shared parallelism-semantic state (seen at the top).

690

691 On a given core, only one of: the core-controller, \texttt{masterVP}, or a

692 slave VP, is animated at any point in local physical time. Each

693 \texttt{masterVP} animates the same function, called

694 \texttt{master\_loop}, and each slave VP animates a function from the

695 application, starting with the top-level function the slave is created with,

696 and following its call sequence. The core controller is implemented here as a

697 Linux pthread that runs the \texttt{core\_loop} function.

698

699 Switching between VPs is done by executing a VMS primitive that suspends the

700 VP. This switches the physical core over to the controller, by jumping to the

701 start of the \texttt{core\_loop} function, which chooses the next VP and

702 switches to that (switching is detailed in Section \ref{secApp} Figure \ref{figAssembly}).

703

704 {\bf Relation to Abstract Model:\ }We chose to implement the Master entity

705 of the model by a set of \texttt{masterVP}s, plus plug-in functions and

706 shared parallelism-semantic state. What we call VMS-core consists of this

707 implementation of the Master, plus the core-controllers, plus the VMS

708 primitive libraries, for creating new VPs and dissipating existing VPs, suspending VPs,

709 and communicating from slave VP to Master. In Figure \ref{figInternals},

710 everything in green is part of VMS-core, while the plugin is in red, and

711 application code appears as blue, inside the slave VP.

712

713 Virtual time in the model is implemented via a combination of four things: a

714 \texttt{masterLock} (not shown) that guarantees non-overlap of

715 \texttt{masterVP} trace-segments; the \texttt{master\_loop} which performs

716 transition Animated$\rightarrow$Blocked; the \texttt{comm\_handler\_fn}

717 which performs Blocked$\rightarrow$Ready and the \texttt{scheduler\_fn}

718 which performs Ready$\rightarrow$Animated. \ Each state transition is one step

719 of Virtual time; is guaranteed sequential by the non-overlap of

720 \texttt{masterVP} trace segments; and is global due to being in

721 parallelism-semantic state that is shared (top of Figure \ref{figInternals}).

722

723 Transitions Progressing$\rightleftarrows$Stalled within the Animated state are

724 invisible to the parallelism semantics, the Master, and Virtual time, and so

725 have no effect on the elements seen.

726

727

728 \begin{figure*}[!t]

729 \includegraphics[width=5in]{../figures/VMS-core__internal_workings.png}

730 \caption

731 { Internal elements of our example VMS implementation

732 }

733 \label{figInternals}

734 \end{figure*}

735

736 {\bf Steps of Operation:\ }The steps of operation are numbered, in Figure \ref{figInternals}.

737 Taking them in order, 1) \texttt{master\_loop} scans

738 the scheduling slots to see which ones' slaves have suspended since the

739 previous scan. \ 2) It hands these to the \texttt{comm\_handler\_fn} plugged

740 in (which equals transition Animated$\rightarrow$Blocked). \ 3) The VP has a

741 request attached, and data in it causes \ the \texttt{comm\_handler\_fn}

742 to manipulate data structures in the shared parallelism-semantic state. \

743 These structures hold all the slaves in the blocked state (code-level detail

744 in Figure \ref{figReqHdlr}, Section \ref{secApp}). \ 4) Some requests cause slaves to be moved to a

745 \texttt{readyQ} on one of the cores (Blocked$\rightarrow$Ready). Which

746 core's \texttt{readyQ} receives the slave is under plugin control,

747 determined by a combination of request contents, semantic state and physical

748 machine state. 5) During the scan, the \texttt{master\_loop} also looks for

749 empty slots, and for each calls the \texttt{scheduler\_fn} plugged in. It

750 chooses a slave from the \texttt{readyQ} on the core animating

751 \texttt{master\_loop}. \ 6) The \texttt{master\_loop} then places the

752 slave VP's pointer into the scheduling slot (Ready$\rightarrow$Animated),

753 making it available to the \texttt{core\_loop}. 7) When done with the scan,

754 \texttt{masterVP} suspends, switching animation back to the

755 \texttt{core\_loop}. \ 8) \texttt{core\_loop} takes slave VPs out of the

756 slots, then 9) switches animation to them. \ 10) When a slave self-suspends,

757 animation returns to the \texttt{core\_loop} (detail in code in Figure 9),

758 which picks another, until 11) all slots are empty and the

759 \texttt{core\_loop} switches animation to the \texttt{masterVP}.

760

761 {\bf Enabling sequential implementation of parallelism semantics:\ }All of

762 that happens on each core separately, but in this particular implementation we

763 use a central \texttt{masterLock} to ensure that only one core's

764 \texttt{masterVP} can be active at any time. This guarantees non-overlap

765 of trace-segments from different \texttt{masterVP}s, allowing the plugins to

766 use sequential algorithms, without a performance penalty, as verified in

767 Section \ref{secResults}.

768

769 Relating this to the abstract model: the parallelism-semantic behavior of the

770 Master is implemented by the communication handler, in the plugin. It thus

771 runs in the Master time referred to, in the model, in Section \ref{secAbsModel}. Requests are

772 sent to the Master by self-suspension of the slaves, but sit idle until the

773 other slaves in the scheduling slots have also run. This is the passive

774 behavior of requests that was noted in Section \ref{secAbsModel}, which allows the

775 \texttt{masterVP}s to remain suspended until needed. This in turn enables

776 the \texttt{masterVP}s from different cores to be non-overlapped. It is the

777 non-overlap that enables the algorithms for the parallelism semantics to be

778 sequential.

779

780

781

782

783 \section{Code Level View}

784 \label{secApp}

785

786 To relate the abstract model and the internal elements to application code and

787 parallelism-library code, we give code snippets that illustrate key features.

788 We start with the application then work down through the sequence of calls,

789 to the plugin, using our SSR {\cite{VMSHome}} parallelism-library as an

790 example.

791

792 In general, applications are either written in terms of a full custom language

793 that has its own syntax, or a base language with a parallelism library, which

794 is often called an {\em{embedded language}}. Our demonstrators, VCilk

795 {\cite{VMSHome}}, Vthread, and SSR, are all parallelism libraries. A full

796 custom language would follow the standard practice of performing

797 source-to-source transform, from custom syntax into C plus parallelism-library

798 calls.

799

800 {\bf SSR:\ }SSR stands for Synchronous Send-Receive, and details of its

801 calls and internal implementation will be given throughout this section. It

802 has two types of construct. The first, called {\em{from-to}} has two calls:

803 \texttt{SSR\_send\_from\_to} and \texttt{SSR\_receive\_from\_to}, both of

804 which specify the sending VP as well as the receiving VP. \ The other, called

805 {\em{of-type}} also has two calls: \texttt{SSR\_\_send\_of\_type\_to} and

806 \texttt{SSR\_\_receive\_of\_type}, which allow a receiver to accept from

807 anonymous senders, but select according to type of message.

808

809

810 % An example of a double column floating figure using two subfigures.

811 % (The subfig.sty package must be loaded for this to work.)

812 % The subfigure \label commands are set within each subfloat command, the

813 % \label for the overall figure must come after \caption.

814 % \hfil must be used as a separator to get equal spacing.

815 % The subfigure.sty package works much the same way, except \subfigure is

816 % used instead of \subfloat.

817 %

818 %\begin{figure*}[!t]

819 %\centerline{\subfloat[Case I]\includegraphics[width=2.5in]{subfigcase1}%

820 %\label{fig_first_case}}

821 %\hfil

822 %\subfloat[Case II]{\includegraphics[width=2.5in]{subfigcase2}%

823 %\label{fig_second_case}}}

824 %\caption{Simulation results}

825 %\label{fig_sim}

826 %\end{figure*}

827 %

828 % Note that often IEEE papers with subfigures do not employ subfigure

829 % captions (using the optional argument to \subfloat), but instead will

830 % reference/describe all of them (a), (b), etc., within the main caption.

831

832

833 {\bf Application View:\ } Figure \ref{figAnimVP} shows snippets of application code, which use the SSR parallelism

834 library. The most important feature is that all calls take a pointer to the

835 VP that is animating the call. This is seen at the top of the figure where slave VP creation takes a pointer

836 to the VP asking for creation. Below that is the standard prototype for top level functions,

837 showing that the function receives a pointer to the VP it is the top level function for.

838

839 The pointer is placed on the stack by VMS when it creates the VP, and is the means by

840 which the application comes into possession of the pointer. This animating VP is

841 passed to all library calls made from there. For example, the bottom shows a pointer to the

842 animating VP placed in the position of sender in

843 the \texttt{send} construct call. Correspondingly, for the \texttt{receive} construct,

844 the position of receiving VP is filled by the VP animating the call.

845

846 \begin{figure}[ht]

847 {\noindent

848 {\scriptsize

849 {\small Creating a new processor:}

850 \begin{verbatim}

851 newProcessor = SSR__create_procr( &top_VP_fn,

852 paramsPtr, animatingVP ); \end{verbatim}

853

854 {\small prototype for the top level function:}

855 \begin{verbatim}

856 top_VP_fn( void *parameterStrucPtr, VirtProcr

857 *animatingVP ); \end{verbatim}

858

859 {\small handing animating VP to parallelism constructs:}

860 \begin{verbatim}

861 SSR__send_from_to( messagePtr, animatingVP,

862 receivingVP );

863 messagePtr = SSR__receive_from_to( sendingVP,

864 animatingVP ); \end{verbatim}

865 }

866 }

867 \caption

868 {

869 Application code snippets showing that all calls to the parallelism library

870 take the VP animating that call as a parameter.

871 }

872 \label{figAnimVP}

873 \end{figure}

874

875 Relating these to the internal elements of our implementation, the

876 \texttt{animatingVP} suspends inside each of these

877 calls, passing a request (generated in the library) to one of the \texttt{masterVP}s.

878 The \texttt{masterVP} then calls the \texttt{comm-handler}

879 plugin, and so on, as described in Section \ref{secInternal}.

880

881 For the \texttt{SSR\_\_create\_processor} call, the comm-handler

882 in turn calls a VMS primitive to perform the creation.

883 The primitive places a pointer to the newly created VP onto its stack, so that when

884 \texttt{top\_VP\_fn} is later animated, it sees the VP-pointer as a

885 parameter passed to it. \ All

886 application code is either such a top-level function, or has one at the root

887 of the call-stack.

888

889 The send and receive calls both suspend their animating VP. When both have

890 been called, the communication handler pairs them up and resumes both. This

891 ties time-lines together, invoking the VMS guarantee. Both

892 application-functions know, because of the VMS guarantee (Section \ref{secAbsModel}), that

893 writes to shared variables made before the send call by the sender are visible

894 to the receiver after the receive call. This is the programmer's view of tying

895 together the local time-lines of two different VPs, as defined in Section \ref{secAbsModel}.

896

897

898 {\bf Concurrency-Library View:\ }A parallelism library function, in

899 general, only creates a request, sends it, and returns, as seen below. To

900 send a request, it uses the combined request-and-suspend VMS primitive that

901 attaches the request then suspends the VP. The primitive requires the

902 pointer to the VP, to attach the request and to suspend it.

903

904

905 \begin{figure}[ht]

906 {\noindent

907 {\scriptsize

908 \begin{verbatim}

909 void * SSR__receive_from_to( VirtProcr *sendVP,

910 VirtProcr *receiveVP )

911 { SSRSemReq reqData;

912 reqData.receiveVP = receiveVP;

913 reqData.sendVP = sendVP;

914 reqData.reqType = receive_from_to;

915 VMS__send_sem_request( &reqData, receiveVP );

916 return receiveVP->dataReturnedFromRequest;

917 } \end{verbatim}

918 }

919 }

920 \caption{Implementation of SSR's receive\_from\_to library function.}

921 \label{figImplLib}

922 \end{figure}

923

924

925 In Figure \ref{figImplLib}, notice that the request's data is on the stack of the virtual

926 processor that's animating the call, which is the \texttt{receiveVP}. The

927 \texttt{VMS\_\_send\_sem\_request} suspends this VP, which changes the

928 physical core's stack pointer to a different stack. So the request data is

929 guaranteed to remain undisturbed while the VP is suspended.

930

931 Figure \ref{figAssembly} shows the implementation of the VMS suspend primitive. As seen in

932 Figure \ref{figInternals}, suspending the \texttt{receiveVP} involves

933 switching to the \texttt{core\_loop}. In our implementation, this is done by

934 switching to the stack of the pthread pinned to the physical core and then

935 jumping to the start-point of \texttt{core\_loop}.

936

937 This code uses standard techniques commonly employed in co-routine

938 implementations. Tuning effort spent in \texttt{core\_loop} is inherited by

939 all applications.

940

941

942 \begin{figure}[ht]

943 {\noindent

944 {\scriptsize

945 \begin{verbatim}

946 VMS__suspend_procr( VirtProcr *animatingVP )

947 { animatingVP->resumeInstrAddr = &&ResumePt;

948 //GCC takes addr of label

949 animatingVP->schedSlotAssignedTo->

950 isNewlySuspended = TRUE;

951 //for master_loop to see

952 <assembly code stores current physical core's

953 stack reg into animatingVP struct>

954 <assembly code loads stack reg with core_loop

955 stackPtr, which was saved into animatingVP>

956 <assembly code jmps to core_loop start instr

957 addr, which was also saved into animatingVP>

958 ResumePt:

959 return;

960 } \end{verbatim}

961 }

962 }

963 \caption

964 {Implementation of VMS suspend processor.

965 Re-animating the virtual processor reverses this sequence. \ It saves the

966 \texttt{core\_loop}'s resume instr-addr and stack ptr into the VP structure,

967 then loads the VP's stack ptr and jmps to its \texttt{resumeInstrAddr}.

968 }

969 \label{figAssembly}

970 \end{figure}

971

972

973 {\bf Plugin View:\ }SSR's communication handler dispatches on the

974 \texttt{reqType} field of the request data, as set by the

975 \texttt{SSR\_\_receive\_from\_to} code. It calls the handler code in

976 Figure \ref{figReqHdlr}. This constructs a hash-key, by concatenating the from-VP's pointer

977 with the to-VP's pointer. Then it looks-up that key in the hash-table that SSR uses

978 to match sends with receives, which is in the shared semantic state seen at

979 the top of Figure \ref{figInternals} in Section \ref{secInternal}.

980

981 The most important feature in Figure \ref{figReqHdlr} is that both send and receive

982 will construct the same key, so will find the same

983 hash entry. Whichever request is handled first in Virtual time will see the

984 hash entry empty, and save itself in that entry. The second to arrive

985 sees the waiting request and then resumes both VPs, by putting them into their

986 \texttt{readyQ}s.

987

988 Access to the shared hash

989 table can be considered private, as in a sequential algorithm. This is because

990 our VMS-core implementation ensures that only

991 one handler on one core is executing at a time.

992

993

994 \begin{figure}[ht]

995 {\noindent

996 {\scriptsize

997 \begin{verbatim}

998 handle_receive_from_to( VirtProcr *requestingVP,

999 SSRSemReq *reqData, SSRSemEnv *semEnv )

1000 { commHashTbl = semEnv->communicatingVPHashTable;

1001 key[0] = reqData->receiveVP; key[1] =

1002 reqData->sendVP; //send uses same key

1003 waitingReqData = lookup_and_remove( key,

1004 commHashTbl ); //get waiting request

1005 if( waitingReqData != NULL )

1006 { resume_virt_procr( waitingReqData->sendVP );

1007 resume_virt_procr( waitingReqData->

1008 receiveVP );

1009 }

1010 else

1011 insert( key, reqData, commHashTbl );

1012 //receive is first to arrive, make it wait

1013 } \end{verbatim}

1014 }

1015 }

1016 \caption

1017 {Pseudo-code of communication-handler for

1018 \texttt{receive\_from\_to} request type. The \texttt{semEnv} is a pointer

1019 to the shared parallelism-semantic state seen at the top of Figure

1020 \ref{figInternals}.

1021 }

1022 \label{figReqHdlr}

1023 \end{figure}

1024

1025

1026

1027

1028 \section{Results}

1029 \label{secResults}

1030

1031 {\bf Setup:\ }We implemented blocked dense matrix multiply with right

1032 sub-matrices copied to transposed form. We ran on a 1 socket by 4 core

1033 Core2Quad 2.4Ghz chip.

1034

1035 {\bf Implementation-Time:\ }As shown in Table 1, time to implement the

1036 three parallel libraries averages 2 days each. As an example of productivity,

1037 adding nested transactions, parallel singleton, and atomic function-execution

1038 to SSR required a single afternoon, totaling less than 100 lines of C code.

1039

1040

1041 \begin{table}[ht]

1042 \caption

1043 {Person-days to design, code, and

1044 test each parallelism library. L.O.C. is lines of (original) C code, excluding libraries and

1045 comments.

1046 }

1047 \label{tabPersonDaysLang}

1048 \begin{center}

1049 \begin{tabular}{llll}

1050 & SSR & Vthread & VCilk\\

1051 Design & 4 & 1 & 0.5\\

1052 Code & 2 & 0.5 & 0.5\\

1053 Test & 1 & 0.5 & 0.5\\

1054 L.O.C. & 470 & 290 & 310

1055 \end{tabular}

1056 \end{center}

1057 \end{table}

1058

1059

1060 {\bf Execution Performance:\ }Performance of VMS is seen in Table

1061 \ref{tabOverheadCycles}. The code is not optimized, but rather written

1062 to be easy to understand and modify. The majority of the plugin time

1063 is lost to cache misses because the shared parallelism-semantic state moves

1064 between cores on a majority of accesses. Acquisition of the master lock is slow due to the hardware

1065 implementing the CAS instruction.

1066

1067 Existing techniques will likely improve performance, such

1068 as localizing semantic data to cores, splitting malloc across the cores,

1069 pre-allocating slabs that are recycled, and pre-fetching. However, in many cases, several hundred

1070 nano-seconds per task is as optimal as the applications can benefit from.

1071

1072 \begin{table}[ht]

1073 \caption

1074 {Cycles of overhead, per scheduled

1075 slave. ``comp only'' is perfect memory, ``comp + mem'' is actual cycles.

1076 ``Plugin-concur'' only concurrency requests, ``plugin-all'' includes

1077 create and malloc requests. Two significant digits due to variability.

1078 }

1079 \label{tabOverheadCycles}

1080 \begin{center}

1081 \begin{tabular}{|l|ll|r|r|}

1082 \hline

1083 & & comp & comp\\

1084 & & only & +mem\\

1085 VMS Only & \texttt{master\_loop} & 91 & \ 110\\

1086 & switch VPs & 77 & \ 130\\

1087 & (malloc) & 160 & 2300\\

1088 & (create VP) & 540 & 3800\\

1089 \hline

1090 Language: & & & \\

1091 SSR & plugin -- concur & 190 & 540\\

1092 & plugin -- all & 530 & 2200\\

1093 & lock & & 250\\

1094 Vthread & plugin -- concur & 66 & 710\\

1095 & plugin -- all & 180 & 1500\\

1096 & lock & & 250\\

1097 VCilk & plugin -- concur & 65 & 260\\

1098 & plugin -- all & 330 & 1800\\

1099 & lock & & 250\\

1100 \hline

1101 \end{tabular}

1102 \end{center}

1103 \end{table}

1104

1105

1106

1107 {\bf Head to Head:\ }We

1108 compare our implementation of the \texttt{spawn} and \texttt{sync}

1109 constructs against Cilk 5.4, on the top in Table

1110 \ref{tabHeadToHead}, which shows that the same application code has similar

1111 performance. For large matrices, Cilk 5.4's better use of the memory hierarchy

1112 achieves 23\% better performance. However, for small matrices, VCilk is better,

1113 with a factor 2 lower overhead. Cilk 5.4 does not allow controlling the number of spawn events it

1114 actually executes, and chooses to run smaller matrices sequentially, limiting

1115 our comparison.

1116

1117 When comparing to pthreads, our VMS based implementation has more than an

1118 order of magnitude better overhead per invocation of mutex or condition

1119 variable functionality, as seen on the bottom of Table \ref{tabHeadToHead}.

1120 Applications that inherently have short trace segments will synchronize often

1121 and benefit the most from Vthread.

1122

1123

1124

1125 \begin{table}[ht]

1126 \caption

1127 {On top, exe time in seconds for MM. Below, overhead for pthread vs Vthread. First column is cycles

1128 for perfect memory and second is total measured cycles. pthread cycles are

1129 deduced from round-trip experiments.

1130 }

1131 \label{tabHeadToHead}

1132 \begin{center}

1133 \begin{tabular}{|l@{\ }|@{\ }lr|}

1134 \hline

1135 \rule{0pt}{12pt}

1136 Matrix size&Lang.&sec.\\

1137 [2pt]\hline

1138 81x81 & Cilk & 0.017\\

1139 & VCilk & 0.008\\

1140 \hline

1141 324x324 & Cilk & 0.13\\

1142 & VCilk & 0.13\\

1143 \hline

1144 648x648 & Cilk & 0.71\\

1145 & VCilk & 0.85\\

1146 \hline

1147 1296x1296 & Cilk & 4.8\\

1148 & VCilk & 6.2 \\

1149 [2pt]\hline

1150 \end{tabular}

1151 \begin{tabular}{c}

1152 \begin{tabular}{|l|rr|r|r|}

1153 \hline

1154 operation & \multicolumn{2}{c|}{Vthread} & pthread & ratio\\

1155 \hline

1156 & comp & total & & \\

1157 & only & & & \\

1158 [2pt]\hline

1159 mutex\_lock & 85 & 1050 & 50,000 & 48:1\\

1160 mutex\_unlock & 85 & 610 & 45,000 & 74:1\\

1161 cond\_wait & 85 & 850 & 60,000 & 71:1\\

1162 cond\_signal & 90 & 650 & 60,000 & 92:1\\

1163 \hline

1164 \end{tabular}\\

1165 \end{tabular}

1166 \end{center}

1167 \end{table}

1168

1169 \section{Conclusion}\label{secConclusion}

1170

1171 We have shown an alternative to the Thread model that enables easier-to-use

1172 parallelism constructs by splitting

1173 the scheduler open, to accept new parallelism constructs in the form of

1174 plugins. This gives the language control over assigning virtual

1175 processors to physical cores, for performance, debugging, and flexibility

1176 benefits. Parallelism constructs of languages are implementable using

1177 sequential algorithms, within a matter of days, while maintaining low run-time

1178 overhead, on the order of a few hundred nano-seconds per concurrency

1179 operation.

1180

1181 \bibliography{Bib_for_papers}

1182

1183 \end{document}

Mercurial > cgi-bin > hgwebdir.cgi > VMS > 0__Writings > kshalle

view 0__Papers/Future_Architecture/Future_Architecture.tex @ 0:7b45d1764e2b