Porting OpenVMS from VAX to Alpha AXP
We had two important requirements besides delivering performance. First, we wanted to make it easy to move existing users and applications from OpenVMS VAX to OpenVMS AXP systems. Second, we wanted to deliver a high-quality first version of the product as early as possible. These requirements led us to adopt a fairly straightforward porting strategy with minimal redesigns or rewrites. We view the first version of the OpenVMS AXP product as a beginning, with other evolutionary steps to follow.
The Alpha AXP architecture was designed for high performance but also with software migration from the VAX to the Alpha AXP architecture in mind. We included in the Alpha AXP architecture some VAX features that ease the migration without compromising hardware performance. VAX architecture features in the Alpha AXP architecture that are important to OpenVMS system software are four protection modes, per-page protection, and 32 interrupt priority levels (IPLs). The Alpha AXP architecture also defines a Privileged Architecture Library (PAL) environment, which runs with interrupts disabled and in the most privileged of the four modes (kernel). PALcode is a set of Alpha AXP instructions that execute in the PAL environment, implementing such basic system software functions as translation buffer (TB) miss service. On OpenVMS AXP systems, PALcode also implements some OpenVMS VAX features such as software interrupts and asynchronous traps (ASTs). The combination of hardware architecture assists and OpenVMS AXP PALcode made it easier to port the operating system to the Alpha AXP architecture.
The VAX architecture is 32-bit. It has 32 bits of virtual address space, 16 32-bit registers, and a comprehensive set of byte, word (16-bit), and longword (32-bit) instructions. The Alpha AXP architecture is 64-bit with 64 bits of virtual address space, 32 64-bit integer registers, 32 64-bit floating point registers, and instructions that load, store, and operate on 64-bit quantities. There are also longword load, store, and operate instructions, and a canonical sign-extended form for a longword in a 64-bit register.
The OpenVMS AXP system anticipates the evolution from a 32-bit to a 64-bit address space by adopting a page table format that supports a large address space. However, the OpenVMS software assumes that an address is the same size as a longword integer, and the same assumption can exist in applications. Therefore, the first version of the OpenVMS AXP system supports a 32-bit address space only.
Most of the OpenVMS kernel is in VAX assembly language (VAX MACRO-32). Instead of rewriting the VAX MACRO-32 code, we developed a compiler for it. In addition, we required inspection and manual modification of the VAX MACRO-32 code to deal with certain VAX architectural dependencies. Parts of the kernel heavily dependent on the VAX architecture were rewritten, but this was a minority of the total volume of VAX MACRO-32 source code.
Compiling VAX MACRO-32 for Alpha AXP
Simply stated, the VAX MACRO-32 compiler treats VAX MACRO-32 as a source language to be compiled and creates native OpenVMS AXP object files just as a Fortran compiler might. This task is far more complex than a simple instruction-by-instruction translation because of fundamental differences in the architectures and because source code frequently contains assumptions about the VAX architecture and the OpenVMS Calling Standard on VAX. The compiler must either transparently convert these VAX dependencies to their OpenVMS AXP counterparts or inform the user that the source code must be changed.
Source Code Annotation
We extended the VAX MACRO-32 source language to include entry point declarations and other directives for the compiler's use, which provide more information about the intended behavior of the program.
Entry point declarations were introduced to solve two problems. They allow declaration of the register semantics for a routine when the defaults are not appropriate, and they allow the specialized semantics of frameless subroutines and exception routines to be declared.
The differing register size between VAX and Alpha AXP influenced the design of the compiler. On VAX, MACRO-32 operates on 32-bit registers, and in general, the compiled code maintains 32-bit sign-extended values in Alpha AXP 64-bit registers. However, this code is now part of a system which uses true 64-bit values. As a result, we designed the compiler to generate 64-bit register saves of any registers modified in a routine, as part of what is termed the "routine prologue code." Automatic register preservation has proved to be the safest default, but must be overridden for routines which return values in registers, using appropriate entry point declarations.
Other directives allow the user to provide additional information about register state and code flow to improve generated code. Another class of directives instructs the compiler to preserve the VAX behavior with respect to granularity of memory writes or atomicity of memory updates. The Alpha AXP architecture makes guaranteed write granularity and atomic updates sufficiently more costly to performance that they should be enabled only when required. These concepts are discussed later with respect to related OpenVMS AXP kernel changes.
Identifying VAX Architecture and Calling Standard Dependencies
As mentioned earlier, the compiler must either transparently support VAX architecture-dependent constructs or inform the user that a source change is necessary. A good example of the latter is reliance on VAX JSB/RSB (jump to subroutine and return) instruction return address semantics. On VAX, a JSB leaves the return address on top of the stack, where the RSB instruction finds it to return. System subroutines often take advantage of this in order to change the return address. This level of stack control is not available in a compiled language. In porting OpenVMS to Alpha AXP, alternative coding practices were developed for this and many other nontransportable idioms.
The compiler must also account for the differences in the OpenVMS Calling Standard on VAX and Alpha AXP, which, although transparent to high-level language programmers, are very significant in assembly language. To operate correctly in a mixed-language environment, the code generated by the VAX MACRO-32 compiler must conform to the OpenVMS Calling Standard on Alpha AXP.
On VAX, a routine refers to its arguments via an argument pointer register (AP) which points to an argument list built in memory by its caller. On Alpha AXP, up to six routine arguments are passed in registers, with any additional arguments passed in stack locations. Normally, the VAX MACRO-32 compiler transparently converts AP-based references to their correct Alpha AXP locations and converts the code which builds the list to initialize the arguments correctly. In some cases, the compiler cannot convert all references to their new locations, so an emulated VAX argument list must be constructed from the arguments received in the registers. This so-called "homing" of the argument list is required if the routine contains indexed references into the argument list or stores or passes the address of an argument list element to another routine.
The compiler identifies these coding practices during its process of flow analysis, which is similar to the analysis done by a standard high-level language optimizing compiler. It builds a flow graph for each routine and tracks stack depth, register use, condition code use, and loop depth through all paths in routine flow. This same information is required for generating correct and efficient code.
Access to Alpha AXP Instructions and Registers
In addition to providing migration of existing VAX code, the VAX MACRO-32 compiler includes (1) support for a subset of Alpha AXP instructions and PALcode calls and (2) access to the 16 integer registers beyond those that map to the VAX register set. The instructions supported are those which either have no direct counterpart on VAX or which are required to operate efficiently on a full 64-bit register value. These "built-ins" were required because OpenVMS AXP uses full 64-bit values for some operations such as manipulation of 64-bit page table entries.
The compiler includes certain optimizations which are particularly important for the Alpha AXP architecture. For example, on VAX, a reference to an external symbol would not be considered expensive. On Alpha AXP, however, such a reference requires a load from the linkage section to obtain the symbol's address prior to loading the symbol's value. (The linkage section is a data region used for resolving external references.) Multiple loads of this address from the linkage section may be reduced to a single load, or the load may be moved out of a loop to reduce memory references.
Other optimizations include the elimination of memory reads on multiple safe references, register state tracking for optimal register-based memory references, redundant register save/restore removal, and many local code generation optimizations for particular operand types. Peephole optimization of local code sequences and low-level instruction scheduling are performed by the compiler's back end.
In some instances, the programmer has knowledge of register state or other code behavior which cannot be inferred from the source code alone. Compiler directives are provided for specifying register alignment state, structure base address alignment, and likely flow paths at branch points.
Certain types of optimization typically performed by a high-level language compiler cannot be performed by the VAX MACRO-32 compiler because assumptions made by the MACRO-32 programmer cannot be broken. For example, the order of memory reads may not be changed.
Major Architectural Differences in the OpenVMS Kernel
Architectural changes affecting synchronization, memory management, and I/O are not the only architectural differences causing significant changes in the kernel, but they are major ones and are representative of the effort involved in porting OpenVMS to the Alpha AXP architecture.
For the most part, it was possible to isolate architecture-dependent changes to a few major subsystems. However, differences in the memory reference architecture did have a pervasive impact on users of shared data and of many common synchronization techniques. Other differences such as those in memory management, context switching, or access to I/O devices were mostly confined to the relevant subsystems.
The following differences between the VAX and Alpha AXP memory reference architectures affected synchronization [4, 5]:
* Load/store architecture rather than atomic modify instructions
* Longword and quadword writes with no byte writes
* Read/write ordering not guaranteed
Load/store memory reference instructions are characteristic of most RISC designs. However, the other differences are less typical. In all three cases, the reasons for these differences were hardware simplification and opportunities for increased hardware performance. As we shall see, the consequences are significant changes in system software and many opportunities for subtle errors detected only under heavy loads. Adapting to these architectural changes without significantly impacting performance was one of the major challenges facing us in porting OpenVMS to the Alpha AXP architecture.
A load/store architecture such as Alpha AXP cannot provide the atomic read-modify-write instructions present in the VAX architecture. Thus, instruction sequences are necessary for many operations done by a single atomic VAX instruction. An example is incrementing a memory location. The consequence is a need for an increased awareness of synchronization. Instead of depending on hardware to prevent interference between multiple threads of execution on a single processor, explicit software synchronization may be required. Without this synchronization, the interleaving of independent load-modify-store sequences to a single memory location may result in overwritten stores and other incorrect results.
The lack of byte writes imposes additional synchronization burdens on software. Unlike VAX and most RISC systems, an Alpha AXP system has instructions to write only longwords or 64-bit quadwords, not bytes or words. Thus, byte writes must be done by a sequence which reads the encompassing longword, merges in the byte, and writes the longword to memory. As a consequence, software must be concerned not only with access to independent, but adjacent, variables. Synchronization awareness is now extended from shared data to data items that are merely near each other.
Most problems introduced by the architectural changes discussed earlier were fixed in one of three ways:
* Explicit software synchronization added between threads
* Data items relocated to aligned longwords or quadwords
* Use of Alpha AXP load locked and store conditional instructions
The obvious solution of adding explicit synchronization in the form of a software lock is not always appropriate for several reasons. First, the problem may be independent data items that happen to share a longword. Synchronizing this sort of access in unrelated code paths is very error prone. Explicit synchronization may also have an unacceptable performance impact. Finally, deadlocks are a possibility when one thread interrupts another.
Ensuring that data items are in aligned longwords or quadwords both improves performance and eliminates interactions between unrelated data. This technique is used wherever possible, but there are two major cases where it cannot be used. One is where the replication factor is too large. Expanding an array of thousands of bytes to longwords may simply not be acceptable. The other major problem case is data structures that cannot be changed because of external constraints. For example, data may represent a protocol message or a structure primarily resident on disk. We could have separate internal and external forms of such data structures, but the performance cost of continuous conversions may not be acceptable.
Often, the easiest and highest performance way to solve the synchronization problems is with sequences of load locked and store conditional instructions. The load locked instruction loads an aligned longword or quadword while setting a hardware flag that records the physical address that was loaded. The hardware flag is cleared if any other thread, processor, or I/O device writes to the locked memory location. The store conditional instruction stores an aligned longword or quadword only if the hardware lock flag is still set. Otherwise, it returns an error indication without modifying the storage location. With these instructions, it is possible to construct atomic read-modify-write sequences that allow updating any datum contained within a single aligned quadword. These sequences are significantly slower than normal loads and stores because the write must reach a point in the memory hierarchy where consistency can be guaranteed. In addition, their use may inhibit many compiler optimizations due to restrictions on the instructions between the load and the store. Although faster than most other synchronization methods, this mechanism should be used sparingly.
The lack of guaranteed read/write ordering between multiple processors is another complication for the programmer trying to achieve proper synchronization. Although not visible on a single processor, this lack of ordering means that one processor will not necessarily observe memory operations in the order they were issued by another processor. Thus, many obvious synchronization protocols will not work when writes to the synchronization variable and to the data being protected are observed to occur out of order. A memory barrier instruction is provided in the architecture to ensure ordering. However, its negative performance impact requires that it be used only when necessary.
As described, we used various mechanisms to solve our synchronization problems. Having multiple solutions allowed us to choose the one with the least performance impact for each case. In some cases, explicit synchronization was already in place due to multiprocessor synchronization requirements. In other cases, we expanded data structures at a cost of modest amounts of memory to avoid expensive synchronization when referencing data.
Unlike the pervasive architectural changes described earlier, the privileged-architecture differences had a more limited impact. The primary remaining areas of change are the new page table formats and the details of process context switching. The rest of this section describes memory management as a representative example. However, one must recognize that many limited changes still add up to modifying virtually every source module in the OpenVMS kernel even if only to add compiler directives.
Memory Management. Not surprisingly, the memory management subsystem required the most change when moving from the VAX to the Alpha AXP architecture. Aside from the obvious 64-bit-addressing capability, there are two significant differences between the page table structures on the VAX and the Alpha AXP architectures. First, Alpha AXP does not have an architectural division between shared and process-private address space. Second, the Alpha AXP three-level page table structure allows the sharing of arbitrary subtrees of the page table structure and the efficient creation of large, sparse address spaces. (See Figure 1.) There is also the possibility of larger page sizes on future Alpha AXP processors.
To meet schedule goals, we decided to emulate the VAX architecture's 32-bit address space as closely as possible. This meant creating a 2-GB process private address region (VAX P0 and P1) and a 2-GB shared address region (VAX S0 and S1) for each process. This is easily accomplished by giving each process a private level-1 page table (L1PT), which contains two entries for level-2 page tables (L2PT). One of these L2PTs is shared and implements the shared system region, while the other is private and implements the process private address regions. Although the smallest allowed page size of 8 KB results in an 8-GB region for each level-2 page table, we use only 2 GB of each to keep within our 32-bit, 4-GB limit. As shown in Figure 1, the L2PTs are chosen to place the base address of the shared system region at FFFFFFFF80000000 (hex), the same as the sign-extended address of the top half of the VAX architecture's 32-bit address space.
Although changes were extensive in the memory management subsystem, many were not conceptually difficult. Once we dealt with the new page table structure, most changes were merely for alternative page sizes, new page table entry (PTE) formats, and changes to associated data structures. We did decide to keep the OpenVMS VAX concept of mapping process page tables as a single array in shared system space for our initial implementation. Although not viable in the long term due to the potentially huge size of sparse process page tables, this decision minimized changes to code that references process page tables. Having process page tables visible in shared system space turned out to be a significant complication in paging and in address space management, but the schedule benefits in avoiding change to other subsystems were considered worthwhile. We expect to change to a more general mechanism of self-mapping process page tables in process space for a subsequent OpenVMS AXP release.
This design allowed us to meet our goals of minimum change outside of the memory management subsystem. Unprivileged code is unaffected by the memory management changes unless it is sensitive to the new page size. Even privileged code is generally unaffected unless it has knowledge of the length or format of page table entries.
Optimized Translation Buffer Use. In previous sections we may have given the impression that architectural changes always created problems for software. This was not universally true; some offered us opportunities for significant gains. One such change was an Alpha AXP translation buffer (TB) feature called granularity hints. TBs are key to performance on any virtual-memory system. Without them, it would be necessary to reference main memory page tables to translate every virtual address. However, there never seem to be enough TB entries. The Alpha AXP architecture allows a single TB entry to map a virtually and physically contiguous block of properly aligned pages, all with identical protection attributes. These pages are marked for the hardware by a flag in the page table entry.
Given granularity hints, near-zero TB miss rates for the kernel became attainable. To this end, the kernel-loading mechanisms were enhanced to create and use granularity hint regions.
The OpenVMS AXP kernel is made up of many separate images, each of which contains several regions of memory with varying protections. For example, there is read-only code, read-only data, and read-write data. Normally, a kernel image is loaded virtually contiguously and relocated so that it can execute at any address. To take advantage of granularity hints, kernel code and data are loaded in pieces and relocated to execute from discontiguous regions of memory. Only very few TB entries are actually used to map the entire nonpaged kernel, and the goal of near-zero TB misses was reached.
The benefits of granularity hints became immediately obvious, and the mechanism has since been expanded. The OpenVMS AXP system now also uses the code region for user images and libraries. This extends the benefits not only to OpenVMS-supplied images, but to customer applications and layered products as well. Of course, use of this feature is only reasonable for images and libraries used so frequently that the permanent allocation of physical memory is warranted. However, most applications are likely to have such images, and the performance advantage can be impressive.
Unlike the architectural changes described previously, the new I/O architecture brings structure to an area that was previously rather uncontrolled. The goal was to allow more flexibility in defining hardware I/O systems while presenting software with a consistent interface. These seem like contradictory aims, but both must be met to build a range of competitive high-performance hardware that can be supported with a reasonable software effort.
The Alpha AXP architecture presents a number of differences and challenges that impacted the OpenVMS AXP I/O system. These changes span areas from longword granularity to device control and status register (CSR) access to how adapters may be built.
CSR Access. One of the fundamental elements of I/O is the access of CSRs. On VAX systems, CSR access is accomplished as basically another memory reference that is subject to a few restrictions. Alpha AXP systems present a variety of CSR access models.
Early in the project, the concept of I/O mailboxes was developed as an alternative CSR access model. The I/O mailbox is basically an aligned piece of memory that describes the intended CSR access. Instead of referencing CSRs via instructions, an I/O mailbox is constructed and used as a command packet to an I/O processor. The mailbox solves three problems: the mailbox allows access to an I/O address space larger than the address space of the system; byte and word references may be specified; and the system bus is simplified by not having to accommodate CSR references that may stall the bus. As systems get faster, these bus stalls are increasingly larger impediments to performance.
Mailboxes are the I/O access mechanism on some, but not all, systems. To preserve investment in driver software, the OpenVMS AXP operating system implemented a number of routines that allow all drivers to be coded as if CSRs were accessed by a mailbox. Systems that do not support mailbox I/O have routines that emulate the access. These routines provide insulation from hardware implementation details at the cost of a slight performance impact. Drivers may be written once and used on a number of differing systems.
Read/Write Ordering. An I/O device is simply another processor, and the sharing of data is a multiprocessing issue. Since Alpha AXP does not provide strict read/write ordering, a number of rules must be followed to prevent incorrect behavior. One of the easiest changes is to use the memory barrier instructions to force ordering. Driver code was modified to insert memory barriers where appropriate.

Table 4. SPEC Release 1 benchmark results

Benchmark Name     VAX 7000 Model 610   DEC 7000 Model 610   Relative
and Number         SPECratio            SPECratio            Performance
001.gcc            34.9                 67.5                 1.93
008.espresso       28.8                 94.7                 3.29
013.spice2g6       30.9                 87.7                 2.84
015.doduc          42.1                 126.3                3.00
020.nasa7          67.2                 293.0                4.36
022.li             34.7                 100.2                2.89
023.eqntott        38.4                 127.6                3.32
030.matrix300      138.8                1219.7               8.79
042.fpppp          48.8                 193.3                3.97
047.tomcatv        61.6                 276.5                4.49
SPECint89          34.0                 95.1                 2.80
SPECfp89           57.6                 244.2                4.24
SPECmark89         46.6                 167.4                3.59

Relative Performance = DEC 7000 Model 610 SPECratio / VAX 7000 Model 610 SPECratio
The devices and adapters must also follow these rules and enforce proper ordering in their interactions with the host. An example of this is the requirement that an interrupt also act like a memory barrier in providing ordering. The device must also ensure proper ordering for access to shared data and direct memory access.
Kernel Processes. Another important Alpha AXP difference is the lack of an interrupt stack. On VAX, the interrupt stack is a separate stack for system context. With the new Alpha AXP design, any system code must use the kernel stack of the current process. Therefore, a process kernel stack must be large enough for the process and for any nested system activity. This is an unreasonable burden. A second problem is that the VAX I/O subsystem depends on absolute stack control to implement threads. As a result, most of the I/O code is in MACRO-32, which is a compiled language on the OpenVMS AXP system and does not provide absolute stack control.
These facts led us to create a kernel-threading package for system code at elevated IPLs. This package, called Kernel Processes, provides a simple set of routines that support a private stack for any given thread of execution. The routines include support for starting, terminating, suspending, and resuming a thread of execution.
The private stack is managed and preserved across the suspension with no special measures on the part of the execution thread. Removing requirements for absolute stack control will facilitate the introduction of high-level languages into the I/O system.
As stated earlier, the main purpose of the project was to deliver the performance advantages of RISC to OpenVMS applications. We adopted several methods, including simulation, trace analysis, and a variety of measurements, to track and improve operating system and application-level performance. This section presents data on the performance of OpenVMS services and of the SPEC Release 1 benchmark suite. (All Alpha AXP results are preliminary.)
Performance of OpenVMS Services
To evaluate the performance of the OpenVMS system itself, we used a set of tests that measure the CPU processing time of a range of OpenVMS services. These tests are neither exhaustive nor representative of any particular workload. We use relative CPU speed (VAX CPU time/AXP CPU time) as a metric to compare CPU performance. For I/O-related services, a RAM disk was used to eliminate I/O latencies.
The tests were run on a VAX system and an AXP system which are the same except for the CPU. Table 1 shows the configuration details of the two systems. Figure 2 shows the distribution of the relative CPU speed for the OpenVMS services measured. Most tests ran between 0.9 and 1.7 times faster on the AXP system than on the VAX system. Table 2 contains the results for a representative subset of the measured OpenVMS services.
Applications vary in their use of operating system services. Most applications spend the majority of their time performing application-specific work and a small fraction of their time using services in the operating system. Their performance depends mainly on the performance of hardware, compilers, and run-time libraries. We used the SPEC Release 1 benchmarks as representative of such applications. Table 3 shows the details of the VAX and Alpha AXP systems on which the SPEC Release 1 suite was run, and Table 4 contains the results. The SPECmark89 comparison shows that the OpenVMS AXP system outperforms the OpenVMS VAX system by a factor of 3.59.
The OpenVMS services performance and the SPECmark results are consistent with other studies of how operating system primitives and SPECmark results scale between CISC and RISC. Overall, the results are very encouraging for a first-version product in which redesigns were purposely limited to meet an aggressive schedule.
Conclusions and the Future
Some OpenVMS VAX features such as symmetric multiprocessing and VMScluster[TM] support were deferred from the first version of the OpenVMS AXP system and will be in follow-on releases. Beyond this, we anticipate significant steps to exploit the hardware architecture better, including evolving in a staged fashion to a true 64-bit operating system. Also, detailed analysis of performance results has shown the need to alter internal designs to match the RISC architecture better. Finally, a gradual replacement of VAX MACRO-32 source with a high-level language is essential for support of 64-bit virtual-address space and an important element for increasing performance.
The OpenVMS AXP system clearly demonstrates the viability of making dramatic changes in the fundamental assumptions of a mature operating system while preserving the investment in software layered on the system. The future challenge is to continue operating system evolution in order to provide more capabilities to applications while maintaining that essential level of compatibility.
[1.] Anderson, T., Levy, H., Bershad, B., and Lazowska, E. The interaction of architecture and operating system design. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV) (Santa Clara, Calif., Apr. 1991), pp. 108-120.
[2.] Bhandarkar, D. and Clark, D. W. Performance from architecture: Comparing a RISC and a CISC with similar hardware organization. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV) (Santa Clara, Calif., Apr. 1991), pp. 310-319.
[3.] Digital Equipment Corporation. OpenVMS Calling Standard. Digital Equipment Corp., Maynard, Mass., 1992.
[4.] Leonard, T., Ed. VAX Architecture Reference Manual. Digital Press, Bedford, Mass., 1987.
[5.] Sites, R. The Alpha AXP architecture. Digital Tech. J. 4, 4 (Jan. 1993). Also this issue.
[6.] SPEC Newsletter 4, 1 (Mar. 1992).
Authors: Nancy Kronenberg, Thomas R. Benson, Wayne M. Cardoza, Ravindran Jagannathan, and Benjamin J. Thomas. Communications of the ACM, Feb. 1, 1993.