History of the PowerPC architecture.
Although confident of the significant advantages of RISC, they also realized that market success ultimately depends on software. To get a running start, it seemed sensible to begin with a RISC architecture that already had a large installed base of software. From that perspective, the obvious choice was IBM's successful line of RISC System/6000 workstations and servers, which are based on the company's POWER architecture.
The PowerPC Architecture reflects the work of a team of computer architects from Apple, IBM, and Motorola who were tasked with retargeting the POWER architecture into a form more suitable for high-volume, single-chip microprocessors. The architects also enhanced the architecture with better multiprocessor support and extended it with a 64-bit address capability to ensure its viability into the next century. This article describes the evolution to the PowerPC Architecture and briefly describes some of the changes incorporated relative to the POWER architecture.
The fundamental concepts of RISC were developed by John Cocke in the mid-1970s at IBM's T.J. Watson Research Center, and first embodied in a machine called the IBM 801 minicomputer [5]. These ideas were further refined and articulated by a group at the University of California at Berkeley led by David Patterson, who coined the term "RISC" [4]. These early pioneers realized that RISC represented a substantive departure from the then-popular trend toward more complex instruction sets (embraced by "CISC" architectures such as the VAX, 8086, 32000, and 68000), a departure that promised higher performance, lower cost, and faster design time.
Complex instruction set architectures were primarily motivated by a desire to reduce the "semantic gap" between the machine language of the processor and the high-level languages in which people were programming. The theory was that such a processor would have to execute fewer instructions (have a shorter path length) and would, therefore, have better performance. The key observation underlying RISC, however, was that the sequential microcode interpreter required to execute these complex instructions introduced an expensive overhead that actually slowed down execution of the more frequently occurring simple instructions--resulting in a net loss in performance. Furthermore, complex instructions proved to be a rather poor target for compilers, which had difficulty using them, and in many cases their use precluded optimizing out unnecessary operations.
With the declining cost of memory devices and improved compiler technology, it became feasible to consider simplifying the instruction set, even at the cost of larger code size and higher memory bandwidth requirements. The 801 was the first machine to implement this strategy. It successfully demonstrated that simplifying the instruction set enabled implementations with smoother running (bubble-free) pipelines that could approach the goal of single-cycle instruction throughput. It was also discovered that other architectural features now associated with RISC--such as a large uniform register file--enabled compiler optimizations that actually kept code expansion very low and even reduced data bandwidth relative to existing CISC architectures. On balance, the 801 demonstrated that investing more transistors in instruction throughput, fast cycle times, and more registers produced a better solution to the computer performance equation than was possible by spending those transistors on more complex instructions.
But the ideas behind RISC turned out to be even more significant than originally thought. Not only did RISC processors demonstrate more parallelism through better pipelining, but the resulting simplification of the hardware made tractable the idea of dispatching multiple instructions simultaneously (superscalar(1)) and enabled implementation of the heretofore mainframe-domain concepts of dynamic instruction reordering and out-of-order instruction execution on single-chip microprocessors. This is the essence of RISC architecture--it allows the execution of more operations in parallel and at a higher rate than possible with a CISC architecture employing similar implementation complexity.
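The "computer performance equation" referred to above is usually written as execution time = instruction count x cycles per instruction x cycle time. The sketch below plugs in purely hypothetical numbers (they are not measurements of the 801 or of any CISC machine) to show how a longer path length can still yield a shorter run time when the pipeline approaches one instruction per cycle.

```python
# Illustrative only: every instruction count, CPI value, and cycle time
# below is a hypothetical figure chosen to show the shape of the trade-off.

def execution_time_ns(instruction_count, cycles_per_instruction, cycle_time_ns):
    """Performance equation: time = instructions x CPI x cycle time."""
    return instruction_count * cycles_per_instruction * cycle_time_ns

# Hypothetical CISC-style machine: shorter path length, but microcode
# interpretation raises the average cycles per instruction.
cisc_ns = execution_time_ns(1_000_000, 6.0, 50)

# Hypothetical RISC-style machine: 30% more instructions, but a pipeline
# close to one instruction per cycle and a faster clock.
risc_ns = execution_time_ns(1_300_000, 1.25, 40)

speedup = cisc_ns / risc_ns
print(f"CISC {cisc_ns / 1e6:.0f} ms, RISC {risc_ns / 1e6:.0f} ms, speedup {speedup:.2f}x")
```

With these assumed inputs the extra instructions are more than paid for by the lower cycles-per-instruction figure; different inputs would of course shift the balance.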
Satisfied that the 801 concepts had made significant improvements in instruction cycle times and pipeline efficiency, IBM set out to improve further on the 801 architecture by: 1) explicitly embodying the concept of superscalar operation in the architecture; 2) improving the architecture as a target for compilers; 3) reducing instruction path lengths; and 4) including floating point as a first-class data type in the architecture. This effort culminated in the development of the POWER architecture [3] in the late 1980s, which now forms the basis of IBM's RISC System/6000 family of workstations and servers (see Figure 1).
The POWER Architecture
The POWER architecture is a conventional RISC architecture in most respects; it adheres to the most important RISC tenets of fixed-length instructions, register-to-register architecture, simple addressing modes, simple (not requiring microcode interpretation) instructions, a large register file, and a three-operand (non-destructive) instruction format. However, the POWER architecture also has several additional features that set it apart from other RISC architectures.
First, the instruction set was organized around the idea of superscalar instruction dispatch. Conceptually, instructions are dispatched across three independent execution units: a branch unit, a fixed-point unit, and a floating-point unit (see Figure 2). Instructions can be dispatched to each of these units simultaneously, where they can execute concurrently and finish out of order. To increase the level of instruction parallelism that can be achieved in practice, the instruction set architecture defines an independent set of register resources for each unit. This minimizes the communication and synchronization required between units, thus allowing execution units to adjust to the dynamic instruction mix by "slipping" past one another. Any data communication required between units must be performed explicitly, exposing it to the compiler, where it can be effectively scheduled. (It is important to realize this is a conceptual model only. Any given processor may implement each of the conceptual units as multiple execution units to support additional instruction parallelism. But the existence of the model led to the consistent design of an instruction set that naturally supported at least degree-three parallelism.)
Second, the POWER architecture added several "compound" instructions to reduce instruction path lengths. Perhaps the only drawback to RISC technology relative to CISC is that it sometimes takes more instructions to perform a given task. IBM discovered that most of this code expansion is avoidable with minor enhancements to the instruction set that do not constitute a return to CISC-like complex instructions. For example, a large fraction of the code expansion was found in the prolog and epilog code associated with saving and restoring registers across a procedure call. To eliminate this as a factor, IBM introduced "load-and-store multiple" instructions that allow several registers to be moved to or from memory with a single instruction. The linkage conventions used by the POWER compilers addressed the problems of relocation, shared libraries, and dynamic linkage in one simple, unified mechanism. This is done by indirect addressing through a table of contents (TOC) that is updated at load time. The load-and-store multiple instructions were important to these linkage conventions.
Another example of "compound" instructions is the optional update of the base register on loads and stores with the newly calculated effective address. This update form eliminates the need for the extra add instruction that would otherwise be required to increment the index for progressive indexing of arrays. Even though this is a compound operation, it does not adversely affect the conventional RISC pipeline flow because the updated address is already computed and a register file write port is normally available while waiting on the memory operation.
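The saving from the update form can be illustrated with a toy model. The sketch below is a hypothetical register-machine simulation, not POWER code: it counts the instructions in the body of an array-summing loop with and without base-register update, ignoring the loop branch. The function names and the 4-byte word size are assumptions made for the example.

```python
# Toy model: counts loop-body instructions for summing an array of words.
# The pseudo-assembly in the comments is illustrative, not exact POWER syntax.

WORD = 4  # assumed word size in bytes

def sum_plain(memory, base, count):
    """Plain load followed by a separate add to advance the pointer."""
    executed = 0
    total = 0
    pointer = base
    for _ in range(count):
        value = memory[pointer]   # load   value, 0(pointer)
        pointer += WORD           # add    pointer, pointer, 4
        total += value            # add    total, total, value
        executed += 3
    return total, executed

def sum_update(memory, base, count):
    """Load with update: the effective address is written back to the base."""
    executed = 0
    total = 0
    pointer = base - WORD         # bias so the first update lands on base
    for _ in range(count):
        pointer += WORD           # load-with-update value, 4(pointer):
        value = memory[pointer]   #   EA = pointer + 4, then pointer = EA
        total += value            # add    total, total, value
        executed += 2
    return total, executed

memory = {0x1000 + WORD * i: i + 1 for i in range(8)}
print(sum_plain(memory, 0x1000, 8), sum_update(memory, 0x1000, 8))
```

Both routines compute the same sum; the update form simply folds the pointer increment into the load, which is the path-length reduction described above.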
The POWER architecture provided a few other path-length-reducing features: an extensive set of bit-field manipulation instructions, compound multiply-add floating-point instructions, condition register setting as a side effect of normal instruction execution, and load and store string instructions (which load or store arbitrarily aligned byte strings).
A third factor that differentiates the POWER architecture from many other RISC architectures is the absence of the branch-and-execute capability. Branch-and-execute (sometimes called delayed branching) causes the instruction following a branch to execute before the branch gets taken. This feature worked effectively in early RISC machines to fill the instruction bubble created by branch evaluation and fetching the new instruction stream. However, in more advanced, superscalar machines, this feature is ineffective because a single branch delay cycle induces multiple instruction bubbles that cannot all be covered with a single architectural delay slot. Almost all such machines will implement exotic facilities (e.g., branch target caches) for covering these bubbles. These facilities render the delayed branch useless. Not only is the delayed branch ineffective in such machines, it introduces significant complexity into the instruction sequencing logic. Thus, even though the 801 employed branch-and-execute, it was not included in the POWER architecture. Instead the POWER branch architecture was organized to support branch-lookahead and branch-folding techniques as described in the next paragraph.
The branching technique used in the POWER architecture is the fourth unique feature of the architecture compared to other RISC processors. The POWER architecture defines an enhanced condition register facility. The problem with traditional condition register architectures is that the setting of condition bits as a side effect of instruction execution poses serious limitations on the compiler's ability to reschedule instructions. Additionally, a condition register represents a single architectural resource that causes a serious bottleneck in a machine that executes multiple instructions in parallel or out of order. Some RISC architectures avoid the problem by completely eliminating the condition register and requiring conditions to be explicitly set (by a compare instruction) in a general register and/or by folding the comparisons into the branch instructions themselves. The latter approach potentially overloads the branch-execute pipeline stage. Therefore the POWER architecture chose instead to fix the problems of the traditional condition register approach by: a) providing an opcode bit in each instruction to make the condition register update optional, thereby restoring the compiler's ability to rearrange code, and b) providing multiple condition registers (eight) to avoid the single resource problem and to provide a large condition register namespace, so the compiler can allocate and schedule condition register resources as it does for general registers.
Another reason for selecting the enhanced condition register model was that it is consistent with the organization of the machine into independent execution units. Conceptually, the condition register is local to the branch unit. Consequently, it is not necessary to access the general register file (which is local to the fixed-point unit) to evaluate and execute a conditional branch. To the extent the compiler can schedule condition code updates early (and/or load the branch address registers early), the hardware can look ahead and fold out resolvable branches from the instruction issue slot normally occupied by the branch instruction, allowing the instruction dispatcher to feed a continuous linear stream of instructions to the computational execution units.
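The enhanced condition register model can be sketched in a few lines. The model below is a simplification under stated assumptions: the class and function names are hypothetical, the bit positions inside a field are abstract rather than the architected numbering, and the summary-overflow bit is declared but never set. It shows the two properties described above: eight independently addressable fields, and an optional record bit so that ordinary arithmetic leaves the condition register alone unless asked.

```python
# Simplified model of a condition register with eight 4-bit fields.
# Bit positions are abstract; SO (summary overflow) is not modeled.
LT, GT, EQ, SO = 0, 1, 2, 3

class ConditionRegister:
    def __init__(self):
        self.fields = [0] * 8          # eight independent fields

    def compare(self, field, a, b):
        """Explicit compare that targets one field and leaves the rest alone."""
        if a < b:
            self.fields[field] = 1 << LT
        elif a > b:
            self.fields[field] = 1 << GT
        else:
            self.fields[field] = 1 << EQ

    def test(self, field, bit):
        """What a conditional branch would examine."""
        return bool((self.fields[field] >> bit) & 1)

def add(cr, a, b, record=False):
    """Add with an optional record bit: field 0 changes only on request."""
    result = a + b
    if record:
        cr.compare(0, result, 0)
    return result

cr = ConditionRegister()
cr.compare(1, 3, 7)        # compare scheduled early into field 1
cr.compare(2, 9, 9)        # independent compare into field 2
add(cr, 5, -5)             # no record bit: no field is disturbed
print(cr.test(1, LT), cr.test(2, EQ), cr.fields[0])
```

Because the intervening add does not touch any field, a compiler is free to hoist both compares well ahead of the branches that consume them, which is the scheduling freedom the text attributes to this design.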
Evolution of the PowerPC Architecture
Satisfying the diverse needs of three major corporations like Apple, IBM, and Motorola to meet their collective long-term vision of computing required some modifications to the POWER architecture. So, with the goal of maintaining RS/6000 software compatibility, a team of architects from IBM, Apple, and Motorola set out to refine the architecture. A number of changes were made to the architecture in the following general categories:
* simplifying the architecture to be more appropriate for low-cost single-chip microprocessors
* eliminating instructions that might impede clock rates
* removing architecturally imposed barriers to superscalar dispatch and out-of-order execution
* encouraging symmetric multiprocessor systems by adding multiprocessor support features
* adding new features deemed necessary for anticipated applications
* clearly defining the line between "architecture" and "implementation"
* assuring a long lifetime for the architecture by extending it to a true 64-bit architecture
These changes resulted in a new architecture, officially called the PowerPC Architecture, which will form the basis for next-generation products not only from the three founding companies but from a large number of other companies as well.
The PowerPC Architecture maintains the same basic programming model and instruction opcode assignments as the POWER architecture. Where changes were made that could potentially prevent PowerPC processors from running existing RS/6000 binaries, care was taken to remove or change the feature in such a way that it could be trapped and emulated in software. To make this approach practical, features were changed only if they were either used infrequently in application code or were isolated in library routines that could easily be replaced.
Some of the more significant changes made in going from the POWER to the PowerPC architectures in the categories listed previously include the following:
* Simplifying the architecture
--eliminated several bit-field instructions that used three source operands (to avoid the need for an extra general register file port)
--redefined the real-time-clock as a simple binary counter with variable count rate
--eliminated special segments and the associated fine-grain locking
--eliminated the most complex string instruction
* Higher clock rates
--eliminated four instructions whose operation was dependent on the value of the source operand
* Removing superscalar barriers
--eliminated the MQ register and all extended precision shifts, extended integer multiply, and the divide-with-remainder instructions that used it
--added subtract without carry
--added floating-point imprecise exception modes
* Multiprocessor support
--added reservation model for atomically updating shared memory
--defined the weakly ordered storage model
--added new memory transaction ordering instructions
--redefined the user mode cache control instructions for use in multiprocessor systems
--defined memory aliasing rules
--replaced two-level inverted page table with single hashed page table capable of supporting concurrent aliasing
* New features
--added single-precision floating-point instructions
--added unsigned fixed-point multiply and divide
--added new storage attribute controls on memory pages
--added little-endian memory addressing mode
--added variable-size block address translation capability
* Extension to 64 bits
--defined superset architecture that supports full 64-bit linear logical address space and 64-bit integer arithmetic
--defined segment table to provide an 80-bit virtual address space (to replace the segment registers used in 32-bit addressing that provided a 52-bit virtual address space)
--extended the page table formats to support a full 64-bit physical address space
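The reservation model listed under multiprocessor support can be illustrated with a small simulation. In PowerPC the pair of instructions involved is a load-and-reserve (lwarx) and a store-conditional (stwcx.); the Python below is only a behavioral sketch with hypothetical class and method names, a single reservation, and word-granular tracking, all of which are simplifying assumptions.

```python
# Behavioral sketch of a reservation-based atomic update.
# One reservation per processor, tracked at word granularity (an assumption).

class Memory:
    def __init__(self):
        self.words = {}
        self.reservation = None

    def load_reserve(self, address):
        """Load a word and remember that we hold a reservation on it."""
        self.reservation = address
        return self.words.get(address, 0)

    def store_conditional(self, address, value):
        """Store only if the reservation is still intact; report success."""
        if self.reservation != address:
            return False
        self.words[address] = value
        self.reservation = None
        return True

    def store_from_other_processor(self, address, value):
        """A competing write cancels any reservation on that address."""
        self.words[address] = value
        if self.reservation == address:
            self.reservation = None

def atomic_increment(memory, address, interfere=None):
    """Retry loop: reload and retry whenever the conditional store fails."""
    attempts = 0
    while True:
        attempts += 1
        old = memory.load_reserve(address)
        if interfere is not None and attempts == 1:
            interfere()   # simulate another processor writing in between
        if memory.store_conditional(address, old + 1):
            return old + 1, attempts

memory = Memory()
memory.words[0x100] = 41
print(atomic_increment(memory, 0x100))
```

The point of the design is that no bus lock is held between the load and the store; a conflicting write merely causes the conditional store to fail and the short sequence to be retried.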
The most far-reaching change to the architecture was the extension to 64 bits, which involved a number of modifications to the user programming model, instruction set, and address translation mechanisms. We defined the PowerPC Architecture as a full 64-bit architecture which has a 32-bit subset. The architecture permits both 32- and 64-bit versions of PowerPC processors, but all processors are required to support 32-bit programs as a minimum. The architecture defines a 32/64-bit mode switch controllable from supervisor code that allows a 64-bit processor implementation to run 32-bit programs.
The primary change to the user-visible architecture was to extend the width of the general registers and the branch address register to 64 bits. On processors that implement the full 64-bit architecture, all instructions now simply work on full 64-bit registers rather than 32-bit wide registers. Nearly all instructions are entirely mode-independent. The only significant effect of the mode switch is to select how much of a 64-bit effective address is used in address translation. (There are a few other minor effects such as selecting the ALU bit from which ALU conditions such as carry and overflow are generated.) The address translation mechanism for 64-bit processors is similar to the translation mechanism used on 32-bit implementations, except that the segment registers are replaced with a segment table to handle the larger logical and virtual address spaces, and the page table format was extended to accommodate the larger virtual and physical address spaces. But these translation changes are not apparent to application software.
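The address arithmetic behind the 52-bit and 80-bit virtual address spaces can be sketched as follows. This is a minimal illustration, not the architected translation procedure: it assumes a 28-bit offset within a segment (16-bit page index plus 12-bit byte offset), a 24-bit virtual segment ID selected by the upper 4 effective-address bits in the 32-bit case, and a 52-bit virtual segment ID selected by the upper 36 bits in the 64-bit case. The function names and the sample segment IDs are hypothetical.

```python
# Sketch of effective-to-virtual address formation under the assumptions
# stated in the accompanying text; the page table lookup is omitted.

SEGMENT_OFFSET_BITS = 28   # 16-bit page index plus 12-bit byte offset

def truncate_effective_address(address, mode_64):
    """In 32-bit mode only the low-order 32 bits of the address are used."""
    return address if mode_64 else address & 0xFFFF_FFFF

def virtual_address(effective_address, vsid_bits, lookup_vsid):
    """Swap the segment-selecting upper bits for a wider virtual segment ID."""
    esid = effective_address >> SEGMENT_OFFSET_BITS
    offset = effective_address & ((1 << SEGMENT_OFFSET_BITS) - 1)
    vsid = lookup_vsid(esid)
    assert 0 <= vsid < (1 << vsid_bits), "segment ID wider than assumed"
    return (vsid << SEGMENT_OFFSET_BITS) | offset, vsid_bits + SEGMENT_OFFSET_BITS

# 32-bit case: 16 segment registers, 24-bit IDs (sample values are made up).
segment_registers = [0x100000 + i for i in range(16)]
va32, width32 = virtual_address(0x3000_1234, 24, lambda esid: segment_registers[esid])

# 64-bit case: a segment table maps a 36-bit selector to a 52-bit ID.
segment_table = {0x123: 0xABCDE_0000_0001}
va64, width64 = virtual_address(0x12_3000_1234, 52, lambda esid: segment_table[esid])

print(hex(va32), width32, hex(va64), width64)
```

The widths fall out directly: 24 + 28 = 52 bits for the segment-register scheme and 52 + 28 = 80 bits for the segment-table scheme, matching the figures quoted in the list of changes.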
The PowerPC 601
The first PowerPC microprocessor, the PowerPC 601 microprocessor, is now available from both IBM and Motorola. The 601 is a medium-sized, medium-performance processor suitable for low- to medium-cost desktop computer systems. It was designed as a transition processor from the POWER architecture to the PowerPC Architecture. Thus it implements a superset of both POWER and PowerPC features so that existing RS/6000 binaries will run at full speed. This provides additional time for the compilers to be retargeted to the PowerPC Architecture and applications to be recompiled before the processors which implement strictly the PowerPC Architecture become widely available.
The 601 was based on an IBM single-chip processor that was being designed when the alliance was first formed. But the 601 underwent major enhancements to improve performance and reduce costs (see sidebar "A Comparison of PowerPC 601 and PowerPC 603 Features"). For example, it gained a more sophisticated branch unit and was enhanced with multiprocessor features, including Motorola's 88110 high-performance microprocessor bus interface [2]. The 601 implements a moderately aggressive superscalar microarchitecture capable of dispatching 3 instructions, possibly out of order, on each clock cycle.
Other PowerPC Processors
IBM and Motorola, with Apple engineering participation, have put into operation a new design center to develop future PowerPC microprocessors. The Somerset Design Center is a 37,000 square-foot facility located in Austin, Texas, staffed by approximately 300 engineering professionals, primarily from Motorola and IBM. The design center is presently working on three separate PowerPC microprocessors concurrently:
* The 603: a processor intended primarily for cost-sensitive desktop and portable personal computer systems
* The 604: a high-performance part for uniprocessor or multiprocessor desktop personal computers and workstations
* The 620: a 64-bit high-performance part for high-end workstations, servers, and multiprocessor systems
Engineers at the Somerset Design Center employ a formal VLSI design methodology derived from the best of both IBM's and Motorola's CAD tools. The new designs all use an advanced 0.5 µm CMOS technology using a common set of design rules for both IBM and Motorola semiconductor fabrication facilities.
The first processor designed completely in the new design center was the 603, which is now sampling. The 603 employs a slightly more aggressive microarchitecture than the 601, but has a smaller cache, giving it approximately the same performance as the 601 at lower cost (see sidebar "A Comparison of PowerPC 601 and PowerPC 603 Features"). The 603 is also superscalar, capable of dispatching 3 instructions per clock, in order, into 5 concurrent execution units. It employs register renaming, reservation stations, speculative execution, and out-of-order instruction execution and completion to boost instruction parallelism. The 603 operates at 3.3 V and utilizes static design, automatic power-down circuitry, and a number of software-controlled power-saving modes to make it useful in laptop systems as well as low-cost desktop systems.
The 604 and 620 products had not been officially announced when this article was written, but are scheduled to see silicon during 1994. In addition to these four processors, other new PowerPC processors are in development at both Motorola and IBM internal design centers. These designs will target specific markets ranging from very low-cost, high-volume embedded control, to very low-power subnotebook computers, all the way to very high-end computers. Also, research is under way into advanced microarchitectural techniques for the next generation of billion-instruction-per-second class microprocessors.
The PowerPC Architecture is the product of nearly 20 years of work on RISC architectures beginning with IBM's seminal 801 minicomputer in the 1970s and finally refined by Apple's experience in advanced personal computers and by Motorola's experience in delivering low-cost single-chip microprocessors into high-volume markets. This architecture is currently being used as the base for a wide variety of instruction-set compatible microprocessors. With their combined resources, IBM, Apple, and Motorola intend to deliver an unparalleled range of PowerPC RISC processors into the market, all the way from the very lowest end of the portable computer marketplace to the very highest end of the supercomputer market.
(1) The term "superscalar" is believed to have been coined by T. Agerwala and John Cocke [1]. It refers to machines capable of dispatching multiple instructions per clock from a conventional linear instruction stream.
[1] Agerwala, T. and Cocke, J. High performance reduced instruction set processors. IBM Tech. Rep., March 1987.
[2] Gullette, B. The design of the 88110 Bus Interface. In Proceedings of RISC '92, Feb. 1992.
[3] Oehler, R.R. and Groves, R.D. IBM RISC System/6000 processor architecture. IBM J. Res. Develop. 34 (Jan. 1990), 23-36.
[4] Patterson, D.A. and Ditzel, D.R. The case for the reduced instruction set computer. Comput. Architecture News (Oct. 15, 1980).
[5] Radin, G. The 801 Minicomputer. In Proceedings of the Symposium on Architectural Support for Programming Languages (March 1982), 39-47.
Title Annotation: The Making of the PowerPC
Publication: Communications of the ACM
Article Type: Cover Story
Date: Jun 1, 1994