The Clipper (TM) Processor: instruction set architectures and implementation.
The CLIPPER microprocessor uses caching and virtual memory as the standard mode of operation. The associated CAMMU chips each contain a 4 Kbyte cache, a translation lookaside buffer (TLB), and a translator. One CAMMU is used for instruction references and the other for data; the CAMMUs not only provide caching, but also implement protection, detect page faults, and watch the system bus to ensure multiple cache consistency. A full 32-bit address space is provided for the operating system and for each user process; the address space is not partitioned via high order address bits.
The floating point unit is on the CLIPPER processor chip. Instruction execution is pipelined, with up to five instructions in the pipeline. Interlocks and dependency checks are provided in the pipeline hardware, so that no compiler inserted no-ops are needed for correct operation. Some complicated operations and diagnostics are implemented as instruction sequences in a small, on-chip ROM, called the Macro Instruction ROM (MIROM); all other instructions are hardwired. No microcode is used. The machine has 168 instructions, of which 101 are directly hardwired.
Two versions of the CLIPPER processor have been introduced. The C100, first available in 1986 from Fairchild Semiconductor, was implemented in 2 micron CMOS, was 167K square mils, and used 132,000 transistors. The C300, available in 1988 from Intergraph Corporation, is implemented in 1.5 micron CMOS, is 285K square mils, and uses 174,000 transistors. Performance measurements show that the C100 implementation is 3 to 15 times as fast as a VAX 11/780 (averaging more than five times faster) and is somewhat faster than a Vax 8600. The C300 is about twice as fast as the C100. The peak execution rate in CLIPPER instructions is 33 MIPS for the C100 and 50 MIPS for the C300. Additional information on CLIPPER is available in  and .
Motivation and Design Philosophy
CLIPPER was designed and built to fulfill the need for a very high performance, microcomputer chip-based computer. The immediate applications for such a processor are in high performance workstations and "super-minicomputer" shared machines. To introduce some historical perspective, the highest performance commercial mainframe in 1976 was the IBM 370/168, which for the kind of workloads expected on CLIPPER (C, Fortran, Pascal), had performance comparable to that of the C100 CLIPPER.
When the CLIPPER project began in 1982-83, no existing commercial computer architecture permitted a high performance implementation on a microprocessor chip with the necessary instruction set and architectural features. At the time, architectures available on microprocessors failed to permit high performance implementations, and most other architectures failed to be easily implementable on a chip or to provide a reasonable range of features. There were also commercial barriers to using existing architectures. The decision was thus made to design a new instruction set architecture, using the previous experience of the designers and the latest thinking in the computer architecture research community.
Fashions in computer architecture have varied widely over the last few years, changing from the "baroque" or "rococo" of the 1970s to the "minimalist" 1980s. It was widely believed in the 1970s that hardware would be very cheap, and software difficult and expensive; therefore as much functionality as possible should be moved to the hardware, resulting in complex architectures such as the DEC Vax. The problems with such a complex architecture are that it is very difficult to obtain good performance as a function of the amount of logic needed, it is difficult to get compilers to actually generate instructions that use the machine features, and the machine is hard (time consuming, expensive) to design, build, and debug.
The popular thinking in computer architectures shifted in the 1980s toward very simple architectures, as originally implemented in the Cray machines (CDC 6400, 6600, 7600), studied and implemented in the IBM 801, and further studied and popularized by the RISC project at Berkeley and the MIPS project at Stanford. The essence of such machines is a simplified instruction set, which permits a hardwired implementation, a very simple instruction encoding which permits rapid decoding and effective pipelining, a load/store architecture, which greatly simplifies the control logic, and effective use of registers to cut memory traffic. Some such machines, such as RISC and MIPS, have carried these concepts to their limits by requiring fixed length instructions, almost all of which execute in one cycle. The fixed length instructions result in a significant increase in code size, increasing memory traffic and cache miss ratios. The single cycle execution requirement increases the machine cycle time; CLIPPER has more compact code and a shorter cycle time than such very simple machines. Some discussion of the RISC/CISC issues appears in  and .
The choice was thus made to design a new instruction set architecture (ISA). The instructions, the module design, and the functional partitioning were chosen to permit mainframe level performance, and to permit future compatible mainframe implementations. The continuing and increasing adoption of the easily ported UNIX as the standard operating system for academic, software development, and workstation environments made a new ISA commercially feasible.
Outline and Context
It is possible to describe a "computer" at many levels. The instruction set architecture (ISA) refers to the computer instruction set as expressed in binary or in assembly language and its functions; the ISA is usually described in the "principles of operation" manual. We use the term design architecture to refer to the highest level description of an implementation, i.e., the block diagram and parameter level. Below that are gate and circuit level descriptions.
This article focuses primarily on CLIPPER's instruction set architecture, and examines the design architecture and related issues such as performance, design tradeoffs, design implications, and areas for possible future expansion.
MEMORY ARCHITECTURE AND DATA TYPES
First, we'll provide a brief overview of the memory architecture of the CLIPPER microprocessor. A much more detailed description, including a discussion of the CAMMU, is provided in .
In normal operation, CLIPPER uses virtual memory, although unmapped (real memory) mode is also possible. The supervisor and each user process each have their own 32-bit virtual address space, defined by the PDO (page directory origin) register in the CAMMU, which contains the physical memory address of the base of the first level of the page map for the process. The page map is implemented in two levels: the first level is the page directory, and the second level contains the page tables. The page size is 4 Kbytes, which is large enough for efficient I/O, keeps the TLB miss ratio down, and provides enough unmapped bits that set selection in the 4 Kbyte caches can be effectively overlapped with translation. The page size is also small enough to avoid unreasonable levels of internal fragmentation. No address bits are used to partition the address space, as in the Vax and MIPS machines, so such a partitioning isn't an obstacle to increased address space size as technology evolves.
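The two-level page map lookup can be illustrated with a small sketch. The 12-bit page offset follows directly from the 4 Kbyte page size; the even 10/10 split of the remaining 20 bits between the page directory index and the page table index is an assumption made for illustration, as are the function and constant names.

```python
PAGE_SIZE = 4096          # 4 Kbyte pages (from the text)
OFFSET_BITS = 12          # log2(PAGE_SIZE)

# Assumption: the 20-bit virtual page number is split evenly between
# the page directory index and the page table index.
DIR_BITS = 10
TABLE_BITS = 10


def split_virtual_address(va):
    """Split a 32-bit virtual address into (directory index, table index, offset)."""
    offset = va & (PAGE_SIZE - 1)
    table_index = (va >> OFFSET_BITS) & ((1 << TABLE_BITS) - 1)
    dir_index = (va >> (OFFSET_BITS + TABLE_BITS)) & ((1 << DIR_BITS) - 1)
    return dir_index, table_index, offset
```

Under these assumptions the translator would use the first index to select a page directory entry (starting from the address held in the PDO register), the second to select a page table entry, and would pass the offset through untranslated.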
Two cache and memory management chips (see Figure 1) provide most of the support for the memory architecture; one is used for data and the other for instructions; each is connected to the processor by its own 32-bit address/data bus. Each CAMMU has a TLB and a translator. The TLB is set associative with 128 entries organized as 64 sets of 2 elements each. Protection is provided on a page basis, with each page table entry specifying permission for the process to read, write, and/or execute from the page in supervisor and/or user state; protection bits are cached in the TLB. Page faults, protection faults, and memory errors are detected by the CAMMU and a trap code is returned to the processor for supervisor action.
Each CAMMU also contains a 4 Kbyte cache memory, organized as 128 sets of two 16-byte lines. The caching policy (copy back, write through, uncacheable) is defined on a per page basis and can vary from page to page; caching policy bits are attached to each page table and TLB entry. The CAMMU is capable of "watching" the system bus and acting to maintain cache consistency when there are multiple CAMMUs on the bus and/or when I/O operations reference data resident in the local cache. Specifically, shared data is marked "shared" and is cached write-through. Bus operations labeled as "I/O" or "shared write-through" are recognized by the CAMMU. I/O writes and shared write-throughs in the cache are preempted and the cache supplies the data. Single word I/O writes and shared write-throughs on the bus update the local copy, if any, and quad-word writes invalidate the local copy.
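The cache geometry explains why set selection can overlap address translation, as noted in the discussion of page size. The following sketch works through the arithmetic; it assumes that the set index is taken from the address bits directly above the line offset, which the text does not state explicitly, and the names are illustrative.

```python
LINE_BYTES = 16           # bytes per cache line (from the text)
WAYS = 2                  # lines per set (from the text)
SETS = 128                # number of sets (from the text)
PAGE_OFFSET_BITS = 12     # 4 Kbyte pages

CACHE_BYTES = LINE_BYTES * WAYS * SETS   # total capacity in bytes

LINE_OFFSET_BITS = 4      # log2(LINE_BYTES)
SET_INDEX_BITS = 7        # log2(SETS)


def cache_set_index(address):
    """Assumed mapping: set index is address bits 4 through 10."""
    return (address >> LINE_OFFSET_BITS) & (SETS - 1)


def index_needs_no_translation():
    """True if the line offset and set index fit inside the page offset."""
    return LINE_OFFSET_BITS + SET_INDEX_BITS <= PAGE_OFFSET_BITS
```

Because the 11 bits used for line offset and set selection lie within the 12 untranslated page offset bits, two addresses with the same page offset select the same set regardless of their page numbers, so the set can be read while the TLB translates the upper bits.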
The low order eight pages of the supervisor address space are permanently mapped by the CAMMU to provide access to Boot ROM (residing on the system bus), I/O, which is addressed via reads and writes to memory addresses, and low main memory. Trap and interrupt vectors reside in low memory. The CAMMUs are controlled by reads and writes to the I/O region of memory.
The C100 model of the CLIPPER was designed to use a consistent, "little endian", numbering system for bits, bytes, and words, in which the most significant bit is in the highest numbered bit of the highest numbered byte, and internally, CLIPPER remains little endian. Figure 2 shows the instruction formats, in which the bit, byte, and word numbering may be observed. The "first parcel" is the first two bytes of the instruction stream; the remaining bytes of the instruction or the bytes of the following instruction(s) will appear in the second, third, and fourth parcels. This numbering system is also used in the DEC Vax, Intel 80386, and National 32000. This contrasts with the System/370 in which the most significant bit is the lowest numbered bit of the lowest numbered byte; bits, bytes and words are numbered in increasing order from left to right, with the MSB at the left. The Motorola 68000 also uses a "big endian" scheme, but numbers bits in the opposite order from bytes and words.
In the C300 version, CLIPPER can function in either a little-endian or big-endian mode, although internally the little-endian-ness is retained. The appropriate byte order is selected at power-up time by tying a pin to either +5v or ground. When operating in big-endian mode, CLIPPER internally reverses the order of half words in the instruction buffer, reverses the order in which double word operands are loaded/stored, and changes the byte and half-word addressing to reference the correct byte or half word within a word. As a result, data can be exchanged with a big-endian machine without reversing the bytes or changing the byte numbering. It also facilitates upgrading low performance (big-endian) machines with higher performance, CLIPPER-based products. (In contrast, when data is exchanged between a Vax and an IBM 370, bytes must be explicitly swapped.)
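The difference between the two byte orders can be seen in a short sketch. It uses Python's standard struct module to lay out a 32-bit word in memory under each convention; the helper name is hypothetical and the code models only byte placement, not the C300's internal reordering logic.

```python
import struct


def word_bytes(value, big_endian):
    """Return the four bytes of a 32-bit word in increasing address order."""
    fmt = ">I" if big_endian else "<I"
    return struct.pack(fmt, value)


little = word_bytes(0x11223344, big_endian=False)  # least significant byte first
big = word_bytes(0x11223344, big_endian=True)      # most significant byte first
```

The two layouts are byte-for-byte reversals of each other, which is why exchanging binary data between machines of opposite byte order requires explicit swapping unless the processor can operate in the other mode.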
The selection of data types represents a compromise between apparent functionality, which is enhanced by a large number of data types, and implementability, which is easiest when the number of types is small. The data types supported by the CLIPPER architecture include signed and unsigned bytes, half words (2 bytes), words (4 bytes) and long words (8 bytes). There are also single and double precision (4 and 8 bytes, respectively) floating point numbers. This set of data types is sufficient to implement programming languages such as C, Fortran, and Pascal with direct hardware support provided for most language operations. (Initially, as suggested in , little support for bytes or half words was intended, but further examination of programming needs showed that more direct hardware support was required.)
At this time CLIPPER does not provide decimal numbers, strings, or precision beyond that of long words or double precision floating point as hardware specified data types. Strings can be easily implemented via software; CLIPPER also provides three string manipulation instructions (move, compare, fill) as Macro ROM sequences. Extended precision can be obtained via software when needed.
CLIPPER also imposes alignment restrictions on data items, as do other RISC and RISC-like processors. Each data item must be stored on a boundary that is a multiple of its size. This restriction generally causes little difficulty, and simplifies the processor implementation considerably. For CLIPPER, there is no implementation problem with line crossers (fetch or store requests spanning a pair of cache lines) or page crossers (fetch or store requests spanning a page boundary), since line and page crossers are impossible for data loads and stores. Instructions can span page boundaries, but no problem occurs since the instruction stream is fetched sequentially, four (aligned) bytes at a time.
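The claim that aligned data items can never cross a cache line or page boundary can be checked with a short sketch. The data sizes, the 16-byte line size, and the 4 Kbyte page size come from the text; the function names are illustrative.

```python
DATA_SIZES = (1, 2, 4, 8)   # byte, half word, word, long word / double
LINE_BYTES = 16
PAGE_BYTES = 4096


def is_aligned(address, size):
    """An item is aligned if its address is a multiple of its size."""
    return address % size == 0


def crosses_boundary(address, size, boundary):
    """True if the item's first and last bytes fall in different blocks."""
    return address // boundary != (address + size - 1) // boundary


def aligned_items_never_cross(limit=2 * PAGE_BYTES):
    """Exhaustively check every aligned item in the first `limit` bytes."""
    for size in DATA_SIZES:
        for address in range(0, limit, size):
            if crosses_boundary(address, size, LINE_BYTES):
                return False
            if crosses_boundary(address, size, PAGE_BYTES):
                return False
    return True
```

The result holds because every supported size divides both the line size and the page size, so an item starting on a multiple of its size always ends before the next line or page begins. An unaligned word at address 14, by contrast, would span two cache lines.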
REGISTERS AND MODES OF OPERATION
User and Supervisor General Purpose Registers
There are two sets of 16 general purpose registers (GPRs), one referenced by user mode programs and one by supervisor mode programs. The mode of the program is determined by a bit in the system status word (SSW). Two privileged instructions allow data transfers between user and supervisor registers.
Using separate user and supervisor register sets speeds up interrupt and trap handling, and makes CLIPPER especially suitable for real time applications, since registers don't need to be stored or restored when interrupts occur. The selection of 16 registers was determined by several factors, including the number of bits conveniently available for register addressing and the fact that 16 registers represent a good tradeoff; 16 registers are enough for local working storage without inducing unreasonable overhead for saving and restoring them at procedure call time. The C compiler provided by Intergraph saves and restores only those registers that have been modified, and passes the first two arguments in registers. For comparison, we note that both the Vax and the IBM 370 have 16 GPRs. Lunde's results suggest that 8 to 10 registers are almost always sufficient. Analyses in show that with intra-procedure register allocation, no improvement in load/store traffic is obtained with more than 16 registers; even with interprocedural register allocation, minimal improvement is obtained with more than 16 registers. Eight registers, however, are too few.
The idea of register windows was first proposed by Baskett and was implemented in the Berkeley RISC project; the motivation was that loads and stores due to procedure calls and returns could be avoided by simply moving to a new set of registers, using shared registers to pass parameters and results. Analyses in  show that with fewer than 100 registers, interprocedural register allocation results in less memory traffic than register windows; even with a total of 256 registers, register windows only outperform interprocedural register allocation by a small amount. Large register sets, such as those used in register windows, however, have a number of disadvantages: they require substantial chip area, only a small fraction of the registers are in use at any one time, process switching time is much larger since all registers need to be stored and restored, and larger register files are slower due to distance and circuit drive requirements. Register windows also require a mechanism to address across windows, so that nonlocal variables can be referenced. For some projects (RISC II, SOAR), register access time has been a primary determinant of cycle time. The decision, therefore, to use 16 user and 16 supervisor GPRs seems to be fully justified.
Floating Point Registers
CLIPPER provides a set of eight double precision floating point (FP) registers accessible in both user and supervisor states; floating point instructions refer to these. This is similar to the IBM 370 design, in which there are four FP registers. Eight registers provide sufficient storage for temporary operands, whereas four are insufficient in the absence of memory to register operations other than load and store. Four registers are clearly insufficient to permit interprocedural register allocation. (For non-numerically intensive programs, Lunde found that three floating point registers were usually sufficient. We expect a workload that is more numerically intensive than that analyzed by Lunde.)
Processor Status Registers
Three additional program addressable registers are provided: the program counter (PC), the program status word (PSW), and the system status word (SSW). The program counter contains the address of the instruction about to be issued, i.e., the instruction in the pipeline that will be released and allowed to modify the processor state (write into a register or store a result). The internal registers containing addresses of instructions following the currently issued instruction in the pipe are not user addressable.
The program status word (PSW) is primarily used to hold status information (condition codes, trap codes) and to set those aspects of the processor state that the user process is permitted to modify, such as floating point trap enables. Four bits of condition code are provided (negative, zero, overflow, carry), and five bits of floating point exception status, as required by the IEEE 754 standard, are also available. Six bits are used to enable/disable floating point traps, and two more to specify the floating point rounding mode. A trace trap bit is available. Four bits are used to record program traps (e.g., trace trap, illegal operation), and four more to record system trap types (memory error, page fault, etc.). The PSW may be read or written by the user process.
The last status register is the system status word (SSW). The SSW is used, among other things, to record the interrupt number and level, to enable interrupts, to set the mode (user/supervisor) and to set the protection key. The SSW may only be written in supervisor state. Its use is further described in .
INSTRUCTION FORMATS AND ADDRESSING
The CLIPPER microprocessor has a load/store architecture; i.e., most of the references to memory are via load and store instructions, in contrast to both the IBM 370 and DEC Vax, which make extensive use of their register/memory operations (370 RX type instructions) and memory-to-memory (370 SS type) instructions. Eliminating most RX and SS instructions substantially simplifies the processor implementation by eliminating control logic and especially by simplifying recovery from traps and interrupts such as page faults and memory errors. As noted in , all modern, simplified architectures are load/store. The lack of RX and most SS-type instructions increases CLIPPER code size above that of densely encoded CISC (complex instruction set computer) processors such as the Vax, the National 32000 and the Intel 80386, but provides considerably denser code than RISC processors such as the SUN Sparc and the IBM ROMP. (CLIPPER does have some SS operations implemented in the MIROM.) For RISC-I , a 2/3 increase in number of instructions over the Vax was observed, using a very primitive compiler for RISC. Table I shows static code sizes (the size of the text segment of the object file) for a number of standard benchmarks compiled on a number of machines; data in  shows that static and dynamic code sizes are very closely correlated. There are two advantages to small code sizes: there is less memory traffic, which is a limiting factor in most multiprocessor designs, and cache miss ratios are lower, since working sets are smaller; see  for analyses and comparative miss ratios.
For load and store instructions, CLIPPER provides nine addressing modes, which appear in Figure 2. These nine address modes represent those judged to be important for convenient programming plus those that "come for free;" i.e., those that can be trivially generated with the logic and data paths already available. For a 32-bit architecture, a register + 32-bit displacement mode (relative with 32-bit displacement) is very useful. The long 32-bit displacement eliminates the aggravating addressability problem posed by the 12-bit displacement of the IBM 370. The register + 12-bit displacement mode saves 4 bytes, if only a short displacement is needed, and the relative (register with no displacement) mode requires two bytes less. Register + displacement addressing is often used for array and stack references, and local variables.
Absolute addressing is provided with 16-bit or 32-bit address constants. Absolute addressing is typically used for references (e.g., calls) to independently compiled code segments, and in the 16-bit form, for references to low memory and within small programs.
A PC-relative address mode would have been very useful in the IBM 370 , and such modes are provided by CLIPPER. The PC can be used with 16- or 32-bit displacement or with a register (GPR) displacement. Most of the time, the short displacement should be sufficient; in  99 percent of the branches were expressible in 16 bits or less as an offset from the PC. PC relative addressing is used primarily for branches and the PC + GPR mode for computed gotos and case statements.
Finally, a two register address mode (relative indexed) is provided, which facilitates addressing when both the base and index addresses are in registers, as well as when an array is passed as a parameter.
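The nine address modes described above can be summarized as effective address computations. The following is a minimal sketch of their semantics only, not of the instruction encoding shown in Figure 2. It assumes that displacements are sign-extended, that 16-bit absolute addresses are zero-extended, and that arithmetic wraps at 32 bits; the mode names used as dictionary-style keys are illustrative labels.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def effective_address(mode, regs, pc=0, r1=0, rx=0, disp=0):
    """Compute a 32-bit effective address for one of the nine modes."""
    if mode == "relative":                    # (R)
        ea = regs[r1]
    elif mode == "relative_disp12":           # (R) + 12-bit displacement
        ea = regs[r1] + sign_extend(disp, 12)
    elif mode == "relative_disp32":           # (R) + 32-bit displacement
        ea = regs[r1] + sign_extend(disp, 32)
    elif mode == "absolute16":                # 16-bit address constant
        ea = disp & 0xFFFF                    # assumption: zero-extended
    elif mode == "absolute32":                # 32-bit address constant
        ea = disp
    elif mode == "pc_disp16":                 # (PC) + 16-bit displacement
        ea = pc + sign_extend(disp, 16)
    elif mode == "pc_disp32":                 # (PC) + 32-bit displacement
        ea = pc + sign_extend(disp, 32)
    elif mode == "pc_indexed":                # (PC) + (RX)
        ea = pc + regs[rx]
    elif mode == "relative_indexed":          # (RX) + (R)
        ea = regs[rx] + regs[r1]
    else:
        raise ValueError("unknown addressing mode: " + mode)
    return ea & MASK32
```

Written this way, it is easy to see why several modes "come for free": the relative mode is the displacement mode with a zero displacement, and the PC-based modes reuse the same adder with the PC substituted for a general register.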
Four important aspects of the way the address mode is specified are evident in Figure 2. First, the address mode and opcode are always defined in the first instruction parcel (first two bytes), so there is no (slow) sequential decoding of the instruction; subsequent bytes can be immediately routed (as to the adder) without further examination. This encoding provides many of the supposed advantages of fixed length instructions that are used in RISC and MIPS. Second, 4 bits are used to specify the addressing mode, and only 8 of the 16 possible combinations are currently assigned, leaving the remainder available for future extensions. Third, there is no indirect addressing mode, a mode which is very difficult to implement efficiently. Finally, some of the address modes result in unused bits in some fields, which could be used in the future to generate more than 32 bits of virtual address.
To estimate the frequency of use of the various addressing modes, we examined data from the literature. In , addressing calculations for System/370 RX type instructions used no register 1.1 percent of the time, one register 85.6 percent of the time, and two registers 13.3 percent of the time; the RX type instruction forms an effective address as the sum of a 12-bit displacement and the contents of up to two registers. Data in  indicates that for the Vax, 61 percent of the operand addresses were displacement + register, and 23 percent were just register. Displacements from a register were most often one byte long. For the PDP-11 , most of the operand addresses were specified in a register (with or without increment or decrement), and most of the remainder were displacement + register. Based on the data cited and further data in  and , we would expect the Relative [(R)], Relative with 12-bit Displacement [(R) + disp], and PC Relative with 16-bit Displacement [(PC) + disp] modes to account for the bulk of the address mode use. In fact, as shown in Table VII for one (unrepresentative) benchmark, those address modes are common, as are also the PC Relative with 32-bit Displacement [(PC) + disp] and Relative Indexed [(RX) + (R)]. The former is appropriate to large programs, such as Spice, and the latter for numerical programs making many array references. We again note that many of the address modes provided "come for free"; e.g., the relative address mode is a displacement mode with no displacement. If each address mode had required significant additional logic, fewer modes would have been justified or included.
Figure 2 shows the available instruction formats. Those instructions using addresses have already been discussed; next we'll comment on instructions which do not contain memory addresses.
Register-to-register instructions are specified in two bytes. Register-immediate operations can be specified in 2, 4, or 6 bytes, depending on the size of the immediate constant. Immediate constants are often small; 69 percent of the immediate operands can be encoded in 4 or fewer bits and 96 percent in 8 or fewer bits ; the corresponding figures from  are 60 percent and 70 percent. The availability of the quick format (which provides a 4-bit unsigned constant) and the 16-bit immediate format aid code density.
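The effect of the three register-immediate formats on code density can be sketched as a simple size selection. The 4-bit unsigned quick constant and the 2, 4, and 6 byte instruction lengths come from the text; treating the 16-bit immediate as signed and the longest form as carrying a full 32-bit constant are assumptions, and the function name is hypothetical.

```python
def immediate_instruction_bytes(value):
    """Pick the shortest register-immediate format that can hold value."""
    if 0 <= value <= 15:
        return 2      # quick format: 4-bit unsigned constant
    if -(1 << 15) <= value < (1 << 15):
        return 4      # 16-bit immediate (assumed signed)
    if -(1 << 31) <= value < (1 << 32):
        return 6      # 32-bit immediate (assumed)
    raise ValueError("constant does not fit in 32 bits")


def total_bytes(constants):
    """Total instruction bytes for a sequence of immediate operands."""
    return sum(immediate_instruction_bytes(c) for c in constants)
```

Since most immediate operands are small, the majority of register-immediate instructions would take the 2-byte or 4-byte form under this scheme, which is the density benefit the text describes.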
The control opcode is used when the operation requires a small (8-bit) constant only, as for the calls (system call) instruction. The macro opcodes are used to invoke operations implemented via instruction sequences in the on-chip ROM, such as the string move (movc) instruction.
The CLIPPER instruction set is fairly conventional and reflects the experience of its designers with respect to two factors: what is needed for convenient and efficient programmability, and what can be easily implemented in hardware. Table II shows the set of opcodes. Most of the entries are self-explanatory, and we will discuss only those that are interesting or worth explaining.
The CLIPPER microprocessor is unusual in that its floating point unit is on the processor chip; the floating point execution unit is also used to compute the integer multiplication, division and mod operations. Floating point arithmetic operations are performed as specified in the IEEE 754 standard. As noted earlier, there is a separate set of eight floating point registers, and all floating point operations are register to register. The floating point registers may be loaded or stored from/to main memory, or from/to the general purpose registers.
Branches and Condition Codes
The approach chosen for CLIPPER for controlling program execution is that of condition codes, which are set by one instruction and read and used by a subsequent instruction; this is similar to what is done on the IBM 370. Using condition codes for branching yields better performance and less complexity than an instruction that both tests and branches.
Four standard condition codes--N (negative), Z (zero), V (overflow) and C (carry)--are set in the PSW after certain operations. There are five floating point exception signalling codes: FX (floating inexact), FU (floating underflow), FD (floating divide by zero), FV (floating overflow), and FI (floating invalid op). Compare instructions normally set the N and Z flags; since the compare is executed by performing a subtraction, V and C may also be set.
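Since a compare is performed as a subtraction, all four condition codes can result from it. The sketch below computes N, Z, V, and C for a 32-bit subtraction. The operand order (first minus second) and the convention that C records a borrow are assumptions for illustration; the text does not specify either.

```python
MASK32 = 0xFFFFFFFF


def compare_flags(a, b):
    """Condition codes for the 32-bit subtraction a - b."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    n = (result >> 31) & 1                    # sign bit of the result
    z = int(result == 0)                      # result is zero
    # Signed overflow: operands differ in sign and the result's sign
    # differs from the sign of a.
    v = (((a ^ b) & (a ^ result)) >> 31) & 1
    c = int(a < b)                            # assumption: C means borrow
    return {"N": n, "Z": z, "V": v, "C": c}
```

A subsequent branch on condition instruction would then test some combination of these bits; for example, a signed "less than" is conventionally N different from V, while an unsigned "less than" tests the borrow alone.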
There are two standard branch instructions. Branch on condition tests the NZVC PSW bits; the list of possibilities is shown in Table II. The branch on floating exception tests either for any exception or for a bad result (floating invalid, divide by zero, overflow). Branch instructions use the standard addressing modes, as defined in Figure 2, where the R2 field holds the condition code field that specifies the type of branch.
Implemented directly in the hardwired instruction set are the call and return (ret) instructions. The call instruction decrements the stack pointer (defined by the register in the R2 field), pushes the address of the next instruction onto the stack, and then loads the PC with the target address. Return reverses the process.
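The call and return sequence can be modeled in a few lines. This is a sketch under stated assumptions: the stack pointer is decremented by 4 bytes (one word) per push, memory is modeled as a dictionary keyed by byte address, and the class and method names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


class Machine:
    """Toy model with 16 general registers, a PC, and sparse memory."""

    def __init__(self):
        self.regs = [0] * 16
        self.pc = 0
        self.memory = {}

    def call(self, sp_reg, target, next_pc):
        # Decrement the stack pointer named by the R2 field, push the
        # address of the next instruction, then load the PC.
        self.regs[sp_reg] = (self.regs[sp_reg] - 4) & MASK32
        self.memory[self.regs[sp_reg]] = next_pc
        self.pc = target

    def ret(self, sp_reg):
        # Reverse the process: pop the return address into the PC.
        self.pc = self.memory[self.regs[sp_reg]]
        self.regs[sp_reg] = (self.regs[sp_reg] + 4) & MASK32
```

Because the stack pointer is named by a register field rather than fixed by the hardware, any general register can serve as the stack pointer in this model.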
The CLIPPER processor chip includes a small ROM (known as the Macro Instruction ROM), which holds various useful code sequences. The MIROM contents are regular instructions, not microcode. Microcode requires a two-level decode (instructions need to be decoded into microinstructions, and then decoded and executed), and microcoded machines tend to be slower than hardwired ones. Approximately half of the MIROM is devoted to diagnostic code to be used for chip testing and sorting during manufacturing. The remainder implements complex operations that are often found as single (usually microcoded) instructions on CISC machines. Implementing these functions as MIROM sequences increases code density and readability, decreases instruction fetch penalties (misses, sequential fetch delays) and memory traffic, and uses less instruction cache space. The MIROM concept has other advantages: (1) new instructions can be easily added; and (2) custom versions of the processor can be easily designed and implemented.
A Macro instruction actually represents a branch into the ROM; the instruction fetch unit starts fetching instructions from the ROM at the address specified by the macro opcode. Next, we'll briefly discuss the instructions implemented in the MIROM; the operation of the MIROM is described in more detail later.
Instructions to save and restore general registers (save registers (savewn), restore registers (restwn), save floating registers (savedn), and save user registers (saveur)) are implemented in the MIROM as a sequence of consecutive store (or load) operations, starting from a given register number and continuing through register 14. The floating point register saves and restores are implemented similarly.
Three string (storage to storage) instructions are currently implemented in the MIROM: movc (copy a string of characters from/to nonoverlapping fields), initc (initialize a string with the contents of a register; primarily used for clearing buffers), and cmpc (compare two character strings). These instructions may be interrupted and restarted.
All of the conversion operations, and negate floating, scale by, and load floating status (see Table II) are implemented in the ROM.
The return from interrupt (reti) instruction restores the processor state after trap or interrupt processing. The wait for interrupt (wait) instruction causes the processor to halt pending the arrival of an enabled interrupt. The interrupt routine then determines whether to continue execution.
Test and Set
The cost and performance advantages of multiple microprocessor computer systems sharing a common memory are currently quite compelling . The Test and Set (tsts) instruction is the instruction chosen for CLIPPER to implement the locks used in multiprocessor and multiprocess synchronization. As a single, indivisible operation, it loads the contents of a main memory location into a specified GPR, and sets bit 31 of the given main memory word to 1. Indivisibility is achieved by making the lock word noncacheable, and holding the main memory bus for the entire operation (which is a read/modify/write). A processor may either loop, continually testing the lock until it is released, use the wait instruction to sleep, or task switch. Test and set is also used by the IBM 370 and the M68000; the Vax provides seven instructions for locking and synchronization, some of which are equivalent to test and set. Test and set locks may be either cacheable or noncacheable. If they are cacheable, the local copy is updated and any remote copies are invalidated; in any case, the tsts operation always references main memory.
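The use of test and set for locking can be sketched as follows. The model treats memory as a dictionary and performs the read and the write in one function call to stand in for the indivisible bus operation; it does not model real concurrency. Releasing the lock by storing zero is an assumption, and the function names are hypothetical.

```python
LOCK_BIT = 1 << 31


def tsts(memory, address):
    """Return the old word and set bit 31 of the word in memory."""
    old = memory[address]
    memory[address] = old | LOCK_BIT
    return old


def try_acquire(memory, address):
    """The lock is obtained only if bit 31 was clear before the tsts."""
    return (tsts(memory, address) & LOCK_BIT) == 0


def release(memory, address):
    """Assumed release convention: an ordinary store of zero."""
    memory[address] = 0
```

A processor that fails to acquire the lock can call try_acquire again in a loop, sleep until an interrupt arrives, or switch tasks, corresponding to the three options described in the text.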
As shown earlier in Figure 2, the high order byte of the first parcel of each instruction always contains the instruction opcode. As noted earlier, this greatly facilitates rapid execution, by always permitting immediate instruction decode. The assignment of bits to opcodes is shown in Figure 3. Of the possible 256 operation codes available from 8 bits, 85 instructions (including sets of instructions) are defined, and 104 of the bit combinations are used. (Some opcodes used to implement instructions that may be executed only from the MIROM are not shown in Figure 3.) That leaves over 140 possible opcodes for future expansion. In general, we have made a conscious effort to allow the CLIPPER architecture to evolve with user needs and technology trends; reserving a significant number of opcodes is one part of that effort.
INTERRUPTS, TRAPS AND SUPERVISOR CALLS
The CLIPPER microprocessor provides for 402 exception conditions: 18 hardware traps, 128 programmable supervisor calls and 256 vectored interrupts. The number of hardware traps can be expanded to 128.
A trap is an exception that relates to a condition of a single instruction, e.g., page fault, memory error, overflow, etc. Interrupts are events signalled by devices external to the CLIPPER module.
Intrap and Return Sequences
The recognition by the hardware of a trap or interrupt causes entry to a macro instruction sequence, INTRAP, which in noninterruptible mode performs a context switch to supervisor mode, stores the PC, PSW, and SSW on the supervisor stack, and transfers control to the trap or interrupt handler through the vector table. The vector table is a table in low memory containing two-word entries; each entry contains the address of the trap or interrupt handler and the new SSW. The reti (return from interrupt) sequence is a noninterruptible sequence which restores the system to the correct user or supervisor environment. Interrupts and traps are prioritized, with logic within the processor giving service to the highest priority event. Traps are permitted during interrupt and trap handling but result in an unrecoverable fault; page fault traps must be avoided during fault handling.
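The entry and return sequences can be modeled in a few lines. This is a sketch under stated assumptions, not the MIROM sequence itself: the Cpu class and the function names are hypothetical, the vector table is modeled as a mapping from vector number to a (handler address, new SSW) pair, and the order in which the three values are pushed is an assumption, since the text only lists what is saved.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    pc: int = 0
    psw: int = 0
    ssw: int = 0
    supervisor_stack: list = field(default_factory=list)


def intrap(cpu, vector_table, vector):
    """Save PC, PSW and SSW, then enter the handler for `vector`."""
    handler, new_ssw = vector_table[vector]  # two-word entry
    # Push order is an assumption; the text only lists what is saved.
    cpu.supervisor_stack += [cpu.ssw, cpu.psw, cpu.pc]
    cpu.ssw = new_ssw
    cpu.pc = handler


def reti(cpu):
    """Restore the state saved by intrap, in reverse push order."""
    cpu.pc = cpu.supervisor_stack.pop()
    cpu.psw = cpu.supervisor_stack.pop()
    cpu.ssw = cpu.supervisor_stack.pop()
```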
When a trap occurs, all instructions prior to the trapping instruction are completed (including those in the floating point unit), and all instructions that follow the trapping instruction are flushed from the pipeline.
Traps can be classified into several groups: data memory, floating point arithmetic, integer arithmetic, instruction memory, illegal operation, diagnostics, and supervisor calls.
Data memory and instruction memory traps include correctable and uncorrectable memory errors, page faults, and protection faults. In each case, the CAMMU recognizes the exception and maintains copies of the protection bits taken from the page table entries in the TLB.
The five floating point arithmetic traps are invalid operation, inexact result, overflow, underflow, and divide by zero. There are trap enable flags for each of these in the PSW, as well as exception flags in the PSW which are set when the corresponding events occur. An overall floating point trap enable flag (also in the PSW) can be used to disable all floating point traps.
The trace trap causes a trap at the end of the current instruction. A MIROM sequence is considered to be a single instruction for tracing purposes. Tracing is disabled on entry to the INTRAP sequence and trace trap handler.
Supervisor calls are implemented as traps triggered by the calls instruction. There are potentially 128 supervisor call codes; the CLIX system (the Intergraph port of Unix) uses approximately 60 of them.
Interrupts are signalled externally to the processor and appear as signals on the interrupt pins of the system bus. An interrupt is taken only when no traps are pending except the trace trap, interrupts are enabled, all instructions currently in the pipeline have completed, and string instructions have either completed or have saved sufficient state to be able to restart. (Long string instructions periodically test for pending interrupts, and if there are any, save their state and permit the interrupt to be processed.) With the exception of the string instructions, interrupts are not accepted during MIROM sequences.
There are 16 prioritized interrupt levels, with 16 interrupts of equal priority within each level. Interrupt processing can be interrupted by an event of higher priority.
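The level structure accounts for the 256 vectored interrupts (16 levels of 16). A minimal sketch of the preemption rule follows; the numbering scheme (level in the high four bits of an 8-bit interrupt number) and the convention that lower level numbers mean higher priority are both assumptions made for illustration and are not stated in the text.

```python
LEVELS = 16
PER_LEVEL = 16
TOTAL_INTERRUPTS = LEVELS * PER_LEVEL  # 256 vectored interrupts


def level_of(interrupt_number):
    """Assumed numbering: the level is the high 4 bits of an 8-bit number."""
    return interrupt_number // PER_LEVEL


def can_preempt(pending, in_service):
    """True if `pending` is at a strictly higher priority level.

    Assumes lower level numbers mean higher priority. Interrupts within
    one level have equal priority, so they never preempt each other.
    """
    return level_of(pending) < level_of(in_service)
```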
As explained earlier, the term design architecture refers to the architectural implementation at a fairly high level. Figure 4 shows the major components of the CLIPPER processor and the major interconnections in a simplified fashion. Somewhat more detail is shown in Figure 5. As can be seen from those figures, the processor is divided into six major sections: the instruction bus interface (including an instruction prefetch buffer), the macro instruction unit, the instruction control unit, the floating point unit, the integer execution unit, and the data bus interface. Table III shows the fraction of the chip area occupied by various processor sections; the remainder of the area is occupied by other minor components or empty space.
Instruction Bus Interface
The instruction bus (described in more detail in ) is a bi-directional 45-line bus connecting the CPU chip to the Instruction CAMMU. The interface contains receivers (RCV) and drivers (DRV), and a 64-bit (8-byte) instruction buffer on the processor chip. Instructions are prefetched into this buffer, and are then fed into the instruction control unit as needed. A branch never hits in this buffer because there is no mechanism to detect that a branch target address is within the buffer; on a successful branch, the instruction buffer is cleared. The Instruction CAMMU contains its own instruction counter, and will feed the next 4 bytes of the instruction stream into the instruction buffer every time the next instruction line of the instruction bus is clocked. While within a cache line, the ICAMMU can deliver 4 bytes every 2 CPU cycles (60 ns), and the CPU can at its maximum rate execute 2 bytes (one parcel, or one 2-byte instruction) every CPU cycle (30 ns).
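The delivery and consumption rates quoted above are matched, which a short calculation makes explicit. This sketch uses only the C100 figures from the text; the variable and function names are illustrative.

```python
CPU_CYCLE_NS = 30  # C100 cycle time

# ICAMMU within a cache line: 4 bytes every 2 CPU cycles.
supply_bytes_per_ns = 4 / (2 * CPU_CYCLE_NS)
# CPU at its maximum rate: one 2-byte parcel every CPU cycle.
demand_bytes_per_ns = 2 / CPU_CYCLE_NS


def mbytes_per_second(bytes_per_ns):
    """Convert bytes per nanosecond to millions of bytes per second."""
    return bytes_per_ns * 1000
```

Both rates come to roughly 67 million bytes per second, so within a cache line the ICAMMU can just keep up with a stream of single-parcel, single-cycle instructions.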
A multiplexor (MUX) that can accept instructions from either the instruction buffer or the Macro Instruction ROM and feed them to the instruction control unit is also associated with the instruction bus interface.
Macro Instruction Unit
The Macro Instruction ROM (MIROM) is an on-chip ROM (1 K entries X 47 bits) that implements complicated instructions as sequences of simpler hardwired instructions; the opcode for the MIROM implemented instruction is effectively a branch target address into the ROM; the MIROM does not contain microcode. Each entry in the MIROM contains two instruction parcels plus the next instruction address and a stop bit.
The set of legal opcodes for ROM instructions is a superset of the standard instruction set, including, for example, the conditional branch within the MIROM itself; those ROM-only instructions are not shown in Table II or Figure 3.
In addition to the regular registers, there are 16 scratch registers (12 regular and 4 floating point) accessible only from instructions in the MIROM. The instructions in the MIROM also have a mechanism to reference the registers specified by the R1 and R2 fields of the macro instruction (see Figure 2).
Integer Execution Unit
The integer execution unit contains the general register file (16 user GPRs, 16 supervisor GPRs, and 12 scratch registers), the shifter, and the ALU. The register file has three ports, permitting two reads and one write during the same machine cycle.
The shifter implements the shift and rotate instructions and is designed as a serial double bit shifter. Single and double bit shifts occur in one cycle; larger shifts require multiple cycles. Data in  shows that for a particular System/370 workload, only 1.9 percent of all shifts were for more than 3 bits.
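A rough cost model for the serial double bit shifter can be written down directly. This is an estimate under an assumption, not a figure from the text: it supposes that the shifter retires two bit positions per cycle, so the cost is the shift count divided by two and rounded up, with a one-cycle minimum. The function name is hypothetical.

```python
def shift_cycles(bit_count):
    """Estimated cycles for a shift of `bit_count` bits, two bits per cycle.

    Assumption: cost is ceil(bit_count / 2), with a one-cycle minimum.
    """
    return max(1, (bit_count + 1) // 2)
```

Under this model, single and double bit shifts take one cycle, as the text states, and the rare shifts of more than 3 bits are the only ones costing more than two cycles.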
The ALU (arithmetic/logic unit) implements integer addition and subtraction, bitwise logical operations, and register-to-register transfers. The address mode additions are also performed by the ALU; each requires only one pass through the ALU, since no address computation requires more than one add.
Floating Point Unit
CLIPPER is unusual among current microprocessors in having its floating point unit (FPU) on chip. Multiplication uses a Booth algorithm which produces products iteratively, two bits per clock cycle for single precision (2 bits/3 cycles for double precision) in the C100 and 8 bits per cycle in the C300. Typically, one clock time is needed for round and one (3 in the C300) for normalize. Division uses a nonrestoring shift and subtract algorithm, producing 1 bit per three clocks in the C100 and 8 bits per seven clocks in the C300. Associated with the FPU is the floating point register file, which contains eight regular and four scratch-pad 64-bit floating point registers; the latter are accessible only from code running in the Macro Instruction ROM. The floating point unit is also used to perform integer multiply and divide.
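One common way to retire two multiplier bits per step is radix-4 Booth recoding, sketched below on integers. The text does not say which Booth variant the C100 uses, so this should be read as an illustration of the general technique rather than a description of the CLIPPER datapath; the function name and the `bits` parameter are hypothetical.

```python
def booth_radix4_multiply(a, b, bits=32):
    """Multiply a by b, consuming two bits of b per step.

    b must fit in `bits` bits as a two's complement value, and `bits`
    must be even. Returns (product, number_of_steps).
    """
    if bits % 2:
        raise ValueError("bits must be even")
    ub = b & ((1 << bits) - 1)  # two's complement image of b
    product = 0
    prev = 0                    # implicit bit to the right of bit 0
    steps = 0
    for i in range(0, bits, 2):
        b0 = (ub >> i) & 1
        b1 = (ub >> (i + 1)) & 1
        digit = b0 + prev - 2 * b1      # recoded digit in {-2, -1, 0, 1, 2}
        product += (digit * a) << i     # add the weighted partial product
        prev = b1
        steps += 1
    return product, steps
```

Each step adds one of five easily formed multiples of the multiplicand, so a multiplier of n bits needs n/2 steps before rounding and normalization.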
The floating point unit operates in parallel with respect to the rest of CLIPPER. Although only one floating point operation can be executed at a time, operations that neither use the FPU nor rely on its output can be issued steadily while the FPU completes the current operation. As a result, much of the execution time for floating point operations will overlap that of other instructions.
Floating point exceptions may be out of sequence with respect to the rest of the instruction stream. When a floating point trap occurs, the address of the floating point instruction may be recovered from a special register; the PC value pushed on the system stack can potentially be quite far from the address of the trapping instruction.
Data Bus Interface
The data bus interface consists principally of receiver and driver circuits for the data bus, and a shifter for aligning byte and half word operands. It is connected to all of the major functional units of the CPU via the S-bus so it can receive and deliver operands in the most expeditious manner.
Instruction Control Unit and CPU Pipeline
The heart of the CLIPPER processor is the instruction control unit (ICU), which is responsible for decoding instructions and controlling instruction execution. The ICU is shown in Figure 5, and the instruction execution pipeline is shown in Figure 6.
The ICU has several components. The program counter contains the address of the instruction about to be issued; to issue an instruction means to allow it to run to completion (i.e., modify registers or memory), provided no traps occur. Figure 6 shows two boxes, called the "B stage" and "C stage." Each consists of a set of decoding logic and registers for holding partially decoded instructions and the corresponding instruction address. The B stage is responsible for instruction decoding and resource management; resource management keeps track of which functional units are busy and allows instructions to advance to the issue stage only if the necessary units are available. The C stage holds the fully decoded instruction, and controls the operation of the integer execution unit and the floating point unit. The J register (Figure 5) is used to hold immediate values (including address offsets and address constants). The PSW and SSW registers are also located in the ICU.
There can be one instruction in each of the B and C stages. Shown preceding the B stage (figure 6) is the instruction buffer (IB), which holds 4 parcels (8 bytes) of instructions, or up to four instructions.
The last stage of the pipeline consists of parallel integer and floating point execution units. These two execution units can operate in parallel, with one active instruction in the FPU and one instruction in each of the three stages of the integer execution unit (IEU). Those three stages are operand fetch (L stage), arithmetic (A stage: ALU or shifter) and operand write (O stage--to either registers or elsewhere). It takes three cycles for an instruction to pass through the IEU--one to read from the registers into the ALU, one to pass through the ALU or shifter, and one to write the results. There is a bypass from the output of the ALU to the input, so that results can be immediately reused in the next instruction.
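The three-stage flow through the IEU can be laid out cycle by cycle. This is an idealized sketch: it assumes one instruction enters the L stage every cycle and that nothing stalls (no dependencies, memory delays, or multicycle operations), which the real pipeline does not guarantee. The function names are hypothetical.

```python
IEU_STAGES = ("L", "A", "O")  # operand fetch, arithmetic, operand write


def ieu_schedule(n_instructions):
    """Map each cycle to the (instruction, stage) pairs active in it.

    Assumes one instruction enters the L stage per cycle and no stalls.
    """
    schedule = {}
    for instr in range(n_instructions):
        for offset, stage in enumerate(IEU_STAGES):
            schedule.setdefault(instr + offset, []).append((instr, stage))
    return schedule


def total_cycles(n_instructions):
    """Cycles to drain n instructions: three for the first, one for each after."""
    return 0 if n_instructions == 0 else n_instructions + len(IEU_STAGES) - 1
```

In steady state each cycle has one instruction in each of L, A, and O, so the three-cycle latency is hidden and one instruction completes per cycle.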
LAYOUT, AREA, AND PHYSICAL PARAMETERS
Table III shows the fraction of the chip used for various purposes. The C100 (and C300) are implemented respectively using 2-micron (1.5-micron) CMOS, with two levels of metal interconnect with a 6.5 micron (5.2 micron) pitch, one polysilicon level with 2.0 micron (1.5 micron) gates and a 4.0 micron (3.2 micron) pitch, a 250 angstrom thick gate oxide, and 2.0 micron contacts and vias. Transistor switching speeds range from 0.5 ns (0.35 ns) to 3.0 ns, depending on gate size and load. The chip dissipates 0.5 (1.5) watts. The processor cycle time is 30 ns (20 ns), which is also the minimum time to execute an instruction. The power supply is required to provide 0 and +5 volts. The processor chip has 132 (144) pins. The chip size is 10.55 X 10.24 (13.45 X 14.12) millimeters; the package is 0.9 square inches (1.025 square inches) and is surface mounted.
CLIPPER was conceived of and designed as a high performance processor, and design decisions and tradeoffs have been made whenever possible to achieve higher performance. That high performance has indeed been achieved is evident from the instruction execution times shown in Table IV. The minimum instruction execution time is one CPU cycle time, or 30 ns in the C100 and 20 ns in the C300. The peak program execution rate is thus 50 MIPS on the C300.
Benchmark results have been obtained both from real machines running current software and from an instruction set timing simulator. The simulator shows an average of 5 to 6 clock cycles per instruction including memory delays for typical integer programs on the C100. That works out to about 5 to 7 MIPS on the C100 and 1.8 to 2.0 times that for the C300.
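The relationship between cycle time, cycles per instruction, and MIPS behind these figures can be checked with a short calculation. The sketch below uses only the cycle times and cycle counts quoted in the text; the function and variable names are illustrative.

```python
def mips(cycle_time_ns, cycles_per_instruction=1.0):
    """Millions of instructions per second for a cycle time and CPI."""
    return 1000.0 / (cycle_time_ns * cycles_per_instruction)


c100_peak = mips(30)     # one instruction per 30 ns cycle, about 33.3
c300_peak = mips(20)     # one instruction per 20 ns cycle, 50.0
c100_low = mips(30, 6)   # 6 cycles per instruction, about 5.6
c100_high = mips(30, 5)  # 5 cycles per instruction, about 6.7
```

The simulated average of 5 to 6 cycles per instruction therefore gives roughly 5.6 to 6.7 MIPS on the C100, which is the "about 5 to 7 MIPS" range, against a peak of about 33 MIPS.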
Table V shows the results of the Dhrystone, Whetstone, Linpack, Livermore Loops, Stanford, Smith and Doduc benchmarks on the C100 (33 MHz) and C300 (50 MHz) CLIPPER, the Vax 8600, 8800, and 11/785, and the SUN 3/50 (with 68881), 3/280 (with 68881), 386i/250 (with 80387) and 4/280. Whetstone and Dhrystone are in C; the others are in Fortran. All runs were with unoptimized code; published data usually shows optimized results. All runs were made by one of the authors personally, using the same source code in all cases, and should be comparable. Results have been normalized to the Vax 8600, since we no longer have access to a Vax 11/780. The Vax 11/780 is typically considered to be a 1 MIPS (millions of instructions per second) machine, and the Vax 8600 is approximately four times as fast, or 4 MIPS. (Actually, the Vax has a CISC instruction set, and thus generally runs at about 0.5 MIPS. The Vax 11/780 runs about as fast as an IBM System/370 machine running at 1 MIPS on a scientific workload.)
While there is considerable variation among the various benchmarks, the C100 CLIPPER is approximately 1.3 times as fast as a Vax 8600, or a little over 5 MIPS. The C300 CLIPPER is about 2.5 times as fast as a Vax 8600, or about 10 MIPS. Performance ratings of all machines shown would be higher with fully optimized code.
Hardware Monitor Measurements
A limited number of programs have been run on a C100 CLIPPER and measured with a hardware monitor, and also traced. Here we summarize the measurements taken from an execution of the SPICE circuit simulator on an MOS memory cell circuit. SPICE is a large double precision numerical program, and the results are not representative for other workloads.
The execution time was 8.64 seconds at 33 MHz; 1.37 seconds of system time and 7.27 seconds of user time. The instruction cache miss ratio in user state was 14 percent and the data cache miss ratio 10 percent; for system state, the miss ratios were 2.3 percent and 3.9 percent. User state data references were 69 percent read and 31 percent write; in supervisor state, the figures were 54 percent and 46 percent. Instructions were 85 percent fixed point, 12 percent floating point, 10 percent branch and call, and 3 percent other. The percentages of the most common instructions in user state are shown in Table VI. 33 percent of the branches were unconditional, and 67 percent were conditional. The frequencies of the various address modes are shown in Table VII. Data types for compares were 47 percent quick, 32 percent double, 13 percent word, and 9 percent immediate. Floating point instructions were 28 percent add double, 31 percent subtract double, 30 percent multiply double and 9 percent divide double. 11.7 percent of the instructions were "quick" types.
Performance versus Cycle Time and Cycles per Instruction
For a given instruction set architecture, CPU performance is inversely proportional to the product of cycle_time and cycles/instruction. CLIPPER achieves its high level of performance via a careful tradeoff of these two factors, rather than forcing all instructions to execute in one cycle, as is suggested by many RISC proponents.
The disadvantage to the single cycle per instruction approach is that not all instructions are equally complex, and the cycle time must accommodate the longest single cycle instruction; conversely, partitioning an instruction into a larger number of sequential phases provides more possibilities for overlap. For these reasons, the CLIPPER designers chose to implement the instruction set in the manner of a traditional mainframe, whereby the longer and more complex instructions are permitted more cycles to complete. The CPU cycle time in the C100 (30 ns) was chosen as a design goal, on the basis that the technology available at the time of chip fabrication would permit the basic instructions (e.g., add, logical operations) to complete in one cycle. Longer instructions were allowed to take as many cycles as necessary, and the appropriate hardware support was placed on-chip to ensure that the instructions executed correctly in the presence of traps, interrupts, and data and register dependencies.
As a result, in 1986 it was possible to build a 33 MHz part and in 1988, a 50 MHz part. This compares with speeds of about 16 MHz for the initial Sparc implementation (1987), and 8 MHz for the initial MIPS Corp. implementation (1986). The minimum instruction time for those machines is one cycle, so the peak instruction rate of CLIPPER is substantially higher.
There are two approaches to improving the performance of an implementation to a given instruction set architecture. The first is technology scaling, by which faster technology and denser packaging (or a smaller chip) permit the machine to run faster without any changes in the design architecture or even in the circuit diagram.
For the most part, performance improvements in scaling from one technology (e.g., 2-micron CMOS) to another (e.g., 1.5-micron CMOS) are independent of the actual absolute value of the cycle time. The cycle time in a machine is limited by the longest signal path (including gate delays) within a cycle; halving the longest path almost halves the cycle time. CLIPPER has already improved in performance significantly through the scaling and semiconductor process improvements that occurred in going to the C300, which also has a much improved floating point unit relative to the C100, as well as other minor functional changes.
In considering the performance of CLIPPER, the factor most strictly limiting performance on a high performance microprocessor is the memory interface. As is discussed in more detail in, CLIPPER is most strictly limited by memory delays, despite the two buses (one each for instructions and data), and the fact that those buses are short and that each is dedicated to communication between a pair of chips. In scaling any processor, the limiting factor will continue to be the memory interface, which does not scale as well as other aspects of the machine.
The other approach to improved performance is a redesign which decreases the number of cycles per instruction. In general, this can be accomplished by the use of more logic. This type of redesign has already occurred in going from the C100 to the C300, as is shown in Table IV. There we see that by redesigning the floating point unit, floating point instruction times have decreased significantly. Similar improvements are possible in other multicycle instructions. In comparison, the Amdahl 470V/6 required 5 to 6 cycles per instruction, and that was roughly halved for the 580. The DEC Vax 11/780 needed about 10 cycles per instruction, which was reduced to about 6 cycles for the 8600; the cycle time was only reduced from 200 ns to 80 ns, but the total performance was improved by a factor of almost five. The next versions of CLIPPER will be complete reimplementations with the mean number of cycles per instruction reduced substantially.
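The Vax comparison shows how the two factors multiply. The sketch below uses the rounded cycle counts and cycle times quoted in the text, so its result is only as precise as those inputs; the function and variable names are illustrative.

```python
def time_per_instruction_ns(cycle_time_ns, cycles_per_instruction):
    """Average instruction time as cycle time times cycles per instruction."""
    return cycle_time_ns * cycles_per_instruction


vax_780 = time_per_instruction_ns(200, 10)  # 2000 ns per instruction
vax_8600 = time_per_instruction_ns(80, 6)   # 480 ns per instruction
speedup = vax_780 / vax_8600                # about 4.2

cycle_time_gain = 200 / 80                  # 2.5 from the faster clock
cpi_gain = 10 / 6                           # about 1.67 from fewer cycles
```

With these rounded inputs the combined gain comes out a little over four, in the neighborhood of the "almost five" quoted; the difference is within what the approximate cycle counts allow. The point stands that the cycle time alone accounts for only a factor of 2.5.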
The Intergraph CLIPPER microprocessor was designed from scratch to provide high performance, cost effectiveness, convenient programmability, and an architecture that can be expanded as technology improves and the art of computer architecture design advances.
Among the important characteristics of CLIPPER are a load/store, fully hardwired architecture, full feature instruction set with complex instructions implemented in an on-chip ROM, an instruction set encoding that permits very fast decode, compact code, very fast cycle time, a sophisticated pipeline, on-chip floating point, and high performance. To minimize the costs of using CLIPPER in a product, CLIPPER is available as a small module containing the processor, two cache and memory management units, and the clock; thus the user doesn't have to build his own cache or memory management system. Opcodes and address modes have been left available, so that the instruction set and address space may be easily expanded.
We believe that CLIPPER represents a good set of design choices.
Acknowledgements. The Advanced Processor Division of Intergraph Corporation (formerly part of Fairchild Semiconductor) consists of over one hundred people, including those doing architecture, software, circuits, CAD, marketing, and manufacturing, all of whom contributed to this project. We want to especially note and thank Vern Brethour, James Cho, Rich Dickson, Duncan Gurley, John Kellu, Kevin Kissell, David Neff, and Ray Ryan, all of whom had major design and implementation roles throughout most of the project.
Alan Jay Smith's research in computer architecture and computer system performance is supported in part by the National Science Foundation under grants CCR-8202591 and MIP-8713274. Some research results obtained under this funding are presented in this article.
|Title Annotation:||product announcement|
|Author:||Hollingsworth, Walter; Sachs, Howard; Smith, Alan Jay|
|Publication:||Communications of the ACM|
|Date:||Feb 1, 1989|