Specifying representations of machine instructions.

1. INTRODUCTION

This article describes SLED - Specification Language for Encoding and Decoding - and its implementation in the New Jersey Machine-Code Toolkit. SLED specifications define mappings between symbolic, assembly-language, and binary representations of machine instructions. We have specified representations of MIPS R3000, SPARC, Alpha, and Intel Pentium instructions; toolkit users have written specifications for the PowerPC and Motorola 68000. The specifications are simple, which makes it practical to use the toolkit to write applications for multiple architectures. The toolkit uses SLED specifications to help programmers write applications that process machine code - assemblers, disassemblers, code generators, tracers, profilers, and debuggers. The toolkit lets programmers encode and decode machine instructions symbolically. Guided by a SLED specification, it transforms symbolic manipulations into bit manipulations.

Traditional applications that process machine code include compilers, assemblers, linkers, and debuggers. Newer applications include profiling and tracing tools [Ball and Larus 1994; Cmelik and Keppel 1994], testing tools [Hastings and Joyce 1992], protection enforcers [Wahbe et al. 1993], run-time code generators [George et al. 1994], and link-time optimizers [Fernandez 1995; Srivastava and Wall 1993]. There are even some frameworks for creating applications that manipulate executable files, although none that work on more than one machine [Johnson 1990; Larus and Schnarr 1995; Srivastava and Eustace 1994]. Graham et al. [1995] describe auxiliary information needed to facilitate machine-code manipulations; they report support for the MIPS and SPARC architectures.

A few applications avoid machine code by using assembly language, e.g., many Unix compilers emit assembly language, not object code. It is not always practical, however, to use an assembler, e.g., when generating code at run time or adding instrumentation after code generation. Some machine-code applications can be duplicated by source-code transformation; such applications usually work on many machines, but they cannot be used as often as applications that work on object code, because source code is not always available. Our toolkit makes it easier to build applications and frameworks that work with object code and therefore can be used on any executable file.

Applications that cannot use an assembler currently implement encoding and decoding by hand. Different ad hoc techniques are used for different architectures. The task is not intellectually demanding, but it is error prone; bit-manipulating code usually harbors lingering bugs. Our toolkit automates encoding and decoding, providing a single, reliable technique that can be used on a variety of architectures. It is intended less to support traditional compilation than to support nontraditional operations like rewriting executable files or run-time code generation.

Applications use the toolkit for encoding, decoding, or both. For example, assemblers encode; disassemblers decode; and some profilers do both. All applications work with streams of instructions. Decoding applications use matching statements to read instructions from a stream and identify them. A matching statement is like a case statement, except its alternatives are labeled with patterns that match instructions or sequences of instructions. Encoding applications call C procedures generated by the toolkit. These procedures encode instructions and emit them into a stream, e.g., the SPARC call fnegs(r2, r7) emits the word 0x8fa000a2. Streams can take many forms; for example, a debugger can treat the text segment of a target process as an instruction stream. The toolkit's library provides a representation of streams that should be convenient for many encoding applications.
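
As an illustration (not from the toolkit's distribution), the C fragment below sketches how an encoding application might call such a generated procedure. The prototype is an assumption for the example; the real signature depends on the specification and on how the application sets up its instruction stream.

/* A minimal sketch of an encoding application.  We assume the generator has
   produced a SPARC encoding procedure with roughly this prototype. */
extern void fnegs(unsigned rs2, unsigned rd);   /* generated encoding procedure */

void emit_fnegs_example(void)
{
    /* Appends the 32-bit word for "fnegs %f2, %f7" to the current
       instruction stream, which is managed by the toolkit's library. */
    fnegs(2, 7);
}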

The toolkit has four parts. The translator takes a program with embedded matching statements and translates these statements into ordinary code. It handles programs written in C or Modula-3 [Nelson 1991]. The generator generates encoding and relocation procedures in C. These procedures call code in the library. The library implements both instruction streams and relocatable addresses, which refer to locations within the streams. The checker checks specifications for consistency with existing assemblers. The translator, generator, and checker need an instruction specification; encoding procedures and checking code are generated from the specification, and matching statements can match the instructions or parts thereof defined in the specification. The library is machine independent.

The SLED specification language is simple, and it is designed so that specifications can resemble instruction descriptions found in architecture manuals. SLED uses a single, bidirectional construct to describe both encoding and decoding, so their consistency is guaranteed. The toolkit checks specifications for unused constructs, underspecified instructions, and internal inconsistencies. An instruction encoding can be specified with modest effort; our Alpha, MIPS, SPARC, and Pentium specifications are 118, 127, 193, and 460 lines. The SLED specification language is the primary subject of this article.

Simplicity in specification is more than a personal preference. Simple specifications are more likely to be correct, and correct specifications are more valuable if they can be used in a variety of applications. To make the toolkit simple and general, we avoid describing the semantics of instructions, because too often semantic information is both hard to get right and of use only to a single application. Instead, SLED focuses on describing an abstract representation of instructions and on automating the translation to and from that abstract representation.

We have personal experience with two applications that use the toolkit. mld, a retargetable, optimizing linker [Fernandez 1995], uses the toolkit to encode instructions and emit executable files. ldb [Ramsey 1992; Ramsey and Hanson 1992], a retargetable debugger, uses the toolkit to decode instructions and to implement breakpoints. Others have used the toolkit to help develop a run-time code generator, a decompiler, an execution-time analyzer [Braun 1996], and an optimizing compiler for object-oriented languages [Dean et al. 1996].

Using the toolkit reduces retargeting effort and makes code more reliable. For example, ldb's disassembler for the MIPS requires less than 100 lines of code, and mld has replaced 450 lines of hand-written MIPS code with generated encoding and relocation procedures. By hiding shift and mask operations, by replacing case statements with matching statements, and by checking specifications for consistency, the toolkit reduces the possibility of error. The toolkit can speed up applications that would otherwise have to generate assembly language instead of binary code. For example, mld creates executable files 1.7 to 2 times faster when using toolkit-generated encoding procedures than when using assembly language and calling a native assembler. To realize such speedups without the toolkit, mld would need hand-written encoding and relocation procedures for each target architecture.

The primary contribution of our work is the SLED specification language, which is expressive enough to write clear, concise, and reusable specifications of instruction representations for a variety of widely used architectures. Our processor for these specifications derives code for both encoding and decoding problems, eliminating a significant source of retargeting effort. Our model of machine instructions makes several machine-level concepts general enough to be specified or implemented in a machine-independent way. These concepts include conditional assembly, span-dependent instructions, relocatable addresses, object code, sections, and relocation.

Most of this article is devoted to SLED. We begin with an extended example: a specification for a representative subset of the SPARC instruction set. This example shows how a typical specification is structured and how SLED is used idiomatically. We then cover the details of syntax, semantics, and implementation, followed by smaller examples from our Pentium specification, which show CISC addressing modes and variable-sized operands. We explain how applications use the code generated by the toolkit, and we conclude with a discussion of related work and an evaluation of the toolkit and its specification language.

2. SPECIFYING INSTRUCTION REPRESENTATIONS

To illustrate SLED, we specify a subset of the SPARC instruction set. The illustration is drawn from our complete, annotated specification of the SPARC [Ramsey and Fernandez 1994a]. It includes the SPARC's integer instructions, but it omits floating-point instructions, several types of load and store, and many synthetic instructions. Before beginning the illustration, we explain the elements of the specification language and our strategy for using the language to describe a machine.

Because machine instructions do not always fit in a machine word, the toolkit works with streams of instructions, not individual instructions. An instruction stream is like a byte stream, except that the units may be "tokens" of any size, not just 8-bit bytes. An instruction is a sequence of one or more tokens, so "token stream" might be a more precise term. Tokens may come in any number of classes, which help distinguish different parts of complex instructions. For example, a Pentium instruction might include several 8-bit prefixes, an 8-bit opcode, 8-bit format bytes, and a 16-bit immediate operand. Most likely, the prefixes and opcode would be tokens from the same class, but the format bytes and operand would be from different classes.

Each token is partitioned into fields; a field is a contiguous range of bits within a token. Fields contain opcodes, operands, modes, or other information. Tokens of a single class may be partitioned in more than one way. Patterns constrain the values of fields; they may constrain fields in a single token or in a sequence of tokens. Patterns describe binary representations of instructions, groups of instructions, or parts of instructions. For example, simple patterns can be used to specify opcodes, and more complex patterns can be used to specify addressing modes or to specify a group of three-operand arithmetic instructions.

Constructors connect abstract, binary, and assembly-language representations of instructions. At an abstract level, an instruction is a function (the constructor) applied to a list of operands. An operand may be as simple as a single field, or as complex as a set of fields taken from several tokens in sequence. Applying the constructor produces a pattern that gives the instruction's binary representation, which is typically a sequence of tokens. Each constructor is also associated with a function that produces a string, which is the instruction's assembly-language representation. Specification writers use constructors to define an abstract equivalent of an assembly language. Application programmers use constructors to emit instructions, by calling procedures derived from constructor specifications, and to decode instructions, by using constructors in matching statements to match instructions and extract their operands.

Machine designers might expect binary representations to be untyped. We have found it useful to associate type information with binary representations or with fragments of binary representations, for the same reason that programming languages do so - to help detect and prevent errors. The classes of tokens are like types. We also require that each constructor have a type. We provide a predefined, anonymous type for constructors that produce whole instructions, and specification writers may introduce more constructor types. We typically use such types to describe effective addresses or structured operands. When used in this way, the constructor type corresponds to the "operand class" of Cattell [1980], and each constructor of the type corresponds to one "access mode." The toolkit maps constructor types onto types in the code it generates, which helps find errors in application code as well as in specifications.

To describe a machine, we begin by specifying tokens and fields, which are the basic components of instructions. Next come patterns that specify opcodes and groups of related opcodes, then constructors that specify structured operands, like effective addresses. Having specified opcodes and operands, we define constructors that specify instructions. When possible, we specify many constructors concisely by using "opcode patterns," which group related instructions.

Many architecture manuals use the term "synthetic" to describe instructions that are supported by an assembler, but not directly by the hardware. The assembler synthesizes such instructions by using special cases or combinations of other instructions. SLED specifications can include synthetic instructions, for which binary representations are given by applying previously defined constructors. We typically specify synthetic instructions in a separate file, since they are useful only in some applications.

The rest of this section gives excerpts from our specification of the SPARC. We have engineered SLED's syntax to foster resemblances between specifications and architecture manuals, and we refer to relevant pages of the SPARC manual [SPARC International 1992] by page number. When concatenated, the excerpts form a complete SLED specification for a subset of the SPARC architecture. The specification is included in the toolkit's source distribution.

We use bit numbers to specify the positions of fields within tokens. Since different manuals use different conventions, the toolkit supports both little-endian and big-endian bit numberings. The SPARC manual uses the little-endian numbering.

bit 0 is least significant

Architecture manuals usually have informal field specifications. For example, the fields for some SPARC load instructions are [SPARC International 1992, p. 90]:

[TABULAR DATA OMITTED]

fields declarations give the locations and sizes of fields. The declaration below defines the fields used in all SPARC instructions. The first line defines the fields in the picture above. The remaining lines define all the other fields used in the SPARC manual, even those used only in floating-point instructions, which are otherwise omitted from this article.

fields of itoken (32)

op 30:31  rd 25:29  op3 19:24  rs1 14:18  i 13:13  simm13 0:12
disp30 0:29  op2 22:24  imm22 0:21  a 29:29  cond 25:28  disp22 0:21
asi 5:12  rs2 0:4  opf 5:13  fd 25:29  cd 25:29  fs1 14:18  fs2 0:4
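
The following C fragment is illustrative only, not the toolkit's generated code; it shows the kind of shift-and-mask operations such a declaration lets the toolkit derive, here for the 5-bit rd field in bits 25:29 of a 32-bit itoken (with bit 0 least significant).

#include <stdint.h>

static uint32_t get_rd(uint32_t itoken)
{
    return (itoken >> 25) & 0x1f;          /* extract the 5-bit rd field */
}

static uint32_t set_rd(uint32_t itoken, uint32_t rd)
{
    /* clear bits 25:29, then deposit the new value */
    return (itoken & ~((uint32_t)0x1f << 25)) | ((rd & 0x1f) << 25);
}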

We often want to give auxiliary information about some fields, which we do with fieldinfo directives. This directive gives mnemonic names to the two possible values of the a field.

fieldinfo a is [ names [ "" ", a" ] ]

a is the "annul" bit, and the toolkit uses its names below to help derive the names of branch constructors.

Architecture manuals often define opcodes in tables. The SPARC manual uses a hierarchy of tables; we show specifications for several. Tables F-1 and F-2 [SPARC International 1992, p. 227] are specified by

patterns

[ TABLE_F2 call TABLE_F3 TABLE_F4 ] is op = {0 to 3}

[ unimp _ Bicc _ sethi _ fbfcc cbccc ] is TABLE_F2 & op2 = {0 to 7}

The expressions in braces generate lists of patterns, and each pattern name in the bracketed list is bound to the corresponding pattern on the right. For example, call is bound to the pattern op = 1, and Bicc is bound to op = 0 & op2 = 2. Bindings to the wildcard "_" are ignored. The second line of the excerpt corresponds to Table F-1, but the identifier TABLE_F1 does not appear, because there are no references to Table F-1 from other tables.

Table F-3 [SPARC International 1992, p. 228] defines opcodes for integer arithmetic; it is specified by
patterns

[ add      addcc    taddcc     wrxxx
and        andcc    tsubcc     wrpsr
or         orcc     taddcctv   wrwim
xor        xorcc    tsubcctv   wrtbr
sub        subcc    mulscc     fpop1
andn       andncc   sll        fpop2
orn        orncc    srl        cpop1
xnor       xnorcc   sra        cpop2
addx       addxcc   rdxxx      jmpl
_          _        rdpsr      rett
umul       umulcc   rdwim      ticc
smul       smulcc   rdtbr      flush
subx       subxcc   _          save
_          _        _          restore
udiv       udivcc   _          _
sdiv       sdivcc   _          _ ]

is

TABLE_F3 & op3 = { 0 to 63 columns 4 }


The toolkit can handle opcode tables in row-major or column-major form. The expression {0 to 63 columns 4} generates the integers from 0 to 63 in the sequence (0, 16, 32, 48, 1, 17, 33, ..., 63), so that, for example, addcc is bound to the pattern op = 2 & op3 = 16, effectively using a column-major numbering.
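
To make the enumeration order concrete, the small C program below (ours, not part of the toolkit) prints the values of {0 to 63 columns 4} in exactly the order described above.

#include <stdio.h>

int main(void)
{
    int lo = 0, hi = 63, columns = 4;
    int rows = (hi - lo + 1) / columns;        /* 16 rows of 4 columns */

    for (int row = 0; row < rows; row++)
        for (int col = 0; col < columns; col++)
            printf("%d ", lo + row + col * rows);   /* 0 16 32 48 1 17 33 49 ... 63 */
    printf("\n");
    return 0;
}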

Table F-4 [SPARC International 1992, p. 229] defines the load and store opcodes; it is specified by
[ ld        lda          ldf        ldc
ldub        lduba        ldfsr      ldcsr
lduh        lduha        _          _
ldd         ldda         lddf       lddc
st          sta          stf        stc
stb         stba         stfsr      stcsr
sth         stha         stdfq      stdcq
std         stda         stdf       stdc
_           _            _          _
ldsb        ldsba        _          _
ldsh        ldsha        _          _
_           _            _          _
_           _            _          _
ldstub      ldstuba      _          _
_           _            _          _
swap        swapa        _          _ ]

is

TABLE_F4 & op3 = {0 to 63 columns 4}


Most operands to instructions are fields or integers, but some operands, like effective addresses, have more structure. We use typed constructors to define such operands. The address operands [SPARC International 1992, p. 84] have four possible formats:

constructors

dispA     rs1 + simm13! : Address  is  i = 1 & rs1 & simm13
absoluteA simm13!       : Address  is  i = 1 & rs1 = 0 & simm13
indexA    rs1 + rs2     : Address  is  i = 0 & rs1 & rs2
indirectA rs1           : Address  is  i = 0 & rs2 = 0 & rs1

Each line specifies a constructor by giving its opcode, operands, type, and pattern. Usually, as here, the opcode is simply the constructor's name. The plus signs among the operands indicate the preferred rendering of these constructors in assembly language. The operand specification simm13! indicates a signed integer operand destined for field simm13. Each of these constructors has type Address, which is effectively a disjoint union type containing an element for each constructor. We use the Address type below to specify operands of constructors for load and store instructions. When a field name is used as a pattern, as is rs1 on the right-hand side of the dispA constructor, it is an abbreviation for the more verbose pattern rs1 = rs1, which forces the field rs1 to be equal to the operand named rs1. This abbreviation appears frequently because operands are often placed directly in fields.

We also use typed constructors to specify "register or immediate" operands:

constructors

rmode rs2 : reg_or_imm is i = 0 & rs2

imode simm13! : reg_or_imm is i = 1 & simm13

Architecture manuals often group definitions of related instructions, like the load-integer instructions in the SPARC manual [SPARC International 1992, p. 90]. We use disjunctions of patterns to represent such groupings, which can make specifications more concise. The specification

patterns

loadg is ldsb | ldsh | ldub | lduh | ld | ldstub | swap

constructors

loadg [Address], rd

defines a group of untyped constructors, one for each general-purpose load instruction. The specification demonstrates two features of SLED: opcode expansion and implicit patterns. When the pattern loadg is given as the opcode in a constructor specification, it is expanded into individual disjuncts, and the construct is equivalent to repeated specifications of ldsb, ldsh, etc. Omitting the right-hand side tells the toolkit to compute a pattern by conjoining the opcode and all the operands. This idiom is ubiquitous in specifications of RISC machines. Finally, the square brackets and comma indicate assembly-language syntax.

These examples show how different elements of the specification interact. The constructor type Address is an abstraction representing "addressing mode." The four constructors of that type specify the different operands of addressing modes as well as their representations. The type Address is used in the loadg specification, so the load constructors take a first operand that represents an addressing mode. That operand must be the result of applying one of the four constructors of type Address defined above. For example, to load register %l0 from a location on the stack, a compiler might make the call ld(dispA(r_fp, -12), r_l0). This example assumes that r_fp and r_l0 are suitably defined constants.
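
The fragment below is a hedged sketch of how such a call might appear in application code. The C types and prototypes are assumptions of roughly the shape the generator produces for typed constructors; only the register numbers (SPARC %fp is register 30, %l0 is register 16) come from the architecture.

typedef struct Address_t *Address;              /* stand-in for the generated Address type */

extern Address dispA(unsigned rs1, int simm13); /* from "dispA rs1 + simm13! : Address" */
extern void    ld(Address a, unsigned rd);      /* from the expanded loadg group */

enum { r_fp = 30, r_l0 = 16 };                  /* SPARC register numbers for %fp and %l0 */

void load_local(void)
{
    ld(dispA(r_fp, -12), r_l0);                 /* ld [%fp - 12], %l0 */
}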

We use the same techniques to specify the logical, shift, and arithmetic instructions, which take two register operands and one operand of type reg_or_imm. The last line specifies 38 constructors at once:

[TABULAR DATA OMITTED]

Using reg_or_imm as an operand means that the second operand to any of these constructors must have been produced by applying either the imode constructor or the rmode constructor defined above.

The first column of Table F-7 [SPARC International 1992, p. 231] defines branch opcodes:

patterns

branch is any of

[ bn be ble bl bleu bcs bneg bvs ba bne bg bge bgu bgeu bpos bvc ],

which is Bicc & cond = {0 to 15}

This compound binding is a notational abbreviation that relieves us from writing the names in square brackets ("bn be... ") twice. It both defines these names and makes branch stand for the pattern matching any of them.

To specify the branch instructions, we need two more features of SLED: relocatable operands and sets of equations. Designating an operand as relocatable means its value may be unknown at encoding time:

relocatable addr

If an application tries to encode an instruction with such an operand, and if the operand's value is unknown, the encoding procedure emits a placeholder for the instruction, together with a relocation closure that can be used to overwrite the placeholder when the missing value becomes known [Ramsey 1996a]. The most common example of such an instruction is a branch to an unknown label.

For convenience, we choose an invalid instruction as a placeholder. Because the execution of an invalid instruction causes a fault, it is easy to detect application bugs that cause placeholders to be executed:

placeholder for itoken is unimp & imm22 = 0xbad

Although the target address is an operand to a branch, it is not found in any field of the instruction; instead, it is computed by adding a displacement to the program counter. The equation in curly braces shows the relationship, which is taken from SPARC International [1992, pp. 119-120]:

constructors

branch^a addr { addr = L + 4 * disp22! } is L: branch & disp22 & a

The label L refers to the location of the instruction, and the exclamation point is a sign-extension operator. The toolkit solves the equation so that the encoding procedure can compute disp22 in terms of addr and the program counter. The toolkit expands the 16 alternatives for branch and the two alternatives for a, so this line specifies 32 constructors.
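
As an illustration of what solving this equation yields, the C function below is a sketch, not generated code: it computes the disp22 field from the target address and the instruction's location, including the divisibility and field-width conditions that the toolkit attaches to the pattern.

#include <assert.h>
#include <stdint.h>

uint32_t branch_disp22(uint32_t addr, uint32_t L)
{
    int32_t diff = (int32_t)(addr - L);
    assert(diff % 4 == 0);                                /* (addr - L) mod 4 = 0 */
    int32_t disp22 = diff / 4;
    assert(disp22 >= -(1 << 21) && disp22 < (1 << 21));   /* fits in 22 signed bits */
    return (uint32_t)disp22 & 0x3fffff;                   /* value stored in the disp22 field */
}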

We specify synthetic instructions by applying the constructors that correspond to the instructions from which they are synthesized. Here are definitions of bset (bit set) and dec (decrement) [SPARC International 1992, p. 86]:

constructors

bset reg_or_imm, rd is or(rd, reg_or_imm, rd)

dec val!, rd is sub(rd, imode(val), rd)

The patterns on the right-hand sides are notated as constructor applications.

Some synthetic instructions may stand for more than one instruction sequence, depending on the values of operands. We specify such instructions by putting alternative branches on the right-hand side of a constructor specification. Each branch may have its own set of equations. The toolkit encodes the first possible branch whose equations have a solution and whose operand values fit in the fields to which they are bound. For example, the synthetic instruction set [SPARC International 1992, p. 84] expands to a single instruction when possible, but requires two in the general case:

constructors

sethi val!, rd is sethi & rd & imm22 = val@[10:31]

set val!, rd

when { val@[0:9] = 0 } is sethi(val, rd)

otherwise is or(0, imode(val), rd)

otherwise is sethi(val, rd); or(rd, imode(val@[0:9]), rd)

The bit-extraction operator, @[low:high], extracts the bits in the positions from low to high. The first branch, sethi, can be encoded whenever the least-significant 10 bits of val are zero. The second branch works when imode(val) can be encoded, i.e., when val fits in 13 signed bits. The final branch can always be encoded.
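
The selection among branches can be pictured with the following C sketch. The called procedures stand in for the generated encoders and do not have the generated signatures; only the decision logic is the point of the example.

#include <stdint.h>

extern void sethi_(int32_t val, unsigned rd);        /* illustrative prototypes */
extern void or_(unsigned rs1, int32_t simm13, unsigned rd);

void set(int32_t val, unsigned rd)
{
    if ((val & 0x3ff) == 0) {                        /* val@[0:9] = 0 */
        sethi_(val, rd);                             /* one instruction suffices */
    } else if (val >= -4096 && val <= 4095) {        /* imode(val) fits in 13 signed bits */
        or_(0, val, rd);                             /* or %g0, val, rd */
    } else {
        sethi_(val, rd);                             /* general case: two instructions */
        or_(rd, val & 0x3ff, rd);                    /* or rd, val@[0:9], rd */
    }
}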

3. SLED SYNTAX AND SEMANTICS

Now that we have illustrated SLED with an extended example, we present its syntax and semantics in detail. We also describe the toolkit's internal representation in enough detail so that our techniques could be used in other systems.

SLED solves not only the intellectual problem of describing instruction representations, but also several practical problems in the generation of encoding and decoding applications. Throughout this section, we associate language constructs with problems that they solve, and we identify constructs that are motivated by the special needs of encoding, decoding, or other applications.

To describe syntax, we use an EBNF grammar with standard metasymbols for

{sequences}, [optional constructs], and (alternative | choices).

We use large metasymbols to help distinguish them from literals. Terminal symbols given literally appear in typewriter font. Other terminal symbols and all nonterminals appear in italic font. Excerpts from the grammar always begin with the name of a nonterminal followed by the ⇒ ("produces") symbol.

Specification is the grammatical start symbol for SLED specifications. Within a specification, definitions must appear before uses, but otherwise the parts of a specification may appear in any order; so a specification is a list of spec:

specification ⇒ {spec}

3.1 Tokens and fields

The toolkit supports both little-endian and big-endian bit numberings.

spec ⇒ bit 0 is (most | least) significant

The default numbering makes bit 0 the least-significant bit.

fields declarations specify how to divide tokens into fields. One fields declaration is given for each class of tokens; only fields named in the declaration can be extracted from tokens of that class. Each field appears in tokens of exactly one class. The fields declaration binds field names to bit ranges and specifies the number of bits in tokens of its class. The toolkit generates the shifts and masks needed to manipulate the value of a field in a token. The fields syntax is as follows:

spec ⇒ fields of class-name (width) {field-name low-bit:high-bit}

Field values are always unsigned; storing signed values in fields requires the explicit sign-extension operator, a postfix exclamation point. For example, this operator is applied to the displacement field disp22 in the definition of the SPARC branch constructors. We make all field values unsigned because implicit sign extension can be confusing - people reading specifications should not have to remember which fields are signed and which are unsigned. Explicit sign extension also supports the use of the same field in different contexts with or without sign extension.
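
For concreteness, the helper below (ours, not the toolkit's) shows the arithmetic the postfix exclamation point denotes: the low n bits of a field value are reinterpreted as a two's-complement number.

#include <stdint.h>

int32_t sign_extend(uint32_t field, unsigned n)      /* assumes 0 < n < 32 */
{
    uint32_t sign_bit = (uint32_t)1 << (n - 1);
    if (field & sign_bit)
        return (int32_t)((int64_t)field - ((int64_t)1 << n));
    return (int32_t)field;
}

For example, sign_extend(0x3fffff, 22) yields -1, which is how a disp22 field of all ones is read.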

Fields solve the problem of specifying binary representations at the lowest level. They offer several advantages over bit strings, a more usual alternative. To make a token from bit strings, the strings must be concatenated in the right order; the order of fields is implicit in their declarations. One cannot assign the wrong number of bits to a field, and the toolkit detects cases in which fields overlap or leave gaps.

When instructions vary in size, more than one class of tokens may be needed. On the Intel Pentium, instructions are composed of 8-, 16- and 32-bit tokens, which must be given different classes because they are of different sizes. It can even be useful to put tokens of the same size in different classes. For example, the Pentium uses a "ModR/M" byte to specify addressing modes and an "SIB" byte to identify index registers [Intel Corp. 1993, p. 26-3]:

[TABULAR DATA OMITTED]

The fields declarations for these bytes are

fields of ModRM (8) mod 6:7 reg_opcode 3:5 r_m 0:2

fields of SIB (8) ss 6:7 index 3:5 base 0:2

Dividing tokens into classes helps detect errors in specifications. For example, putting the ModR/M and SIB tokens in different classes ensures that a user cannot mistakenly match both a mod field and an index field in the same byte.

One could also divide SPARC tokens into classes, e.g., by using a different class for each instruction format. One would have to define several replicas of fields that, like op, are common to multiple formats, because each field belongs to exactly one class. We judge that the extra effort would not pay off; the toolkit checks that the fields used in instructions partition the instructions' tokens, and this check seems adequate to detect errors on machines like the SPARC.

SLED specifications can include information about the names of field values and about the way fields are expected to be used in an application. The syntax used is as follows:

spec ⇒ fieldinfo {field-specifier is [ {field-item} ]}

field-specifier ⇒ field-name | [ {field-name} ]

field-item ⇒ sparse [ binding {, binding} ]
           | names [ {Ident | String} ]
           | checked | unchecked | guaranteed

binding ⇒ (Ident | String) = integer

sparse and names specify names of fields. names is used when all values have names; sparse is used otherwise. Naming field values solves no single problem; the names are used in a variety of ways. The most unusual use may be SLED's use of field names in constructor specifications; when fields are used to specify constructor opcodes, the names of the values become part of the names of constructors. For example, our SPARC specification uses the names "" and ", a" for the values 0 and 1 of the a field, and these names become part of the names of branch constructors. The toolkit also uses the names when generating encoding procedures that emit assembly language and when generating disassemblers. Finally, the toolkit can generate tables of field names so applications can print names of field values.

The other information about fields helps solve the problem of generating efficient encoders. The toolkit normally checks field values at encoding time to be sure they can be represented in the number of bits available. These safety checks are needed only when field values are supplied by an application; no safety checks are generated when the toolkit can infer that values are representable. The checks can be fine-tuned using the checked, unchecked, and guaranteed attributes of fields. Application writers unwilling to pay for a compare and branch can designate fields as unchecked, in which case encoding procedures do not check their values but simply mask out high bits so tokens are not corrupted by bogus values. Those unwilling to pay even the cost of masking can designate fields as guaranteed, in which case their values are used without checking or masking; the application guarantees that the value fits. For example, code generators typically guarantee fields denoting registers, since the register allocator can easily ensure that register numbers fall in the proper range. Such a guarantee could be added to our SPARC example by writing

fieldinfo [ rs1 rs2 rd fs1 fs2 fd cd ] is [ guaranteed ]

Fields are checked by default.
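
The difference between the three policies can be sketched in C as follows. This is illustrative code for the 5-bit rs1 field (bits 14:18 of an itoken), not what the toolkit actually generates.

#include <stdio.h>
#include <stdlib.h>

enum policy { CHECKED, UNCHECKED, GUARANTEED };

static unsigned place_rs1(unsigned itoken, unsigned rs1, enum policy p)
{
    switch (p) {
    case CHECKED:                   /* default: verify the value fits in 5 bits */
        if (rs1 >= 32) {
            fprintf(stderr, "field rs1: value %u does not fit in 5 bits\n", rs1);
            exit(1);
        }
        break;
    case UNCHECKED:                 /* mask so a bogus value cannot corrupt the token */
        rs1 &= 0x1f;
        break;
    case GUARANTEED:                /* the application promises the value fits */
        break;
    }
    return itoken | (rs1 << 14);
}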

3.2 Patterns

Patterns constrain both the division of streams into tokens and the values of the fields in those tokens. When instructions are decoded, patterns in matching statements identify interesting inputs; for example, a pattern can be defined that matches any branch instruction. When instructions are encoded, patterns in the machine specification specify what tokens are written into the stream.

Patterns are composed from constraints on fields. A constraint fixes the values a field may have. Constraints come in two forms: range constraints and field bindings. Range constraints are used when the values permitted in a field are known statically. Range constraints are represented internally in the form lo <= f < hi, forcing the value of the field to fall in a range. The external syntax is more restrictive; it requires that the field name be to the left of a single relational operator. The general form can be obtained by conjoining two constraints on the same field. The restricted syntax presents no burden in practice, because almost all range constraints use a range that contains one value, and we write them with an equals sign, e.g., op = 1.

Field bindings are used when the value of a field is not known until encoding time. A field binding forces a field to be equal to a value computed dynamically, and the dynamic computation is represented as an expression containing free variables. Field bindings are also written with equals signs.

Patterns are composed by conjunction (&), concatenation (;), and disjunction (|). They can also be computed by applying constructors. The syntax for patterns is as follows:

[TABULAR DATA OMITTED]

Patterns and their composition are most easily understood by looking at the rules for matching patterns. Patterns are tested for matching against sequences of tokens; the special pattern epsilon matches the empty sequence. For each constraint, the toolkit checks the field named therein to see if it falls in the range specified in a range constraint or is equal to the value bound in a field binding.

Patterns can be combined by conjunction, concatenation, or disjunction. When p and q are patterns, a conjunction "p & q" matches if both p and q match. We typically use conjunction to constrain multiple fields within a single token. A concatenation "p; q" matches if p matches an initial sequence of tokens and if q matches the following tokens. We typically use concatenation to build up patterns matching sequences of more than one token, for example, to match effective addresses on the Pentium. A disjunction "p | q" matches if either p or q matches. We typically use disjunction to group patterns for instructions that are related, e.g., to group the SPARC integer-arithmetic instructions.

The wildcard constraint "some class" matches any token of class class; for example, on the SPARC, "some itoken" matches any 32-bit token.

The labeled pattern L: p matches whenever p matches, and it binds the identifier L to the location in the instruction stream where p matches.

The ellipsis has no effect on matching, but it relaxes restrictions on conjunction, as described below.

Patterns solve the intellectual problem of describing binary representations. Each composition operator addresses a different need. Conjunction specifies how values in fields are combined to form tokens. Concatenation describes representations containing multiple tokens in sequence. Disjunction describes alternatives. Concatenation and disjunction operators are found in regular expressions. Unlike regular expressions, patterns do not have a Kleene closure (repetition) operator. This omission, together with the ability to examine fields in any order, distinguishes the problem of matching patterns from the problem of matching regular expressions.

3.2.1 Representing Patterns. This section presents a detailed description of the toolkit's representation of patterns. Studying the details of the representation is the best way to understand the meanings of patterns and the pattern operators and to understand the utility of patterns in generating encoders and decoders. The details can be confusing, because we use similar but not identical list structures at several levels, and because the structures play different roles in different contexts. Suggestive terminology helps distinguish structures and roles at each level.

Patterns are represented in a disjunctive normal form. The normal form has a three-level structure; the levels correspond to the three ways to combine patterns. Figure 1 shows the components of the normal form, the terminology used to refer to them, and the rules for matching them. We use several synonyms for each component, changing synonyms as we shift our focus from the component's role on its own to the component's relationship with the component above.

Every pattern is represented as a disjunction, that is, a list of alternatives. An empty list is permitted, even though the empty disjunction never matches.(1) Each disjunct, or alternative, is a sequence. Each item in a sequence is a conjunction of constraints. A pattern matches a sequence of tokens when one of its disjuncts (alternatives) matches. That disjunct matches a sequence of tokens when every sequent (conjunction) matches the corresponding token. The empty sequence, denoted by epsilon, always matches, consuming no tokens. Finally, a conjunction matches a token if the token satisfies all of the constraints in the conjunction. Each conjunction applies to a particular class of tokens, and all the constraints in the conjunction must constrain fields from that class. The empty conjunction, which is denoted by some class, is permitted; it matches any token of the associated class.

We define the shape of a sequence to be the list of token classes associated with the conjunctions of that sequence. Encoding and decoding choose a particular disjunct (sequence) to emit or match, and the shape of the sequence determines which tokens are emitted or matched when that sequence is encoded or decoded.

We can define simple constraints and the pattern operators in terms of the normal form of patterns. It is not hard to show that these definitions, combined with the rules for matching in normal form, imply the matching properties described above.

The normal form of a simple constraint is a pattern with a single disjunct, which is a sequence of length 1, in which the single sequent contains the constraint. (A wildcard constraint has a form in which the sequent contains no constraints, i.e., it is the empty conjunction.) The normal forms of p | q and p; q are straightforward. We form p | q by concatenating the disjuncts of p and q to form one large disjunction. We form p; q by distributing concatenation over disjunction; and we concatenate two sequences by concatenating their sequents.

We also form p & q by distributing over disjunction, but the rules for conjoining two sequences are more complicated. The basic rule is that the sequences to be conjoined must have the same shape, i.e., they must be the same length, and the classes associated with corresponding sequents must be the same. For example, all of the conjunctions in the SPARC example operate on sequences of length 1, and each sequent comes from the itoken class. The Pentium is more complicated. For example, the pattern mod = 0 & r_m = 5 is permitted, because both conjuncts constrain fields from the ModRM class. The pattern mod = 0 & index = 2 is not permitted, because mod is from the ModRM class, but index is from the SIB class. We conjoin two sequences of identical shape by conjoining their individual sequents, elementwise. Conjoining two sequents simply means conjoining their constraints; if both sequents constrain the same field, their conjunction constrains the field to lie in the intersection of the two ranges.

The basic rule for conjunction is too restrictive on a machine like the Pentium, in which effective addresses of varying shapes must be conjoined with opcodes of a fixed shape. If the shape of one sequence is a prefix of the shape of another, we can conjoin two sequences elementwise until we run out of elements in the shorter sequence, and then we can take the remaining elements from the longer sequence unmodified. A similar technique works when one sequence is a suffix of another.

If the toolkit used prefixes or suffixes automatically, it might silently accept an unintended, incorrect conjunction, so it uses them only when told to do so. The specification writer uses an ellipsis (". . .") before or after any pattern to liberalize conjunctions with that pattern. The pattern p & q . . . is defined whenever q's shape is a prefix of p's shape. q is conjoined with the prefix of p whose shape matches its shape, and the rest of p is concatenated to the result. Similarly, p & . . . q is defined whenever q's shape is a suffix of p's shape, and the patterns are aligned at the end instead of the beginning. The ellipsis has the effect of making a pattern "lose its shape" where the ellipsis appears; so p . . . & . . . q is never legal, because p . . . has no well-defined suffix, and . . . q has no well-defined prefix.

The restrictions on conjunction, with or without the ellipsis, guarantee that each disjunct in a valid pattern corresponds to a sequence of tokens. The toolkit uses this invariant to generate both encoders and decoders. These rules prohibit "mixing" tokens of different classes; in each instruction, each sequence of bits comes from a token of a unique class.

3.2.2 Conditions and Names. Free variables may appear not only in field bindings, but also in conditions associated with a pattern. No conditions appear in the grammar for patterns; instead, conditions are implicit in other parts of the specification and are associated with patterns in the toolkit's internal representation. For example, encoding of a field binding is subject to the condition that the computed value fit in the field; the condition becomes part of the pattern in which the field binding appears. Internally, this condition is derived from an operator that narrows a value to fit in the number of bits available. The toolkit uses a signed narrow for sign-extended fields and an unsigned narrow for other fields. From the unsigned narrow, the toolkit derives the condition 0 <= f < 2^n, for a value f put into a field of n bits. From the signed narrow, the toolkit derives the condition -2^(n-1) <= f < 2^(n-1). Other conditions may be derived from equations in a constructor definition. For example, most RISC branch instructions are described by equations that have solutions only under the condition that the target address differs from the program counter by a multiple of the word size.
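
Written as C predicates, the two narrowing conditions are as follows (a sketch, assuming n < 32; these helpers are ours, not the toolkit's).

#include <stdbool.h>
#include <stdint.h>

bool fits_unsigned(int64_t f, unsigned n)    /* 0 <= f < 2^n */
{
    return f >= 0 && f < ((int64_t)1 << n);
}

bool fits_signed(int64_t f, unsigned n)      /* -2^(n-1) <= f < 2^(n-1) */
{
    return f >= -((int64_t)1 << (n - 1)) && f < ((int64_t)1 << (n - 1));
}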

We associate conditions with each disjunct. Although conditions could be associated with each constraint or each sequent, the disjunct is a better choice, because it is the largest component of a pattern that must be matched in its entirety. The disjunct is also the natural place to put conditions associated with constructor definitions. For example, the binary representation of a SPARC branch instruction is represented by a pattern of one disjunct; the disjunct includes the condition (addr - L) mod 4 = 0, where L represents the location of the instruction, and addr represents the target address of the branch instruction.

Both patterns and disjuncts have names. A pattern's name can be used wherever a pattern is expected. Disjunct names are used to compute constructors' names when patterns are used in constructor opcodes.

Figure 2 shows the full representation of patterns, together with the rules for matching and encoding them. As an example, the alu pattern from the SPARC specification has 38 disjuncts and the name alu. The first disjunct has no conditions, one sequent, no labels, no ellipses, and the name add. The single sequent of that disjunct is a sequent of class itoken. It has two range constraints, 2 <= op < 3 and 0 <= op3 < 1, and no field bindings.

3.2.3 Using and Naming Patterns. Patterns are used in specifications in two ways. Opcodes are defined by binding names to pattern values, which contain no field bindings and are computed statically. Constructors and matching statements are defined using pattern expressions, which may contain free variables whose values are not known until encoding or decoding time. Such variables must be operands of the constructor; that is, they must be bound by the constructor's definition.

The patterns declaration binds names to pattern values; pattern expressions are used in constructor definitions and matching statements, which are described below. Pattern bindings are typically used to define opcodes and to group related opcodes, e.g., they are used to define the SPARC opcodes. Their syntax is

spec ⇒ patterns {pattern-binding}

pattern-binding ⇒ pattern-name is pattern
                | [ {pattern-name} ] is pattern
                | pattern-name is any of [ {pattern-name} ], which is pattern

Patterns bound to the special name "_" are ignored. Such patterns may correspond to unused opcodes, as in Table F-3 in the SPARC example. A pattern binding can bind one name to one pattern or each of a list of names to one of a list of patterns. Lists of patterns are created by using generating expressions in constraints. Generating expressions are modeled on expressions in Icon, which can produce more than one value [Griswold and Griswold 1990]. They are ranges or lists:

generating-expression ⇒ { lo to hi [columns n] } | [ {integer} ]

The values generated are enumerated in left-to-right LIFO order. For example, the SPARC example's declaration for Table F-1 binds the names TABLE_F2, call, TABLE_F3, and TABLE_F4 to the patterns op = 0, op = 1, op = 2, and op = 3, respectively.

3.3 Constructors

A constructor maps a list of operands to a pattern, which stands for the binary representation of an operand or instruction. Typed constructors produce operands; untyped constructors produce instructions. Because most manuals describe instructions in terms of their assembly-language syntax, we designed constructor specifications to resemble that syntax. A constructor specification begins with an opcode and a list of operands. It also gives a type and zero or more "branches," which designate possible representations.

spec ⇒ constructors {constructor}

constructor ⇒ opcode {operand} [: type-name] [branches]

A constructor without explicit branches is given the representation obtained by conjoining the opcode with the operands.

The type of a constructor determines how the corresponding encoding procedure can be used. Although a constructor with no explicit type is called "untyped," in fact it has a predefined, anonymous type - the type of instructions. Corresponding encoding procedures emit instructions. The encoding procedures for explicitly typed constructors produce values that can be used as operands to other constructors, as described below.

Opcodes are tricky. They can be simple strings, or they can be combinations of strings, pattern names, and field names, which are expanded to define multiple constructors with one specification. For example, the SPARC alu constructor specification expands the alu pattern to define 38 constructors at once. Compound opcodes are formed by joining strings or names using the ^ symbol.

opcode ⇒ opname {^ opname}

An opname can be the name of a field or pattern, or it can be an unbound name or a string. Unbound names mean the same as strings; for example, in the SPARC example, because the opname dispA is not previously defined, it is equivalent to "dispA". This notational convenience means that the names of constructors seldom need to be quoted.

When any opname is the name of a pattern or field, the toolkit expands opcodes by enumerating the disjuncts of patterns and the named values of fields. For example, the toolkit expands the branch^a opcode by expanding the pattern branch to the 16 disjuncts named in its definition, and it expands the field a to the two named values "" and ", a". The SPARC example's single constructor definition of branch^a is therefore equivalent to a series of 32 definitions:

constructors

"bn" addr { addr = L + 4 * disp22! } is L: bn & disp22 & a = 0

"bn,a" addr { addr = L + 4 * disp22! } is L: bn & disp22 & a = 1

. . .

"bvc,a" addr { addr = L + 4 * disp22! } is L: bvc & disp22 & a = 1

Because architecture manuals often use the same name to refer both to an opcode and to its instruction, we put constructors in a separate name space, so the same name can be used to refer both to constructors and to patterns.

Operands may be fields, integers, or patterns. Field and integer operands may be signed or unsigned, and they may be designated relocatable. Pattern-valued operands must result from applying constructors of a designated type. Operand types are distinguished by their names; an operand is a field or pattern if its name is that of a field or a constructor type, and it is an integer otherwise.

The type of an operand determines how its name can be used on the right-hand side of a constructor. Integer operands can be used only in integer expressions, which appear in field bindings. Field operands can be used as integers, but they can also be used as patterns, in which case the field name stands for the pattern binding that field to the corresponding operand, as shown in the SPARC example. Finally, pattern-valued operands can be used only as patterns.

A list of operands may be decorated with spaces, commas, brackets, quoted strings, and other punctuation. The punctuation represents assembly-language syntax, and the toolkit uses it to generate encoding procedures that emit assembly language and to generate a grammar that recognizes assembly language.

Constructors solve several intellectual problems. They give an abstract structure to an instruction set, they connect that structure both to binary representations and to assembly language, and they formalize instructions as functions mapping operands to binary representations. An instruction set's abstract structure comes from the types of the constructors and their operands. This structure is isomorphic to a grammar in which the start nonterminal corresponds to the anonymous type "instruction," and in which each explicit constructor type corresponds to an additional nonterminal. Each constructor corresponds to a production in which the constructor's type appears on the left-hand side, and its operands appear on the right. The terminal symbols of the grammar are the operands that are fields, integers, or relocatable addresses. The patterns on the right-hand sides of constructor definitions are equivalent to synthesized attributes of the grammar. Field names, constructor names, and punctuation define an assembly-language representation that is implicit in every constructor definition, and these representations are also equivalent to synthesized attributes of the grammar.

Relocatable addresses are not essential to the intellectual task of specifying representations; instead, they support separate compilation in encoding applications. Any field or integer operand can be designated relocatable by

spec ⇒ relocatable {identifier}

For example, the addr operand of the SPARC branch constructor is declared relocatable. Labels that appear in constructors' patterns are also relocatable. Applications typically use relocatable addresses to refer to locations bound after encoding, at link time. Allowing any operand to be relocatable simplifies implementation of applications that usually emit assembly language. For example, it simplifies construction of mld's code generators, because it enables automatic translation of existing assembly-emitting code generators into mld's binary-emitting code generators. Without the ability to make any operand relocatable, large parts of mld's code generators would have to be written by hand.

When a constructor that uses relocatable operands is applied, it checks to see if their values are known (e.g., they have been assigned absolute addresses). If so, it treats them as ordinary integers and emits the instruction. Otherwise, it emits placeholder patterns and creates a relocation closure [Ramsey 1996a]. The application holds the closure until the addresses on which it depends become known, at which point it applies the closure to overwrite the placeholder with the correct encoding of the instruction. Alternatively, the toolkit provides a machine-independent representation that can be used to write the closure to a file, from which another application could read and apply it.
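
The C sketch below shows the hold-and-apply discipline this implies for an application. RClosure and apply_closure are stand-ins for whatever the toolkit's library and the application provide; they are not the toolkit's actual API.

typedef struct RClosure RClosure;            /* relocation closure held by the application */

struct pending {
    RClosure *fixup;                         /* returned when a relocatable operand was unknown */
    struct pending *next;
};

/* Once the missing addresses become known, apply each closure so that it
   overwrites its placeholder with the correct encoding. */
void resolve_all(struct pending *list, void (*apply_closure)(RClosure *))
{
    for (struct pending *p = list; p != NULL; p = p->next)
        apply_closure(p->fixup);
}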

Placeholder patterns are associated with token classes:

spec ⇒ placeholder for class-name is pattern

The toolkit uses the shape of a constructor's pattern to compute its placeholder, so the placeholder is the same size as the relocated instruction that will overwrite it.

The branches of a constructor specification contain equations and patterns. The patterns specify binary representations, and the equations relate the constructor's operands to the variables used in the patterns.

branches ⇒ [ { equations } ] [is pattern]
         | { when { equations } is pattern | otherwise is pattern }

When a constructor has a single branch, the pattern can be omitted, in which case it is taken to be the conjunction of the constructor's opcode with its operands. The ability to specify multiple branches supports conditional assembly, as with the SPARC set constructor. When encoding, the toolkit emits the first branch for which all conditions are satisfied. As explained above, conditions are satisfied when (1) all values bound to fields fit in those fields and (2) all equations used in the branch have solutions. When decoding, the toolkit matches any of the branches. Otherwise is syntactic sugar for when { }.

Equations express relationships between operands and fields. As written, they relate sums of terms with integer coefficients. Terms include field and integer variables, from which one can extract bits by n@[lo:hi]. One can also sign-extend a variable or extracted bits with the postfix exclamation point, as shown in the descriptions of the SPARC branch constructors. Equations may include inequalities, which become conditions attached to disjuncts of a branch's pattern. Conditions may also arise from solving equations; for example, the condition (addr - L) mod 4 = 0, which is attached to the patterns in the SPARC branch constructors, is derived from the equation for those constructors. All conditions must be satisfied for the constructor to be matched or encoded.

The toolkit uses a simple equation solver [Ramsey 1996b]. To encode, the toolkit takes operands as known and solves for fields. To decode, the toolkit takes fields as known and solves for operands.

Constructors are represented essentially as lambda terms mapping operands to patterns. The results of solving equations are represented in the patterns as conditions or as expressions in field bindings, so the only free variables in a constructor's pattern are the constructor's operands. Constructors with multiple branches, like the set constructor in the SPARC example, result in patterns with multiple disjuncts. The encoding procedure associated with the constructor emits the first disjunct whose conditions are known to be satisfied. If a condition depends on the value of an unknown relocatable operand, the toolkit conservatively assumes that the eventual value may not satisfy the condition, and it moves on to the next disjunct. If all disjuncts depend on relocatable operands, the toolkit uses the final disjunct. This technique, while safe, is unsuitable for emitting span-dependent instructions; for example, it uses the most general representation for all forward branches. We believe that standard techniques for resolving span-dependent instructions [Szymanski 1978] can be applied to our specifications.

3.4 Matching Statements and Decoding

Decoding applications use the toolkit's matching statements. Matching statements provide a notation for writing instruction recognizers that are efficient and easily understood. Matching statements resemble ordinary case statements, but their arms are labeled with patterns. The first arm whose pattern matches is executed. The syntax for matching statements is

matching-statement ⇒ match code to
                       { | pattern [ { equations } ] [ [ name ] ] => code }
                       [else code]
                     endmatch

The terminal symbol code stands for a fragment of Modula-3 or C code. The code next to match evaluates to a location in an instruction stream. The representation of the instruction stream is implicit in code templates supplied by the application writer, as described below. Each arm may include equations that must be satisfied for the arm to match. A name in square brackets is bound to the name of the pattern that matched. If an arm's pattern matches, the code on the right-hand side of => is executed.

Matching-statement is itself a grammatical start symbol; it cannot be derived from specification. When generating decoders, the toolkit's translator reads a specification from one file, then transforms a different file containing one or more matching statements.

In a matching statement, every free variable in a pattern is a binding instance; the toolkit computes a value for each such variable, and the values can be used in the host-language code on the right-hand side of the arm labeled by the pattern. Free variables associated with typed constructors are bound to locations in the instruction stream. The generated decoder converts such bound locations to integers.

Matching statements can be embedded in programs written in Modula-3 or in C. The toolkit's translator acts as a simple preprocessor - it finds embedded matching statements and rewrites them into pure Modula-3 or C code.

Matching statements make an application's decoding code clear and concise. For example, ldb, a retargetable debugger for ANSI C, uses matching statements to implement control-flow analysis. Most of ldb's breakpoint implementation is machine independent; the only machine-dependent part is the analysis of control flow [Ramsey 1994a]. Figure 3 shows a simplified version of the SPARC code in ldb's breakpoint implementation, omitting subtleties associated with delayed branches. This code finds which instructions could be executed immediately after an instruction at which a breakpoint has been planted [Ramsey 1994a]. After an ordinary instruction, the only instruction that can follow is its inline successor, as computed by the first arm of the matching statement. FollowSet.T{L} is a set of addresses containing the single element L, which is the location of the successor instruction. Calls and unconditional branches also have only one instruction in their "follow set," but conditional branches have two. The two jmpl patterns are indirect jumps through registers; the GetReg procedure gets the value in the register in order to compute the target address. The matching statement in Figure 3 expands to nested case statements totaling about 90 lines of Modula-3 code. The count does not convey the difficulty of writing the code by hand, because the toolkit eliminates unnecessary tests by combining seemingly unrelated opcodes if they result in execution of the same code.

Application writers can use any representation of instruction streams; in particular, the toolkit does not constrain the application to use integers to represent locations. An application writer specifies a representation by supplying the toolkit with four code fragments: the data type used to represent locations, a template used to add an integer offset to a location, a template used to convert a location to an unsigned integer, and a template used to fetch a token of a specified width from a location. The templates are specified by

spec ⇒ fetch ( width | any ) using template
     | address type is template
     | address add using template
     | address to integer using template

The template symbols stand for quoted strings containing fragments of Modula-3 or C code mixed with escape sequences that stand for addresses, widths, and offsets. Widths are measured in bits; offsets are measured in units of pc_unit_bits:

spec ⇒ pc_unit_bits width

This size must evenly divide the width of every token; the default size is 8 bits.
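To make the four templates concrete, here is one hedged possibility for what an application might supply on a byte-addressed machine where locations are plain byte pointers; the helper names and the little-endian, 8-bit pc_unit_bits assumptions are ours, and the escape-sequence syntax of real templates is not shown.

#include <stdint.h>

/* One possible application-side view of instruction-stream locations:
   a plain byte pointer.  The four templates would then expand to code
   like these helpers (names invented for this sketch). */
typedef unsigned char *Location;                 /* "address type is ..." */

static Location loc_add(Location l, long bytes)  /* "address add using ..." */
{
    return l + bytes;
}

static unsigned long loc_to_int(Location l)      /* "address to integer using ..." */
{
    return (unsigned long)(uintptr_t)l;
}

static uint32_t fetch32(Location l)              /* "fetch 32 using ..." */
{
    /* assumes a little-endian stream and pc_unit_bits of 8 */
    return (uint32_t)l[0] | ((uint32_t)l[1] << 8)
         | ((uint32_t)l[2] << 16) | ((uint32_t)l[3] << 24);
}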

The toolkit builds a decision tree for each matching statement. The decision tree checks all applicable range constraints while examining each field at most once. If patterns in two arms use the same range constraints but have different conditions, the toolkit checks conditions sequentially, but this situation is rare. The toolkit tries to minimize the number of tests needed to identify an arm. No polynomial-time algorithm is known for this problem, and even though the toolkit builds decision trees at tool-compile time, it would take too long to generate and evaluate all possible decision trees. Our heuristics yield trees that are at least as good as trees we would write by hand.
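As a rough picture of such a decision tree (a sketch only, assuming SPARC-like fields op in bits 30:31 and op2 in bits 22:24, and not reproducing actual toolkit output), the generated decoder behaves like nested switch statements that examine each field at most once.

#include <stdint.h>

/* Hypothetical shape of a generated decoder; arm numbers are invented. */
static int classify(uint32_t insn)
{
    switch (insn >> 30) {               /* field op, tested once */
    case 0:
        switch ((insn >> 22) & 7) {     /* field op2, tested once */
        case 2: return 1;               /* e.g., a conditional-branch arm */
        case 4: return 2;               /* e.g., a sethi arm */
        default: return -1;             /* falls into the else arm */
        }
    case 1: return 3;                   /* e.g., a call arm */
    default: return 4;                  /* remaining arms */
    }
}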

4. SPECIFYING CISC INSTRUCTIONS

Tools may work well for RISC architectures without being very useful for CISC architectures. To demonstrate the utility of our specification language, we show two complex aspects of our Pentium specification: addressing modes and variable-sized operands. Figure 4 shows constructor specifications for the Pentium's addressing modes. We have given each constructor the type Eaddr, which we have chosen to represent effective addresses. Values of type Eaddr are used as operands to untyped constructors, as shown below. Again, the brackets and asterisks in the specification are punctuation indicating suggested assembly-language syntax. Figure 5 depicts the structures of the patterns used in Figure 4.

Effective addresses begin with a one-byte ModR/M token, which contains an addressing mode and a register. In indexed modes, the ModR/M token is followed by a one-byte SIB token, which holds index and base registers and a scale factor ss.

Finally, some modes take immediate displacements [Intel Corp. 1993, Tables 26-2 to 26-4]. The tokens and fields used in effective addresses are as follows:

fields of ModRM (8) mod 6:7 reg_opcode 3:5 r_m 0:2

fields of SIB (8) ss 6:7 index 3:5 base 0:2

fields of I8 (8) i8 0:7

fields of I16 (16) i16 0:15

fields of I32 (32) i32 0:31

The fields i8, i16, and i32 occupy whole tokens.

We define constructors of type Eaddr to create effective addresses in 32-bit mode. The first group of constructors specifies the nonindexed addressing modes. The simplest mode is encoded by mod = 3; it is a register-direct mode that can refer to any of the machine's 8 general registers. The next 3 modes are register-indirect modes with no displacement, 8-bit displacement, and 32-bit displacement. The 8-bit displacement is computed by sign-extending the i8 field. Semicolons separate ModR/M tokens from the displacement tokens that follow.

The inequality reg != 5 shows that r_m may not take the value 5 in simple indirect mode. Instead of denoting indirect use of the base pointer, which is the register normally encoded by 5, the combination mod = 0 & r_m = 5 encodes a 32-bit absolute mode. The inequality reg != 4 shows that the value 4 may not be used to encode indirect use of the stack pointer, which is the register normally encoded by 4. This value is used instead to encode the indexed modes, which use an SIB token as well as the ModR/M token.

The indexed modes are the second group in Figures 4 and 5. The ModR/M token in which r_m = 4 is followed by an SIB token. The stack pointer may not be used as an index register (index != 4). Depending on the value of mod in the ModR/M token, the SIB token may end the address, or an 8-bit or 32-bit displacement may follow. Finally, "mod = 0 & base = 5" denotes an indexed address with no base register and a 32-bit displacement.
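A hedged C sketch of how a decoder might pull these tokens apart, using the field positions declared above; the structs and helper names are ours, not the toolkit's.

#include <stdint.h>

/* Field extraction for ModRM and SIB, following mod 6:7, reg_opcode 3:5,
   r_m 0:2 and ss 6:7, index 3:5, base 0:2. */
struct modrm { unsigned mod, reg_opcode, r_m; };
struct sib   { unsigned ss, index, base; };

static struct modrm split_modrm(uint8_t b)
{
    struct modrm m = { b >> 6, (b >> 3) & 7, b & 7 };
    return m;
}

static struct sib split_sib(uint8_t b)
{
    struct sib s = { b >> 6, (b >> 3) & 7, b & 7 };
    return s;
}

/* An SIB token follows only when r_m = 4 and mod != 3; a displacement of
   0, 8, or 32 bits follows, depending on mod and on the base field. */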

None of the addressing modes specifies a value for the reg_opcode (middle) field of the ModR/M token. This field is not part of the effective address; depending on the instruction, it can be part of the opcode, or it can denote a register operand. Effective addresses are used by conjoining them with a pattern that constrains reg_opcode; the resulting pattern specifies every bit of the ModR/M token. We need the ellipsis operator to make the conjunction work. Even though effective addresses have several different shapes, all the shapes begin with ModRM, so it is legal to write Eaddr & p ... whenever p's shape is ModRM. The move-byte and move-byte-immediate instructions show the use of the ellipsis:

constructors

MOV^"mrb" Eaddr, reg is MOV & Eb. Gb; Eaddr & reg_opcode = reg ... MOV.Eb.Ib Eaddr, i8! is MOV.Eb.Ib; Eaddr & reg_opcode = 0 ...; i8

Our specifications of the Pentium's opcodes, which are not shown in this article, mimic the tables in the manual [Intel Corp. 1993]. The manual uses families of opcodes (ADD, MOV, etc.) that are distinguished by suffixes indicating the locations and sizes of the destination and source operands. The suffix "Eb,Gb" indicates that the destination is given by an effective address, that the source is in a general-purpose register, and that both the source and destination operands are one byte wide. In many cases, as with "MOV Eb,Gb", we specify the operation and the suffix separately, then conjoin them to get an opcode, thereby writing m + n specifications instead of m × n specifications. The "Eb,Ib" suffix, which uses an immediate operand as the source, cannot use this scheme, so we specify the full opcode as MOV.Eb.Ib.

The Pentium uses an unusual method of identifying the sizes of operands. Most instructions come in three variants: one each for 8-bit, 16-bit, and 32-bit operands. Typically the 8-bit variant has a distinct opcode, but the 16- and 32-bit variants share an opcode and are distinguished by the presence or absence of an instruction prefix. We specify an "object varying" pattern as a sequence that is empty or that contains the prefix

patterns
  ow is OpPrefix
  od is epsilon
  ov is ow | od

where ow is mnemonic for "object word" and od for "object doubleword." This specification assumes that the hardware codes for 32-bit doubleword operands by default; the alternate assumption could be specified by exchanging the definitions of od and ow. To specify both the 16- and 32-bit variants of the memory-to-register move instruction, we write

constructors

MOV^"mr"^ov Eaddr, reg is ov; MOV & Ev. Gv; Eaddr & reg_opcode = reg ...

This specification differs from the move-byte specification in that we have used the suffix "Ev,Gv", which codes for operands of either word or longword ("variable") size, depending on the presence or absence of a prefix. The pattern ov expands to the prefix for the 16-bit variant and to the empty sequence for the 32-bit variant.
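In terms of emitted bytes, the effect is roughly the following; this sketch is ours, and it assumes the conventional x86 operand-size prefix 0x66 for OpPrefix and the Intel opcode byte 0x89 for MOV Ev,Gv, values taken from the Intel opcode map rather than from the specification above.

#include <stdint.h>
#include <stddef.h>

/* Minimal sketch (not toolkit output) of how the ov pattern plays out. */
static size_t emit_mov_mr(uint8_t *buf, int word_sized, uint8_t modrm)
{
    size_t n = 0;
    if (word_sized)          /* 16-bit variant: ov expands to the prefix ow */
        buf[n++] = 0x66;
    buf[n++] = 0x89;         /* opcode shared by the 16- and 32-bit variants */
    buf[n++] = modrm;        /* ModR/M byte from the Eaddr operand */
    /* an SIB byte and a displacement may follow, depending on the mode */
    return n;
}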

When immediate operands are used, all three variants must have separate specifications, because the operands are different sizes. The 8-bit move-immediate instruction appears above; the remaining variants are specified by

constructors

MOV.Ev.Iv^ow Eaddr, i16! is ow; MOV.Ev.Iv; Eaddr & reg_opcode = 0 ...; i16

MOV.Ev.Iv^od Eaddr, i32! is od; MOV.Ev.Iv; Eaddr & reg_opcode = 0 ...; i32

Again, only one of these instructions has a prefix, since od stands for the empty sequence.

Two features of SLED exist only to enable the description of CISC machines. One, the ability to define tokens of different sizes and classes, is used only to describe the Pentium and the Motorola 68000. The other, the ability to form sequences of tokens, is used in both CISC and RISC specifications, but we have used it only rarely in RISC specifications, typically to synthesize "instructions" from multi-instruction sequences.

Owen C. Braun's description of the 68000 [Braun 1996] exposes several shortcomings of SLED. Some addressing modes have different representations, depending on where they are used; currently, they must be associated with distinct sets of constructors of distinct types. For example, a compiler writer must call one of two procedures to encode a register-direct mode, depending on whether it is to be the source or the destination operand of a move instruction. Not all of the 68000's addressing modes are valid in all instructions; there are several different subsets, such as the "data-alterable" modes. Our (incomplete) specification of the DSP56000 exhibits similar problems. These problems can be handled by defining multiple sets of constructors, but the resulting specifications are ugly and difficult to maintain.

We are considering two extensions that would help improve specifications of the 68000 and the DSP56000 and would help specify address prefixes on the Pentium. One would enable us to attach multiple pattern-valued attributes to constructors and to use different attributes to specify alternate representations or parts of representations. Another would support simple specification of subsets of typed constructors, which we could use to specify restrictions on addressing modes. In both cases, we believe that simplifications in CISC specifications will justify the extra complexity in SLED. Because we have not implemented these extensions, we consider the details beyond the scope of this article.

5. IMPLEMENTATION

The toolkit's translator, generator, and checker are combined in a single Icon program [Griswold and Griswold 1990] of about 10,000 lines. We omit the details of the implementation, but we do explain what the implementation does, what it assumes, and how the toolkit's library supports those assumptions.

For each matching statement, the toolkit generates an efficient decoder using nested case statements. These decoders manipulate instruction streams using the code templates supplied by the application writer. Because the decoders need only what is in the templates, they are isolated from other properties of the decoding application, including byte order. They are also independent of any generated encoders and of the toolkit's library.

The toolkit creates an encoding procedure from each constructor in a specification. Procedures that come from typed constructors are useful only for producing operands to be passed to other encoding procedures. In particular, such procedures never have side effects; they return values. Procedures generated from untyped constructors do have side effects; they emit instructions. If the constructor's pattern has no disjunct whose conditions are satisfied, the encoding procedure calls an error handler supplied by the application. Here are signatures for the C procedures that are generated from the Address constructor dispA and the untyped constructor ldsb, which appear in the SPARC example:

Address_Instance dispA(unsigned rs1, int simm13);

void ldsb(Address_Instance Address, unsigned rd);

The result of dispA could be used as the first argument to ldsb.
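For example, an application might write the following call, using the declarations above; the register numbers are arbitrary, and the intended effect is to encode something like the SPARC instruction ldsb [%r30 - 8], %r9.

void emit_example(void)
{
    /* pass the Address value built by dispA directly to ldsb */
    ldsb(dispA(30, -8), 9);
}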

Normal encoding procedures emit binary representations, as determined by the encoding rules in Figure 2. The toolkit can also generate "encoding" procedures that emit assembly language. The assembly language is usually inferred from punctuation in constructor specifications, but it is possible to specify assembly syntax separately, as described in the toolkit's reference manual [Ramsey and Fernandez 1994b]. This ability is useful when several assembly languages are in common use for a single architecture, as is the case for the Pentium.

The toolkit can generate direct or indirect interfaces to encoding procedures. Indirect interfaces use interface records - structures containing function pointers. Applications can choose binary or assembly language at run time by using a pointer to the appropriate interface record.
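A hedged sketch of the idea follows; the struct layout and names are ours, not the toolkit's generated interface, and the member types reuse the signatures shown earlier.

/* Hypothetical interface record: a structure of function pointers, one
   slot per encoding procedure. */
typedef struct {
    Address_Instance (*dispA)(unsigned rs1, int simm13);
    void (*ldsb)(Address_Instance Address, unsigned rd);
    /* ... one member per constructor ... */
} Encoder;

extern const Encoder binary_encoder, assembly_encoder;   /* hypothetical */

void emit_with(const Encoder *e)
{
    e->ldsb(e->dispA(30, -8), 9);   /* same call, either representation */
}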

Binary encoding procedures have side effects on a global instruction stream. When values of relocatable operands are not available, they also create relocation information in the form of closures. The encoding procedures make certain assumptions about instruction streams and relocatable operands. Here we enumerate the assumptions and explain the implementations in the toolkit's library, which satisfy the assumptions.

A relocatable address represents the value of a relocatable operand. It is an abstraction with two operations: force and known. Force takes a relocatable address and produces an (integer) absolute address. Known tells whether force can be applied. Generated encoding procedures use known to decide whether to emit tokens or to create relocation closures, and they use force to get the values of the operands themselves.

An instruction stream holds tokens emitted by encoding procedures. It has a location counter that marks the location of the next token to be emitted. Like relocatable addresses, the location counter supports the known and force operations. Encoding procedures assume that they can manipulate the location counter and that they can call emitters to put tokens into the instruction stream. Emitters write bits and advance the location counter. The library includes a little-endian emitter, a big-endian emitter, and two emitters that use the native byte order of the host machine, or an application can supply its own emitters. One of the native emitters is faster than the other, but it requires that the location counter always be aligned on a multiple of the token size.
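For instance, a little-endian emitter for 32-bit tokens might look roughly like this; the stream structure and names are invented for the sketch, and the library's real emitters also handle relocation and other token sizes.

#include <stdint.h>

/* Sketch of a little-endian emitter for 32-bit tokens. */
struct stream { unsigned char *base; unsigned long lc; /* location counter */ };

static void emit32_le(struct stream *s, uint32_t tok)
{
    unsigned char *p = s->base + s->lc;
    p[0] = tok & 0xff;
    p[1] = (tok >> 8) & 0xff;
    p[2] = (tok >> 16) & 0xff;
    p[3] = (tok >> 24) & 0xff;
    s->lc += 4;                      /* writing advances the location counter */
}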

Most encoding applications need a richer model of instruction stream than that assumed by the toolkit's encoding procedures. The toolkit's library provides relocatable blocks, which implement the instruction-stream abstraction. They also support many other operations, including changing blocks and locations, assigning addresses to blocks, emitting tokens into blocks, and writing blocks into files or memory. An application can use any number of relocatable blocks, and it can emit tokens into a block before the block's address has been assigned. For example, a UNIX assembler might use three blocks, one each for the code, initialized data, and uninitialized data sections. The assembler would let the linker determine and assign the addresses of those blocks.

A label, which points to a location in a relocatable block, provides the basic known and force operations.(2) The toolkit does not associate names with labels; applications can use any method to name and find labels. For more flexibility, the library also provides an implementation of relocatable addresses that represents an address as the sum of a label and a signed offset. This representation is adequate for applications like compilers and linkers. Authors of other applications can use more sophisticated representations (e.g., linear expressions over addresses and labels) without changing the code generated by the toolkit.
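A minimal sketch of that label-plus-offset representation and its two operations follows; the types and names are ours, invented for illustration.

/* Sketch of a relocatable address as label + signed offset, supporting the
   known and force operations described above. */
struct label   { int has_address; unsigned long address; };
struct reladdr { struct label *label; long offset; };

static int known(const struct reladdr *a)
{
    return a->label->has_address;
}

static unsigned long force(const struct reladdr *a)
{
    /* valid only when known(a); yields the absolute address */
    return a->label->address + a->offset;
}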

The toolkit needs little support from applications. Applications' primary obligations are to manage memory and to supply or select code for fetching and storing tokens. Encoding applications must supply a routine that the library uses to allocate memory for closures, labels, and relocatable blocks. Saving, applying, writing, and discarding closures are also the application's responsibility. In return, the application can choose its own policies for allocating memory and for managing closures. The toolkit is careful not to require large blocks of contiguous memory, not even to store large relocatable blocks. Finally, the toolkit provides no code to associate names with relocatable blocks, labels, or other abstractions; applications must supply their own.

The toolkit generates efficient code. When safety checks are elided, each encoding procedure executes about a dozen instructions. Generated decoders test each field at most once, and they test them in an order that quickly identifies the right arm of the matching statement.

The toolkit's generator can detect many internal inconsistencies in specifications, but it cannot identify specifications that are internally consistent but do not match a target machine. There are several ways to write such incorrect specifications, for example, by getting operand order wrong or by interchanging names in an opcode table. The toolkit's checker [Fernandez and Ramsey 1997] finds inconsistencies between the mapping specified in SLED and the mapping implemented by a trusted, independent assembler. The checker exploits the generator's ability to create encoding procedures for both binary and assembly representations. It exercises each constructor at least once, emitting both representations. The trusted assembler translates the assembly into binary, and the checker compares the two binary representations. If they are identical, the toolkit's specification is probably consistent with the assembler; if not, the toolkit and the assembler encode some instruction differently, and there is probably an error in the specification. A disassembler, which can be generated by the toolkit, makes it easier to find the source of the error.

6. RELATED WORK

Ferguson [1966] describes the "meta-assembler," which creates assemblers for new architectures. A meta-assembler works not from a declarative machine description but from macros that pack fields into words and emit them; it is essentially a macro processor with bit-manipulation operators and special support for different integer representations.

Most architecture-description languages emphasize the instruction semantics necessary for building tools that verify and simulate an instruction set, not the encoding and decoding descriptions necessary for building tools that process machine code.

Wick [1975] describes a tool that generates assemblers based on descriptions written in a modified form of ISP [Bell and Newell 1971]. His work investigates a different part of the design space; his machine descriptions are complex and comprehensive. For example, they describe machine organization (e.g., registers) and instruction semantics as well as instruction encoding.

LISAS [Cook and Harcourt 1994] is another specification language that includes distinct semantic and syntactic descriptions. It specifies binary representations by mapping sequences of named fields onto sequences of bits, a technique that works well for RISC machines but is awkward for CISC.

The nML specification language [Fauth et al. 1995] uses an attribute grammar to specify instruction sets. The underlying grammar, without attributes, should be the same as the grammar induced by our constructors and their types. For specification, nML uses "OR-rules" and "AND-rules." The OR-rules are sums. They correspond to our constructor types when viewed as disjoint unions, and they also correspond to alternatives in a grammar. The AND-rules are products. They correspond to Cartesian products of operands of our constructors, and they also correspond to sequences of symbols in a production of a grammar.

nML and SLED use different notations and types to associate information with instruction sets. nML uses synthesized attributes to represent register-transfer semantics, assembly-language syntax, and binary representations. Writers can introduce extra attributes to represent things like addressing modes. The values of attributes may be integers, character strings, bit strings, or "register-transfer sequences." Binary representations are represented as bit strings. Attribute values are specified by writing explicit attribute equations for every production in the grammar, and they can be computed using C-like arithmetic functions, a printf-like formatting function, and a special notation for register-transfer sequences. An nML description can be used to build a simulator, which includes an instruction decoder, and a code generator, which includes a binary encoder. Using nML attribute equations to build an encoder appears straightforward, but the authors seem not to have published a description of how they invert the equations to produce a decoder.

SLED provides a more concise and less error-prone way of specifying binary representations than nML's binary-string attributes. SLED's generating expressions and constructor opcodes make it easy to specify many representations with few integer literals. Using patterns instead of bit strings relieves the specification writer from having to get the fields in the right order, and it helps the toolkit detect missing and duplicate fields. Finally, SLED specifications resemble architecture manuals; nML specifications do not. Our ideas could be exploited in the nML framework by including the pattern sublanguage (tokens, fields, and patterns) in nML and using pattern-valued attributes to specify binary representations. Conversely, nML's ideas could be exploited in our framework by adding nML's register-transfer sublanguage and by permitting users to attach arbitrary attributes to constructors and their operands. We expect that named, pattern-valued attributes would help users describe machines like the 68000 and DSP56000.

The GNU assembler provides assembly and disassembly for many targets, but different techniques are applied ad hoc to support different architectures [Elsner et al. 1993]. For example, Pentium instructions are recognized by hand-written C code, but MIPS instructions are recognized by selecting a mask and a sample from a table, applying the mask to the word in question, then comparing the result against the sample. On both targets, operands are recognized by short programs written for abstract machines, but a different abstract machine is used for each target. Another set of abstract machines is used to encode instructions during assembly. The implementations of the abstract machines contain magic numbers and handwritten bit operations. The programs interpreted by the abstract machines are represented as strings, and they appear to have been written by hand.

Larus and Schnarr [1995] use a machine description related to ours to provide machine-independent primitives that query instructions. The syntactic part of their machine description is derived from a subset of our language having only fields and patterns. They have added semantic information by associating register-transfer semantics with particular pattern names. From this combined syntactic and semantic information, the spawn tool generates classifiers that put instructions into categories like jump, call, store, invalid, etc. It finds the registers that each instruction reads and writes, and it generates C++ code to replicate such computations as finding target addresses. The descriptions used by spawn are both more and less powerful than ours. The semantic information makes it possible to derive a variety of predicates and transformations that are indispensable for instrumenting object code. The limited syntactic specification assumes there is only a single token (the "current instruction"), and it has no notion comparable to constructor, which makes it more difficult to understand how specifications are factored. Finally, spawn descriptions do not support encoding; instrumenters must provide preencoded "snippets" of machine code. The encoding is done by standalone compilers or assemblers, and the snippets are extracted from the resulting object code.

In spirit, our work is like ASN.1 [ISO 1987], which is used to create symbolic descriptions of messages in network protocols, but there are many differences. ASN.1 data can be encoded in more than one way, and in principle, writers of ASN.1 specifications are uninterested in the details of the encoding. ASN.1 encodings are byte-level, not bit-level encodings; ASN.1 contains an "escape hatch" (OCTET STRING) for strings of bytes in which individual bits may represent different values. Finally, ASN.1 is far more complex than our language; for example, it contains constructs that represent structured values like sequences, records, and unions, that describe optional, default, or required elements of messages, and that distinguish between tagged and "implicit" encodings of data.

7. EVALUATION

For code generation in traditional compilers, the toolkit is somewhat less suitable than a vendor's assembler. The toolkit does not easily support standard, machine-dependent formats for relocatable object code, and it does not provide optimizations that vendors may build into assemblers, like MIPS instruction scheduling.

SLED evolved from a simpler language used to recognize RISC instructions in a retargetable debugger [Ramsey 1992, Appendix B]. That language had field constraints and patterns built with conjunction and disjunction, but no concatenation and no constructors. There was no notion of instruction stream; instructions were values that fit in a machine word. We extended that language to specify encoding procedures by writing a constructor name and a list of field operands to be conjoined. This extension sufficed to describe all of the MIPS and most of the SPARC, and we used it to generate encoding procedures for mld. It could not, however, describe all of the SPARC, and it was completely unable to describe the Pentium, even after we added concatenation to the pattern operators. Two changes solved all our problems: making patterns explicit on the right-hand sides of constructor specifications and using constructor types to permit patterns as operands. We then realized there was no reason to restrict constructors to specifying encoding procedures, so we made it possible to apply constructors both in pattern definitions and in matching statements, yielding SLED as described in this article.

Patterns are a simple yet powerful way to describe binary representations. Field constraints, conjunction, and concatenation are all found in architecture manuals, and together they can describe any instruction on any of the four machines we have specified, as well as four other machines whose specifications are incomplete or have been written by our users. Patterns are not limited to traditional instruction sets in which opcode and operand are clearly separated; the machines we have described use instruction formats in which opcode bits are scattered throughout the instruction. Disjunction does not make it possible to specify new instructions, but it improves specifications by making it possible to combine descriptions of related instructions. By removing the need to specify each instruction individually, disjunction eliminates a potential source of error.

Constructor specifications provide clean, abstract representations of instructions and their operands, and they connect these abstractions to binary representations and to assembly language. Equations, though seldom used, are needed to describe instructions like relative branches, whose assembly-level operands differ from their machine-level fields. Equations can also express restrictions on operands, which are part of the definitions of some architectures, like the Intel Pentium.

We maximize SLED's expressive power by minimizing restrictions on the way patterns, constructors, and equations can be combined. For example, patterns and constructors can be used in each other's definitions, which makes it possible to factor complex architectures like the Pentium. Equations in constructor specifications are used for both encoding and decoding, and equations can also be used in matching statements. Because the elements of SLED work together, it is hard to see how the language could be simplified without destroying it. The simplicity of the specifications and the checking done by the toolkit combine to give users confidence in the correctness of the generated code.

8. AVAILABILITY

Version 0.5 of the toolkit implements SLED as described in this article, except that integer operands of constructors are always signed. It is available by anonymous ftp from ftp.cs.princeton.edu in directory pub/toolkit. The toolkit also has a home page at http://www.cs.princeton.edu/software/toolkit.

9. PRODUCTION NOTE

We prepared this article using the noweb tools for literate programming [Ramsey 1994b]. The examples have been extracted from this article and run through the toolkit, and they work with version 0.5.

ACKNOWLEDGEMENTS

The editor and anonymous referees suggested a restructuring that helped improve the article. We are especially grateful for Referee 2's thorough reading and pointers to related work.

1 One can obtain an empty disjunction by conjoining mutually exclusive constraints.

2 This "label" is different from the labels introduced by the L: p construct, although both kinds serve the same function.

REFERENCES

BALL, T. AND LARUS, J. R. 1994. Optimally profiling and tracing programs. ACM Trans. Program. Lang. Syst. 16, 4 (July), 1319-1360.

BELL, C. G. AND NEWELL, A. 1971. Computer Structures: Readings and Examples. McGraw-Hill, New York.

BRAUN, O. C. 1996. Retargetability issues in worst-case timing analysis of embedded systems. Bachelor's thesis, Dept. of Computer Science, Princeton Univ., Princeton, N.J.

CATTELL, R. G. G. 1980. Automatic derivation of code generators from machine descriptions. ACM Trans. Program. Lang. Syst. 2, 2 (Apr.), 173-190.

CMELIK, S. AND KEPPEL, D. 1994. Shade: A fast instruction-set simulator for execution profiling. In Proceedings of the 1994 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. ACM, New York, 128-137.

COOK, T. AND HARCOURT, E. 1994. A functional specification language for instruction set architectures. In Proceedings of the 1994 International Conference on Computer Languages. ACM, New York, 11-19.

DEAN, J., DEFOUW, G., GROVE, D., LITVINOV, V., AND CHAMBERS, C. 1996. Vortex: An optimizing compiler for object-oriented languages. In OOPSLA '96 Conference Proceedings. SIGPLAN Not. 31, 10 (Oct.), 83-100.

ELSNER, D., FENLASON, J., ET AL. 1993. Using as: The GNU Assembler. Free Software Foundation, Cambridge, Mass.

FAUTH, A., PRAET, J. V., AND FREERICKS, M. 1995. Describing instruction set processors using nML. In The European Design and Test Conference. IEEE Computer Society, Washington, D.C., 503-507.

FERGUSON, D. E. 1966. The evolution of the meta-assembly program. Commun. ACM 9, 3, 190-193.

FERNANDEZ, M. F. 1995. Simple and effective link-time optimization of Modula-3 programs. In Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation. SIGPLAN Not. 30, 6 (June), 103-115.

FERNANDEZ, M. F. AND RAMSEY, N. 1997. Automatic checking of instruction specifications. In Proceedings of the 19th International Conference on Software Engineering. ACM, New York, 326-336.

GEORGE, L., GUILLAUME, F., AND REPPY, J. H. 1994. A portable and optimizing back end for the SML/NJ compiler. In the 5th International Conference on Compiler Construction. 83-97.

GRAHAM, S. L., LUCCO, S., AND WAHBE, R. 1995. Adaptable binary programs. In Proceedings of the 1995 USENIX Technical Conference. USENIX Assoc., Berkeley, Calif., 315-325.

GRISWOLD, R. E. AND GRISWOLD, M. T. 1990. The Icon Programming Language. 2nd ed. Prentice-Hall, Englewood Cliffs, N.J.

HASTINGS, R. AND JOYCE, B. 1992. Purify: Fast detection of memory leaks and access errors. In Proceedings of the Winter USENIX Conference. USENIX Assoc., Berkeley, Calif., 125-136.

INTEL CORP. 1993. Architecture and Programming Manual. Intel Corp., Mount Prospect, Ill.

ISO. 1987. Information Processing - Open Systems Interconnection - Specification of Abstract Syntax Notation One (ASN.1). ISO 8824 (CCITT X.208). International Standards Organization, Geneva, Switzerland.

JOHNSON, S.C. 1990. Postloading for fun and profit. In Proceedings of the Winter USENIX Conference. USENIX Assoc., Berkeley, Calif., 325-330.

LARUS, J. R. AND SCHNARR, E. 1995. EEL: Machine-independent executable editing. In Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation. SIGPLAN Not. 30, 6 (June), 291-300.

NELSON, G., Ed. 1991. Systems Programming with Modula-3. Prentice-Hall, Englewood Cliffs, N.J.

RAMSEY, N. 1992. A retargetable debugger. Ph.D. thesis, Dept. of Computer Science, Princeton Univ., Princeton, N.J. Also available as Princeton Univ. Tech. Rep. CS-TR-403-92.

RAMSEY, N. 1994a. Correctness of trap-based breakpoint implementations. In Proceedings of the 21st ACM Symposium on the Principles of Programming Languages. ACM, New York, 15-24.

RAMSEY, N. 1994b. Literate programming simplified. IEEE Softw. 11, 5 (Sept.), 97-105.

RAMSEY, N. 1996a. Relocating machine instructions by currying. In the ACM SIGPLAN '96 Conference on Programming Language Design and Implementation. SIGPLAN Not. 31, 5 (May), 226-236.

RAMSEY, N. 1996b. A simple solver for linear equations containing nonlinear operators. Softw. Pract. Exp. 26, 4 (Apr.), 467-487.

RAMSEY, N. AND FERNANDEZ, M. F. 1994a. New Jersey Machine-Code Toolkit architecture specifications. Tech. Rep. TR-470-94, Dept. of Computer Science, Princeton Univ., Princeton, N.J. Oct. Revised Dec., 1996.

RAMSEY, N. AND FERNANDEZ, M. F. 1994b. New Jersey Machine-Code Toolkit reference manual. Tech. Rep. TR-471-94, Dept. of Computer Science, Princeton Univ., Princeton, N.J. Oct. Revised Dec., 1996.

RAMSEY, N. AND HANSON, D. R. 1992. A retargetable debugger. In the ACM SIGPLAN '92 Conference on Programming Language Design and Implementation. SIGPLAN Not. 27, 7 (July), 22-31.

SPARC INTERNATIONAL. 1992. The SPARC Architecture Manual. Version 8. SPARC International, Englewood Cliffs, N.J.

SRIVASTAVA, A. AND EUSTACE, A. 1994. ATOM: A system for building customized program analysis tools. In Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation. SIGPLAN Not. 29, 6 (June), 196-205.

SRIVASTAVA, A. AND WALL, D. W. 1993. A practical system for intermodule code optimization. J. Program. Lang. 1, 1-18. Also available as WRL Res. Rep. 92/6, Dec. 1992.

SZYMANSKI, T. G. 1978. Assembling code for machines with span-dependent instructions. Commun. ACM 21, 4 (Apr.), 300-308.

WAHBE, R., LUCCO, S., ANDERSON, T. E., AND GRAHAM, S. L. 1993. Efficient software-based fault isolation. In Proceedings of the 14th ACM Symposium on Operating System Principles. ACM, New York, 203-216.

WICK, J. D. 1975. Automatic generation of assemblers. Ph.D. thesis, Yale Univ., New Haven, Conn.

1. INTRODUCTION

A real-time application is characterized by the existence of two competing factors: its functional specification and its temporal requirements. Functional specifications define valid translations from inputs into outputs. As such they are realized by a set of programs, which consume CPU time. Temporal requirements, on the other hand, place upper and lower bounds on the time between occurrences of events [Dasarathy 1985; Jahanian and Mok 1986]. An example is "the robot arm must receive a next-position update every 10 ms." Such a constraint arises from the system's requirements or from a detailed analysis of the application environment. Temporal requirements implicitly limit the time that the system's resources can provide.

Thus, the "art" of real-time development lies in balancing the implementation's resource demands on one hand and its temporal requirements on the other. If the desired balance cannot be achieved, the result is usually a costly process of low-level system tuning, involving expensive hardware monitors (e.g., in-circuit emulators, logic analyzers, etc.), taking careful measurements, and then reordering (or restructuring) various key operations. As a last resort, entire subsystems may have to be redesigned altogether.

In this article we present a static alternative to this process, which is based on two interrelated components: a compiler transformation known as program slicing [Ottenstein and Ottenstein 1984; Venkatesh 1991; Weiser 1984] and priority-based scheduling theory. Surprisingly, while our use of static program slicing often leads to longer execution times - and even higher utilizations - it simultaneously helps achieve real-time correctness and schedulability for the entire system. For this reason we call the transformation real-time task slicing.

1.1 Programming Model

We address the problem of optimizing the most classical type of real-time application - a set of periodic tasks, scheduled to execute on a single CPU. This model can be described by the following terms.

* A task (denoted by the letter τ) is a single-threaded program, which executes with positive periodicity and never terminates. This type of program is the most common building block found in real-time applications. In real-time programming environments (e.g., see the experimental languages in Kligerman and Stoyenko [1986], Lin and Natarajan [1988], Nirkhe [1992], and Wolfe et al. [1991]) a construct such as "every t do B" denotes that the program fragment B is dispatched at times 0, t, 2t, etc.

* A task set (denoted Γ = {τ_1, ..., τ_n}) is a collection of time-multiplexed, periodic tasks, which are dispatched starting at time 0. A real-time scheduler decides when each task gets the CPU. The particular scheduling discipline can either be preemptive or nonpreemptive. Other than that, the tasks are considered mutually independent - although nonblocking communication can often be enforced through the assignment of timing constraints and via copy-in/copy-out operations [Gerber et al. 1995].

This traditional model, while very simple, has been the reference point for most advances in the area of real-time systems. It has been generalized to accommodate distributed systems, network protocols, shared databases [Jeffay 1983; Sha et al. 1986], etc.

1.2 Real-Time Task Slicing

Consider a construct such as "every 10ms do B," which denotes that the block B is dispatched at times 0, 10, 20, 30, etc. In most programming environments for real time (including the above-cited experimental languages), there is also an additional constraint - that the ith task invocation always finishes before period i + 1 starts. We call this a "code-based" semantics, since it mandates that all of the code in B must fit properly within each 10ms time frame. If this fails to occur, even once, then the entire application is considered unstable. Since there is no distinction made between operations within B, all operations are considered time dependent - and all may have an influence on the physical process being controlled.

Our approach is different: to use program annotations denoting "observable operations" within B. Since externally observable events are the only operations having a visible effect on the external world, we ensure only that B's observable events must fit properly within the 10ms time frame. This looser semantics yields an immediate benefit: if necessary, unobservable code can slip past the end of the time frame, and the application will remain safe. As a consequence, we may be able to automatically tune a program by moving some of its unobservable code after the observable operations.

For simplicity, in this article's sequel we consider all "input" and "output" operations to be observable. In practice any relevant instruction can be annotated as triggering an observable event; thus the approach can be extended to most notions of communication. For example, an event can be a message-passing operation, an access to memory-mapped I/O, an instruction that induces side-effects on other tasks, or for that matter, a reference to any designated function call or variable.

Consider the following set of requirements, which are typical of real-time control loops.

(1) Every 25ms, read a new measurement from an external sensor.

(2) Using the new sensor reading and the current state, produce an actuator command and update the state.

(3) Send the actuator its output before taking the next sensor reading.

The program fragment below realizes the specification.

L1: every 25ms {
L2:   input(Sensor, &data);
L3:   cmd = nextCmd(state, data);
L4:   state = nextState(state, data);
L5:   output(Actuator, cmd);
}

Our event-based semantics only requires that the observable operations L2 and L5 execute within every 25ms, while allowing the executions of local computations L3 and L4 to stretch over the 25ms time frame. Unless the program's control and data dependences are violated, this transformation should always be safe. The following code shows the result of such transformation.

{   /* Time-critical code */
L2:   input(Sensor, &data);
L3:   cmd = nextCmd(state, data);
L5:   output(Actuator, cmd);
}

{   /* State update code */
L4:   state = nextState(state, data);
}

The first code fragment is time critical and should be executed within every 25ms time frame. Note that the local computation L3 appears in the time-critical code, since it generates a command that is used by the observable statement L5. The second fragment, on the other hand, can be delayed.

The distinction between observable and unobservable operations becomes apparent after the tuning process starts. Specifically, it enables real-time task slicing to automatically achieve schedulability. And while the transformation is applied to each individual task, its effect is global, on the entire system.

The key idea behind this method is based on a simple fact:

An application's schedulability improves whenever we can increase the deadlines (or periods) of its constituent tasks.

The same effect is achieved by allowing a task to slide past its deadline, while maintaining the original event-based semantics. We can realize this benefit by transforming a task, so that its time-sensitive component always executes within its frame, while its unobservable part can be postponed.

To accomplish this, the transform tool decomposes an unschedulable task into two slices: one that is "time critical" and the other "unobservable." Then it "glues" the unobservable slice to the end of the time-critical slice and substitutes the resulting code for the original task.
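In terms of the earlier example, the spliced result would look roughly as follows; the sketch is written as a C-like fragment, reusing the names from the fragment above, and only the statements up to the output must complete within the 25ms frame.

/* Sketch of the spliced task for the sensor/actuator example; the body is
   dispatched every 25ms by the scheduler. */
void control_task(void)
{
    input(Sensor, &data);           /* L2: observable */
    cmd = nextCmd(state, data);     /* L3: feeds the observable L5 */
    output(Actuator, cmd);          /* L5: observable; must meet the frame */
    state = nextState(state, data); /* L4: unobservable slice, spliced last */
}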

Figure 1 pictorially illustrates the net effect of this transformation. The downward and upward arrows represent the execution of input and output events within the same time frame. In the (k + 1)st period, the entire task execution does not finish within its allowable time frame. However, such an execution is acceptable, since all observable events meet the deadline.

Although task slicing is applied locally, the problem inherently requires complex interaction between a scheduling analyzer and the task slicer. Figure 2 illustrates the relationships between the scheduling analyzer/priority assigner and the real-time task slicer. If the task set is schedulable, the priority algorithm determines a priority assignment for it. If no feasible priority assignment can be found, the scheduling analyzer helps identify the tasks whose transformation will most likely lead to a schedulable system. Then they are fed to the slicer, after which the entire task set is again tested for schedulability.

While our use of the "classical real-time model" [Liu and Layland 1973] allows us to concentrate on the issue of real-time slicing, it does not reduce the utility of our approach. For example, most distributed scheduling algorithms use uniprocessor algorithms at each node(1) - and our method can be adapted there as well.

1.3 Remainder of the Article

The remainder of the article is organized as follows. In Section 2 we survey related techniques in both program analysis and real-time systems. In Section 3 we motivate our transformation algorithm via a high-level characterization of discrete-control loops, and we describe some typical scheduling methods used to dispatch them. In Section 4 we provide a technical treatment of program slicing that forms the crux of our transformation. In Section 5 we give an overview of scheduling analysis for the sliced task model, and in Section 6 we put the analysis method to use - and harness it in a priority-ordering algorithm. The algorithm decides which are the best tasks to get sliced and determines the resulting priority order for the entire task set. We conclude the article in Section 7.

2. RELATED WORK

Our approach is strongly motivated by recent advances in real-time fixed-priority scheduling theory and static compiler transformation techniques. In this section we survey these areas, as well as compiler-based techniques for real-time programming.

2.1 Preemptive Fixed-Priority Scheduling

In any nontrivial real-time system, there are usually many periodic tasks that share the CPU and other resources. The tasks must be scheduled in a way that allows all of them to adhere to their timing constraints. This is often done via a fixed-priority, preemptive scheduler, which is the method we use in the article. A priority-based scheduler is quite simple to implement in most kernels, and it typically requires little if any extra hardware support. Also, these algorithms are often amenable to analysis before runtime - a key ingredient in our approach.

A fixed-priority scheduler usually consists of offline and online components. The role of the offline component is to determine whether a task set is schedulable and, if so, to assign each task a single static priority. Then, starting at time 0, the online component repeatedly invokes each task at its specified rate, and it ensures that the CPU is always used by the highest-priority task that requires it. If necessary, this invariant is maintained by preempting a currently running, lower-priority task, which is then resumed later. We elaborate on the details of this paradigm in Section 3.2.

Once tasks have been assigned priorities, it is easy to analyze their behavior under this type of scheduling scheme. One can just run a simulation of a timeline starting at 0 and assume that tasks always require their worst-case execution time. The simulation ends when the timeline reaches the least-common multiple of the task periods. If all task instances are able to finish within their time frames, then the priority assignment is considered safe.

Analytical results in scheduling theory attempt to capture the essence of this simulation method, without actually having to carry it out. This provides a quick way to test whether a given priority assignment will work; hence, it can also guide assigning the priorities in the first place.
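One standard test of this kind (not necessarily the formulation used later in this article) is the fixed-priority response-time recurrence: a task's worst-case response time is its own execution time plus the interference from all higher-priority tasks, iterated to a fixed point and compared against the deadline. A hedged C sketch follows, assuming integer time units, deadlines no larger than periods, and tasks indexed in decreasing priority.

/* Sketch of a response-time test: task i is schedulable if the recurrence
   R = C[i] + sum over higher-priority j of ceil(R/T[j]) * C[j] converges
   to a value no larger than D[i].  A smaller index means higher priority. */
static int schedulable(int n, const long C[], const long T[], const long D[])
{
    for (int i = 0; i < n; i++) {
        long R = C[i], prev = -1;
        while (R != prev && R <= D[i]) {
            prev = R;
            R = C[i];
            for (int j = 0; j < i; j++)                 /* higher-priority tasks */
                R += ((prev + T[j] - 1) / T[j]) * C[j]; /* ceil(prev/T[j]) */
        }
        if (R > D[i])
            return 0;                                   /* task i misses its deadline */
    }
    return 1;
}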

Rate-monotonic scheduling, originally developed by Liu and Layland, was the first well-known result of this kind. In their seminal paper, Liu and Layland [1973] proposed a scheme in which tasks are assigned priorities in inverse order to their periods (hence the name rate-monotonic scheduling, or RMS). They also showed that such priority assignment is optimal, in the sense that whenever it fails to find a feasible priority ordering, so will any other static priority ordering.

Recent research has generalized this result. Leung and Merill [1980] showed that a deadline-monotonic priority assignment is also optimal where deadlines may be shorter than periods. (This case arises where a task is invoked periodically, but where it also possesses an "internal deadline" within its period. The deadline is measured relative to its release time.) Sha et al. [1990] presented two protocols which enable tasks to interact via shared resources, while still guaranteeing the tasks' deadlines. Resource sharing may cause a high-priority task to be blocked, while a lower-priority task uses a critical section. (This is a form of "priority inversion.") These protocols allow one to predict the worst-case blocking times in such situations, which can then be factored into the scheduling analysis.

Recently, a group of researchers at the University of York developed a set of analytical techniques which can provide schedulability tests for broad classes of tasks, including those whose deadlines are greater than their periods [Burns 1994; Tindell et al. 1994]. (Note that the deadline and the period of a task are two distinct timing constraints, e.g., the period denotes the spacing between task instances, whereas the deadline denotes the permissible input-to-output propagation delay for each instance.)

But a scheduling analyzer can do little when it determines that a task set is not schedulable. Also, it can rarely help a programmer tune the system, which is the inevitable next step. This is because the "task" is considered simply as an uninterpreted block of execution time with some timing constraints, but no other semantics to speak of. Our approach is different: to "open up" the task to consider its event-based semantics and then convert an unschedulable application into a schedulable one.

2.2 Program Slicing

We have found program slicing a valuable tool for the task transformation problem at hand. Weiser [1979] first formulated the definition of program slicing and presented two of its potential applications in Weiser [1984]: program debugging and parallelizing execution [Weiser 1983]. Slicing isolates the statements which (may) affect a value computed at a specific program point - exactly the type of exercise which one often pursues in program debugging. Slicing can isolate the flows of control leading to a suspected bug; thus it can help a programmer track down the bug's potential sources in the code.

Weiser also proposed slicing as a tool to help convert single-threaded programs into parallel ones. The idea is to independently execute several slices in parallel - each corresponding to different parts of the program - and then to dynamically correlate their outputs in a way that preserves the original semantics [Weiser 1983]. Our task transformation is somewhat influenced by this approach, and it shares the essential code transformation strategy: slicing and then splicing two parts of a program together. But thanks to the sequential nature of our real-time tasks, we only need to statically splice the sliced program parts to solve our problem. Observe that we do not have to dynamically reconstruct the outputs of the transformed tasks, since all output-generating statements are in one slice, and local computations are in the other. Since our transformed tasks run sequentially and do not execute any code speculatively, all slicing and splicing decisions are carried out statically, before the application is run. Also, rather than producing multiple slices based on several criteria, we generate only two slices per task, and we statically splice them by sequentially composing one after another. This type of static transformation allows one to statically bound the worst-case execution time and to achieve the kind of timing determinism that is essential in a hard real-time system. However, static splicing comes at some cost. Since the transformation changes the statement ordering, our slices must take into account output and antidependences, as well as flow dependences.

Since the introduction by Weiser, program slicing has been used in a wide spectrum of applications and has appeared in several variations. In the taxonomy proposed by Venkatesh [1991], slicing may assume eight possible forms - a number computed by altering three two-valued parameters. Although the taxonomy does not cover every form of program slicing, it provides a means to classify most of the conventional program-slicing techniques. First, a slicing algorithm is either static or dynamic. A static slicer is run before a program is executed, and it makes no assumptions about input values. A dynamic slicer produces a slice based on a specific trace, corresponding to a given input state. Next, a slice is either backward or forward. A backward slice is produced by traversing control and data flows backward, from the last program position to the first. (A forward traversal produces a forward slice). Finally, an executable slice is an executable subset of the original program; a nonexecutable subset is called a closure. Our algorithm produces static, backward slices, which are indeed executable.

Static slicing requires a convenient way of storing and then accessing a program's control and data dependences. For this purpose we use a data structure called a program dependence graph (PDG), which represents all essential dependences between statements in a single graph. Since Ottenstein and Ottenstein [1984] first proposed to use PDGs to help compute program slices, PDG-based slicing algorithms have been popular, mainly because the PDG converts slicing into a simple graph walk problem. This convenience is even more manifest in the case of real-time task slicing, where we have to include all three types of data dependences in a single PDG (and not just flow dependences). Moreover, our PDGs are representations of the programs themselves, in that they maintain a one-to-one correspondence between a node and a source line.
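To make the graph-walk view concrete, here is a hedged C sketch of a backward slice computed as reverse reachability over a PDG stored as predecessor lists; the representation is invented for this illustration, and a real slicer must distinguish the dependence kinds discussed above.

/* Sketch of backward slicing as a graph walk: the slice for a criterion
   node is everything reachable by following dependence edges backward. */
#define MAX_NODES 1024
#define MAX_PREDS 8

struct pdg {
    int nnodes;
    int npred[MAX_NODES];
    int pred[MAX_NODES][MAX_PREDS];  /* control, flow, output, antidependences */
};

static void mark_backward(const struct pdg *g, int node, char in_slice[])
{
    if (in_slice[node])
        return;
    in_slice[node] = 1;
    for (int k = 0; k < g->npred[node]; k++)
        mark_backward(g, g->pred[node][k], in_slice);
}

/* After the walk, in_slice[] holds the static, backward, executable slice
   for the statement used as the slicing criterion. */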

Other algorithms compute program slices using different techniques, such as dataflow equations (e.g., Weiser [1984], etc.). Tip [1995] presents a comprehensive survey on program slicing and covers a variety of issues, including interprocedural slicing, slicing in the presence of gotos, and parametric slicing. But in this article we restrict ourselves to a simple PDG-based slicing algorithm and concentrate only on real-time issues.

2.3 Compiler-Based Techniques for Real-Time Programming

There has been a variety of work within the category of code transformation for real-time programming. These approaches, while addressing different problems associated with real-time programming, share the goal of enhancing the predictability and schedulability of programs.

In Gopinath and Gupta [1990] a compiler tool classifies an application program on the basis of its predictability and monotonicity and creates partitions which have a higher degree of adaptability. Specifically, the tool determines which of four classes a piece of code belongs to; based on this classification, programs are rearranged to help support adaptive run-time scheduling. The objective is to produce a transformed program possessing a smaller variance in its execution time. In Nirkhe [1992] a partial evaluator is applied to a source program, which produces residual code that is both more optimized and more deterministic. In Younis et al. [1994] an approach to speculative execution is proposed for distributed real-time systems. The idea is that speculative "shadow threads" are forked off to execute on available resources. This type of transformation can be viewed as an application of Weiser's slicing-and-splicing technique [Weiser 1983], as described above.

Finally, Gerber and Hong [1993; 1995] show how to use event-based semantics to help correct faults within a single task. (Such a fault can occur when the task's execution time stretches past its specified deadline.) The transformation algorithm corrects such faults using a variant of trace scheduling [Fisher 1981], in which worst-case paths of the infeasible task are selected, and then unobservable code is moved to shorten their execution time. However, the approach in Hong and Gerber [1993] and Gerber and Hong [1995] is limited to intratask code scheduling and does not address the global real-time task-scheduling issues treated in this article.

In fact, real-time task slicing differs from all of the above-mentioned approaches, in that it quantitatively addresses the most fundamental and classical problem in real-time systems - that of real-time scheduling. Its transformations are guided by an analytic measure of schedulability, which is used as its primary optimization criterion. All of the above approaches may contribute to achieving schedulability, and some are complementary to ours. For example, the partial evaluator in Nirkhe [1992] will propagate compile-time values through the source, hopefully producing a residual which takes less time. And our prior work on correcting timing faults within single tasks [Gerber and Hong 1995; Hong and Gerber 1993] is certainly a precondition to achieving global schedulability. But to our knowledge, real-time task slicing is the first method which works hand-in-hand with a scheduling analyzer, with the goal of achieving deterministic, global schedulability of an entire task set.

2.4 Timing Analysis

Timing analysis is a key step in building a real-time application; this is especially true in a hard real-time system, in which all deadlines must be met. Many analysis techniques have been proposed, ranging from static, source-based methods, to profilers and testing tools, to combinations thereof. While profiling usually produces the tightest results, it presupposes a completely developed system - as well as a test suite that achieves pessimistic worst-case coverage. Static, compiler-based analysis can be used much earlier in the design cycle, and it can usually yield conservative worst-case coverage. But this is also its downside: the result can be a "worst of all worst cases," i.e., an experimentally unachievable measurement. Yet static analysis is developing at a rapid pace, and tools are now being produced which can yield tighter results.

The technique reported in Park and Shaw [1990] is based on a simple source-level timing schema, and it is fairly straightforward to implement in a tool. In Harmon et al. [1992] another approach for more accurate timing is proposed; the resulting tool was able to analyze microinstruction streams using machine description rules, and thus it was retargetable to various architectures. On the other hand, neither approach addresses the problem of predicting architecture-specific timing behavior due to the various latencies inherent in memory hierarchies and pipelines.

New results have begun to account for this timing variance. Zhang et al. [1993] presented a timing analyzer based on a mathematical model of the pipelined Intel 80C188 processor. This analysis method is able to take into account the overlap between instruction execution and fetching, which is an improvement over schemes where instruction executions are treated individually. Arnold et al. [1994] presented a timing prediction method called static cache simulation to statically analyze memory and cache reference patterns. Lim et al. [1994] described a timing analysis tool based on a model called the extended timing schema. This approach essentially relies on attribute grammars [Aho et al. 1986] to backward-propagate cache and pipeline information in the program's syntax tree. Using this technique, the tool rules out provably ineligible execution paths from a task, and this streamlines the process of finding its real worst-case paths - and hence, bounding its execution time.

However, no static timing tool is precise enough to be used with complete confidence for developing production-quality software. Moreover, even sophisticated timing analysis methods such as Lim et al. [1994] and Arnold et al. [1994] are not appropriate for fine-grained instruction timing. In Section 4.2 we explain how one can use these tools in spite of their limitations, by taking advantage of software profiling as well as static timing prediction. Specifically, our slicing technique does not require a static analyzer: it can be used first to transform the program, with the timing analysis carried out later by a run-time profiler.

3. OVERVIEW OF THE APPROACH

As we have discussed, "schedulability" is, by definition, a key metric that drives our real-time task slicing, and thus it requires cooperation between the real-time scheduler and the compiler tool. But this presents two competing demands: (1) it is desirable to maintain the traditional separation of concerns between compilation and scheduling and (2) schedulability depends on complex task interactions that are often exposed at runtime.

In this section we show how real-time task slicing satisfies these demands. In doing so, we discuss the characteristics of a sample target domain - discrete control applications. Since discrete control software possesses many representative properties that can be found in other applications (e.g., multimedia, vision, etc.), this discussion has close analogues in other types of real-time systems.

3.1 Characterization of Discrete Control Software

Many discrete control algorithms possess computations that fit a fixed-rate algorithm paradigm [Krause 1991], i.e., control-loops which execute repetitively with fixed periods. During each period, the physical-world measurement data are sampled, and then actuator commands are computed. Meanwhile, a set of states is updated based on the current state and the sampled data.

The dynamic behavior of a first-order discrete control system can be expressed by the following equations:

[Output.sub.k] = g([State.sub.k], [Input.sub.k])

[State.sub.k+1] = h([State.sub.k], [Input.sub.k])

In these equations, [Input.sub.k], [State.sub.k], and [Output.sub.k] respectively represent the input, current state, and output of the kth period; g is an output generation function, and h is a state evolution function.

Control equations are thought of as simultaneous relationships (and not as a computation procedure); thus the functions g and h can be implemented in a variety of different ways. The usual practice is to choose a single ordering and then to code it up as a cyclic control-loop. The actual loop structure is driven by one's personal programming style, or perhaps the availability of generic code modules. But regardless of the choice (unless the underlying control laws are stateless), g and h mandate key precedence constraints, denoted by "[predecessor]":

[Input.sub.k] [predecessor] [Output.sub.k]

[State.sub.k] [predecessor] [Output.sub.k]

[Input.sub.k] [predecessor] [State.sub.k+1]

[State.sub.k] [predecessor] [State.sub.k+1]

The typical way to enforce these constraints is to use the "code-based" semantics and ensure that each iteration of the control-loop completes by the end of its period. This means that the (k + 1)st iteration starts only after the kth iteration ends. Figure 3 illustrates the effect, while the kth iteration is "blown up" in Figure 4.
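To make the code-based structure concrete, the following is a minimal C sketch of such a cyclic control-loop. Every name in it (read_sensors, compute_output, update_state, write_actuators, wait_for_next_period, PERIOD_MS) is an illustrative assumption of ours, not code taken from the article's figures.

#define PERIOD_MS 10                                      /* assumed period T */

extern void   read_sensors(double *input);                /* sample Input_k */
extern double compute_output(double state, double input); /* Output_k = g(State_k, Input_k) */
extern double update_state(double state, double input);   /* State_{k+1} = h(State_k, Input_k) */
extern void   write_actuators(double output);
extern void   wait_for_next_period(int period_ms);

void control_task(void)
{
    double state = 0.0;                     /* State_0 */
    for (;;) {                              /* one iteration per period */
        double input, output;
        read_sensors(&input);               /* Input_k precedes Output_k and State_{k+1} */
        output = compute_output(state, input);
        write_actuators(output);            /* observable output within the period */
        state = update_state(state, input);
        wait_for_next_period(PERIOD_MS);    /* (k+1)st iteration starts after the kth ends */
    }
}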

3.2 Fixed-Priority Preemptive Scheduling

A real-time system typically consists of more than a single control loop, and hence, the CPU has to be multiplexed so that all loops meet their timing constraints. This is the role of the real-time scheduler.

In this article we assume a fixed-priority, preemptive-scheduling model, which can easily accommodate a set of periodic loops. A fixed-priority scheme has several advantages over other real-time scheduling methods. Its runtime mechanism is simple and efficient, and it can be implemented in many off-the-shelf operating systems. Also, its associated runtime queues are quite small (i.e., a maximum of one slot for each task). On the other hand, calendar-based scheduling [Xu and Parnas 1991] may require storing large task schedules (i.e., a maximum of one slot for each instance of each task invocation, up to the least-common multiple of all periods). Finally, fixed-priority algorithms lend themselves to relatively fast analytical methods for offline schedulability tests.

The following scenario gets played out at runtime. Starting at time 0, each task is invoked at its specified rate. The online dispatcher makes sure that the highest-priority active task always has use of the CPU; if necessary, this is achieved by preempting a lower-priority task. As depicted in Figure 5, this scheme can easily be implemented by using two kernel-level queues: a run queue and a delay queue [Burns et al. 1995; Katcher et al. 1993]. The run queue is sorted by priority, and it holds tasks which have been invoked to execute, but are not yet finished - where the head of the queue contains the currently running task. The delay queue holds tasks waiting to be released, and they are ordered by start time. Thus, when the dispatcher releases a task, it simply moves it from the delay queue into the run queue. If the task's place is at the head of the queue, then it preempts the currently running task, which gets resumed later. This type of dispatching is supported with the aid of a hardware interval timer (which gets handled by an interrupt service routine).
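The following fragment sketches, in C, one way the two-queue dispatcher described above might be organized; the structure layout and the function name are illustrative assumptions of ours, not an actual kernel interface.

#include <stddef.h>

struct task {
    int          priority;      /* smaller number = higher priority */
    long         next_release;  /* absolute time of the next invocation */
    struct task *next;
};

static struct task *run_queue;   /* sorted by priority; head is the running task */
static struct task *delay_queue; /* sorted by next_release */

/* Move every task whose release time has arrived from the delay queue into the
 * run queue; the interval-timer interrupt handler would call this at time "now".
 * A task inserted ahead of the old head preempts the currently running task. */
static void release_tasks(long now)
{
    while (delay_queue != NULL && delay_queue->next_release <= now) {
        struct task *t = delay_queue;
        delay_queue = t->next;

        struct task **p = &run_queue;                /* insert in priority order; */
        while (*p != NULL && (*p)->priority <= t->priority)
            p = &(*p)->next;                         /* ties go behind earlier tasks */
        t->next = *p;
        *p = t;
    }
}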

The offline analyzer's job is to assign a priority to each task and to check whether the task set can be scheduled. The problem components are as follows:

* [Gamma] = {[[Tau].sub.1], [[Tau].sub.2], ..., [[Tau].sub.n]}, the set of n tasks to be scheduled.

* [T.sub.i] denotes the period of task [[Tau].sub.i].

* [c.sub.i] denotes the worst-case computation time of [[Tau].sub.i]'s single invocation, which does not include the time [[Tau].sub.i] is blocked.

While task [[Tau].sub.i] does not always realize its worst-case execution time, when analyzing hard real-time tasks we have to assume that it does. Our objective is to provide a guarantee based on worst-case (rather than average-case) behavior.

In this article we make two additional abstractions, which simplify the analysis of the online dispatcher's operation. First, we assume that the runtime overhead induced by the dispatcher itself is zero. (Usually context switching and clock interrupt-handling latencies inject some penalty.) Second, we assume a task gets released into the system as soon as its period starts (and that it does not suffer the delay imposed by the kernel's reaction time). There are some simple engineering solutions which can be used to avoid these two simplifications, and they explicitly incorporate kernel overhead into the offline scheduling analysis [Burns et al. 1995; Katcher et al. 1993].

Given our problem components, assigning a static priority ordering to the tasks is easy - Liu and Layland's [1973] rate-monotonic result tells us that priorities should be assigned in rate order, with the highest-rate task getting the highest priority, and so on. This assignment is optimal, i.e., whenever the tasks can be scheduled with some other priority ordering, the rate-monotonic assignment would have worked, too.

Once the priorities are assigned, some simple analysis can verify whether the task set is schedulable. Static techniques attempt to analytically capture a worst-case simulated timeline - i.e., the time during which the system's runtime utilization will be the greatest. For the scheduling algorithms that we examine in this article, this "worst-case duration" starts when all of the tasks are released simultaneously. If a task can meet its deadline during this period, it will succeed in all other task phasings, too. (This fact is used in the proof of Liu and Layland's [1973] result; it has also led to necessary and sufficient tests for general static-priority schedulers [Lehoczky et al. 1989; Tindell et al. 1994].)
Table I. An Example Task Set With Three Tasks (in units of 10[[micro]seconds])

Task            Execution Time      Period
[[Tau].sub.1]   [c.sub.1] = 400     [T.sub.1] = 1000
[[Tau].sub.2]   [c.sub.2] = 400     [T.sub.2] = 1600
[[Tau].sub.3]   [c.sub.3] = 570     [T.sub.3] = 2500

Using this fact, we can determine the schedulability of the entire task set by incrementally checking whether each task is schedulable. We check [[Tau].sub.i]'s schedulability as follows: (1) we set an imaginary clock to time 0; (2) we assume that [[Tau].sub.i] gets released with all of the system's higher-priority tasks at time 0; and then (3) we calculate the time it takes for [[Tau].sub.i] to finish under these circumstances. This worst-case finish time is called [[Tau].sub.i]'s response time.

Since [[Tau].sub.i] can run only when it has the highest priority, its response time will consist of two factors: its execution time [c.sub.i] and the time during which [[Tau].sub.i] is preempted by the higher-priority tasks. This second factor is known as interference.

If we can confirm that the maximum response time of [[Tau].sub.i] is no greater than [T.sub.i], we can guarantee that [[Tau].sub.i] will meet its deadline even in the worst case. We first consider the typical "code-based" case where a task must finish within its period.

Let hp(i) denote the set of tasks which have higher priority than [[Tau].sub.i]. Then [R.sub.i], denoting the maximum response time of task [[Tau].sub.i], is computed as follows:

R_i = c_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i}{T_j} \right\rceil c_j    (1)

The second term of Eq. (1) is [[Tau].sub.i]'s interference - i.e., the time during which it gets preempted by the higher-priority tasks. The ceiling [[R.sub.i]/[T.sub.j]] denotes the maximum number of [[Tau].sub.j] instances executed during the interval [R.sub.i]; thus [[R.sub.i]/[T.sub.j]] [c.sub.j] is the maximum interference due to [[Tau].sub.j]. As Eq. (1) is a recurrence equation on [R.sub.i], a simple iterative algorithm can compute [R.sub.i] by initially assigning it [c.sub.i] and then generating new values until it converges to a fixed point (or exceeds [T.sub.i]).(2)
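The fixed-point computation of Eq. (1) is easy to code directly. Below is a minimal C sketch that applies it to the task set of Table I; the function and array names are our own, and the deadline is taken to be the period, as in the code-based case above.

#include <stdio.h>

/* Response-time test of Eq. (1): R_i = c_i + sum over j in hp(i) of
 * ceil(R_i / T_j) * c_j.  c[] and T[] are in priority order (index 0 is the
 * highest priority); returns the converged response time of task i, or -1
 * if it exceeds the period T[i]. */
static long response_time(int i, const long c[], const long T[])
{
    long R = c[i];                                  /* start the iteration at c_i */
    for (;;) {
        long next = c[i];
        for (int j = 0; j < i; j++)                 /* hp(i) = tasks 0..i-1 */
            next += ((R + T[j] - 1) / T[j]) * c[j]; /* ceil(R/T_j) * c_j */
        if (next > T[i]) return -1;                 /* deadline (= period) missed */
        if (next == R) return R;                    /* fixed point reached */
        R = next;
    }
}

int main(void)
{
    long c[] = { 400, 400, 570 };                   /* Table I, in units of 10 microseconds */
    long T[] = { 1000, 1600, 2500 };
    for (int i = 0; i < 3; i++)
        printf("R_%d = %ld\n", i + 1, response_time(i, c, T));
    return 0;
}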

Consider the case of the three periodic tasks in Table I to see how the schedulability analysis works. We assume the source of task [[Tau].sub.2] is given in Figure 6. Since the periods are equal to the deadlines, rate-monotonic priority assignment is a natural choice. In Table I the row order corresponds to the priority order - i.e., [[Tau].sub.1] is assigned the highest priority. We can carry out the response time analysis for these tasks using Eq. (1) as follows:

For [[Tau].sub.1]: [R.sub.1] = 400 [less than] [T.sub.1] = 1000

For [[Tau].sub.2]: [R.sub.2] = 400 + 400 = 800 [less than] [T.sub.2] = 1600

For [[Tau].sub.3]: [R.sub.3] = 570 + 3 [center dot] 400 + 2 [center dot] 400 = 2570 [greater than] [T.sub.3] = 2500

Observe that the two high-priority tasks [[Tau].sub.1] and [[Tau].sub.2] are schedulable, while [[Tau].sub.3] is not. ([R.sub.3] is greater than [T.sub.3] when [[Tau].sub.3] runs at priority 3.) The simulated timeline in Figure 7 pictorially illustrates what the analysis above tells us - that when all three tasks get released simultaneously, [[Tau].sub.1] and [[Tau].sub.2] repeatedly preempt [[Tau].sub.3], which ends up missing its deadline of 2500. And since the rate-monotonic assignment is optimal, no other fixed-priority assignment will suffice either - unless the code-based semantics is abandoned!

3.3 Scheduling with Compiler Transformations

When a system is found to be unschedulable, current engineering practice forces programmers to manually pick some critical tasks from the task set and then to hand-optimize them. Such system tuning is often repeated many times, until the entire task set finally achieves schedulability. We aim to ease this process by providing a semiautomatic task transformation method: real-time task slicing. Real-time task slicing is based on the following simple observation:

Traditional scheduling techniques implicitly assume that each control-loop properly executes within its time frame. But the event-based semantics mandates only that the observable event-generating operations get finished within the originally specified deadline.

This observation leads us to the following method. We decompose a task [Tau] into two fragments - one containing all observable event operations and the other containing all remaining local operations. We call the former the IO-handler and the latter the state-update component, and we denote them [[Tau].sup.IO] and [[Tau].sup.State], respectively. Figure 8 demonstrates the decomposition of the control-loop task originally shown in Figure 4.

After the decomposition, we ensure that the IO-handler will execute within its allowable time frame. On the other hand, we may occasionally have to postpone the execution of the state-update part. Finally, we maintain precedence constraints between [[Tau].sup.IO] and [[Tau].sup.State], which are originally induced by the task's data and control dependences.

The task decomposition itself is carried out by static program slicing. As we stressed above, we put the greatest emphasis on preserving the timing behavior of observable events and the precedence constraints derived in Section 3.1.

Before systematically presenting our slicing procedure, we show its ultimate effect on our example task set. Assume that slicing [[Tau].sub.2] yields the greatest benefit in schedulability. We then decompose [[Tau].sub.2]'s code into its IO component [Mathematical Expression Omitted] and its state-update part [Mathematical Expression Omitted], as shown in Figure 9. (While both slices are individually executable, note that both are dependent on each other - hence, neither can produce a desired output or state in isolation.) Their computation times are separately calculated as follows:

[Mathematical Expression Omitted]

where the time unit is 10[[micro]seconds]. Note that the sum of the two execution times is slightly greater than the original execution time 400 of [[Tau].sub.2]. This is due to replicated code, additional register loads, etc. Also note that the conditional statement "if (!null(data))" is preprocessed by a source-level transformer; it is split into a store of its outcome (in the variable "c_1") and subsequent tests of "c_1" to control the associated conditionals. We revisit this issue shortly in Section 4.

To enforce the precedence constraints, we splice the two slices together via sequential composition. The net result is the following execution behavior:

[Mathematical Expression Omitted]

Our runtime support must guarantee exactly one instance of [Mathematical Expression Omitted] within each [T.sub.2] time frame. On the other hand, it can let iterations of [Mathematical Expression Omitted] slide between period boundaries.

4. AUTOMATIC TASK DECOMPOSITION BY PROGRAM SLICING

Our transformation slices a task into two components: one containing all observable events and the other containing only local state-update operations. This easily stated objective translates into a highly difficult compiler problem - in part due to a program's intertwined threads of control, nested control structures, complex data dependences among statements, etc. To deal with such problems in a systematic manner, we adopt a well-known compiler technique called program slicing.

Slicing has greatly evolved since its birth in Weiser [1984], and many newer algorithms effectively address the problems mentioned above. We have chosen a slicing algorithm based on a formalism known as the program dependence graph [Ferrante and Ottenstein 1987; Horwitz et al. 1990; Ottenstein and Ottenstein 1984]. A program dependence graph is a suitable data structure for slicing, mainly because it integrates both data and control flows into a single graph, and thus reduces program slicing into a graph walk problem [Ottenstein and Ottenstein 1984].

To concentrate on the issue at hand - i.e., real time - in this article, we make the following simplifying assumptions on the underlying program structure:

* Each conditional contains only a simple boolean variable as its guard.

* Unrestricted jump statements such as goto and break are not present in a task's code.

The first assumption is easily satisfied by a simple source-level code transformation on conditionals (if-then-else), as follows:

(1) Predicate "P" in a conditional "if (P)" is taken out and is modified into an assignment on a temporary boolean variable such as "c_i = P."

(2) The assignment is plugged in right before the conditional.

(3) The original conditional is changed to "if (c_i)."

This transformation allows the resultant code to evaluate a conditional guard only once, and if necessary, to reuse its result. This allows us to repeat conditional structures in the spliced code (as in Figure 9) and to break the antidependences that could potentially inhibit this transformation. (If guards were not stored, then a conditional structure such as "if (P) ..." would often not be a candidate for slicing and splicing, since any variable used in P could easily get redefined before the second test of P.)
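A small C illustration of steps (1)-(3), using the guard from Figure 6 (the helper declarations are assumptions added only to make the fragment self-contained):

#include <stdbool.h>

extern bool null(const char *d);        /* guard predicate from the example task */
extern void body(const char *d);        /* stands for the guarded statements */

void original(const char *data)
{
    if (!null(data))                    /* guard evaluated inside the conditional */
        body(data);
}

void transformed(const char *data)
{
    bool c_1 = !null(data);             /* steps (1)-(2): store the guard's outcome */
    if (c_1)                            /* step (3): test the stored value */
        body(data);
    /* c_1 can now be retested in the other slice without re-evaluating the
     * guard, even if variables used by the guard have since been redefined. */
}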

The second assumption ensures that the program is written in a well-structured form, i.e., using assignment statements, if-then-else, while loops, compound statements, etc., but not gotos or breaks. As a consequence we will avoid discussing the complications associated with slicing in the presence of unrestricted control-flow. Also, our definition of control dependence can be made simpler than that found in Ferrante and Ottenstein [1987]. The price we may pay, however, is a lack of generality. But since many real-time programming languages allow only "structured" programs without unrestricted gotos, this does not impose serious restrictions on our approach.

4.1 The Program-Slicing Algorithm

We are concerned with executable, backward, static program slicing, i.e., (1) each slice is executable as an individual program and (2) slices are computed according to statically available, backward data and control-flow information.

Conceptually, a backward slice of a program P, with respect to program point p and expression e, consists of P's statements and control predicates that may affect the value of e at point p. A pair <p, e> is often called a slicing criterion, and its associated slice is denoted as P/<p, e>. The idea is that to determine the value of e at p, we need only execute the slice P/<p, e>.

However, a slice computed in our approach may include more code than this minimum amount. Recall our periodic controller task [[Tau].sub.2] of Figure 6. The following fragment is the slice [[Tau].sub.2]/<L10, cmd>, which ends up being [Mathematical Expression Omitted].

L1:        input(Sensor, &data);
L2[prime]: c_1 = !null(data);
L2:        if (c_1) {
L3:          t1 = F1(state);
L4:          t2 = F2(state, t5);
L6:          t4 = F4(data);
L7:          t5 = F5(t1, data);
L9:          cmd = F7(t1, t4, t5);
L10:         output(Actuator, cmd);
           }

Statements L1, L3, L6, L7, and L9 are included in the slice, because variable "cmd" depends on their computations (this is called flow dependence). On the other hand, statement L4 is included, though the value computed by L4 does not affect the value of cmd on L10. The reason is that if it were placed in the second slice ([Mathematical Expression Omitted]), t5 of L4 would get an incorrect value from L7 (this is called antidependence). Also, statement L10 is included because it generates an observable event. Finally, the predicate on line L2 and the assignment L2[prime] are included, because the execution of statements L3, L6, L7, L9, and L10 (hence the value of "cmd") depends on the boolean outcome of the predicate (this is called control dependence).

Thus, unlike many other slicing algorithms, we need to consider all three kinds of data dependences - flow, anti-, and output dependences. This can be done in a straightforward manner by extending the definition of the program dependence graph accordingly, as shown below.

Definition 4.1.1 (EPDG). The extended program dependence graph is EPDG = (V, E), where we have the following:

* The vertices V represent the task's statements, i.e., assignments, control predicates, and observable statements (such as output and input). In addition there is a distinguished vertex "entry," which represents the root of the task.

* The edges E are of two sorts. An edge [Mathematical Expression Omitted] denotes a control dependence between [n.sub.1] and [n.sub.2]. That is, either (1) [n.sub.1] is an entry vertex, and [n.sub.2] is not nested within any loop or conditional, or (2) [n.sub.1] represents a control predicate, and [n.sub.2] is immediately nested within the loop or conditional whose predicate is represented by [n.sub.1]. An edge [Mathematical Expression Omitted] denotes either a flow, output, or antidependence. That is, control can reach [n.sub.2] after [n.sub.1] via a path along which there is no definition of variable v and

(1) [n.sub.1] defines v, and [n.sub.2] uses v (flow dependence);

(2) [n.sub.2] defines v, and [n.sub.1] uses v (antidependence); or

(3) both [n.sub.1] and [n.sub.2] define v (output dependence).

For simplicity, the edge labels are omitted in subsequent figures.

We define "p [[implies].sub.*] q" to mean that node p can reach node q via zero or more control dependence edges or data dependence edges.

The program dependence graph of [[Tau].sub.2]'s code of Figure 6 is shown in Figure 10. In this particular example, there are no instances of an output dependence. Note that we do not consider loop-carried dependences induced by the task's periodic structure, since we never interleave two successive task instances.

The slice of program P with respect to program point p and expression e (i.e., P/<p, e>) can be obtained through a traversal of P's program dependence graph. Since p denotes a vertex of the EPDG, associating a vertex in the EPDG with a single source line greatly simplifies the process of isolating a slice from a program. We preprocess the source program so that each assignment statement and control predicate lies on a separate textual line.

Thus in our problem domain, the slicing criterion <p, e> need not include the expression e itself, since e is the entire right-hand side in p.(3) It follows that, for us, the program point <p> alone can serve as a slicing criterion.

We can extend the definition of a program slice for a set of slicing criteria C in a natural way and define P/C = [[union].sub.<p>[element of] C] P/<p>. A simple algorithm to compute a slice is given below.

ALGORITHM 4.1.2. Computes the slice P/<n>:

Step 1 Compute the slice by a backward traversal of EPDG such that

P/<n> = {m [where] m [[implies].sub.*] n}.

Step 2 Mark all nodes in P/<n>.
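Algorithm 4.1.2 amounts to a backward reachability walk over the EPDG. The following C sketch shows one way to implement it; the array-based graph representation and its bounds are assumptions of ours, chosen only to keep the example self-contained.

#include <stdbool.h>

#define MAXNODE 64
#define MAXPRED 16

/* pred[v] lists the nodes with a control- or data-dependence edge into v. */
static int  npred[MAXNODE];
static int  pred[MAXNODE][MAXPRED];
static bool in_slice[MAXNODE];          /* marked <=> member of P/<n> */

/* Mark every node m with m ==>* n, i.e., the slice P/<n>. */
static void mark_slice(int n)
{
    if (in_slice[n]) return;            /* already visited */
    in_slice[n] = true;                 /* Step 2: mark the node */
    for (int k = 0; k < npred[n]; k++)
        mark_slice(pred[n][k]);         /* Step 1: follow dependence edges backward */
}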

Figure 11 shows the graph that results from taking a slice of the program dependence graph in Figure 10 with respect to criterion <L10>.

One of our essential objectives is to provide the right slicing criteria for the algorithm, so that the computed IO slice of a task "covers" all the observable behaviors of the original task.

Let [C.sub.IO] ([Tau]) be a set of slicing criteria for IO slice of task [Tau]. Thus [C.sub.IO] ([Tau]) should contain all program points corresponding to the observable operations in the program. Given this, the task decomposition algorithm is given below:

ALGORITHM 4.1.3. Decompose task [Tau] into [[Tau].sup.IO] and [[Tau].sup.State]:

Step 1 Compute the slice of [Tau] with respect to [C.sub.IO]([Tau]) using Algorithm 4.1.2. Delete the unmarked source lines from [Tau], and the result is the generated slice [Tau]/[C.sub.IO]([Tau]), which becomes [[Tau].sup.IO].

Step 2 Delete from [Tau] all statements also contained in [[Tau].sup.IO], except for the conditional statements. Delete conditional statements which guard only an empty statement. The remaining code becomes [[Tau].sup.State].
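Continuing the sketch above, the decomposition of Algorithm 4.1.3 can be expressed as two passes over the marked EPDG; is_predicate[] is an assumed per-node flag, and the final pruning of predicates that guard only empty statements is omitted for brevity.

#include <stdbool.h>

extern void mark_slice(int n);          /* backward EPDG walk from the previous sketch */
extern bool in_slice[];                 /* marked <=> member of the IO slice */
extern bool is_predicate[];             /* node is a control predicate */

void decompose(const int *crit_io, int ncrit, int nnodes,
               bool *in_io, bool *in_state)
{
    for (int k = 0; k < ncrit; k++)     /* Step 1: tau^IO is the union of the slices */
        mark_slice(crit_io[k]);         /*         taken over all criteria in C_IO(tau) */
    for (int v = 0; v < nnodes; v++) {
        in_io[v]    = in_slice[v];
        in_state[v] = !in_slice[v] || is_predicate[v];   /* Step 2 */
    }
}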

Figure 9 shows the two subtasks [Mathematical Expression Omitted] and [Mathematical Expression Omitted] of [[Tau].sub.2] computed by Algorithm 4.1.3 with slicing criteria [C.sub.IO]([Tau]) = {<L1>, <L10>}.

Finally, we produce our transformed task [Tau][prime] by splicing [[Tau].sup.IO] and [[Tau].sup.State] together via sequential composition. We then have to retime the transformed code, as well as separately generate the worst-case execution time for [[Tau].sup.IO], which we denote as [c.sup.IO]. Note that the execution time of [Tau][prime] can easily exceed that of the original [Tau], since (1) control structures are replicated and executed twice and (2) splitting basic blocks may increase the number of register load and store operations [Aho et al. 1986].

4.2 Practical Considerations

As we have stated, the above presentation of real-time slicing is rather idealized; we abstracted out some of the practical considerations that one would have to face when building a production-quality tool. On the other hand, any tool's success will be contingent upon the inherent limitations of static program analysis. In the following discussion, we elaborate on some of these problems and point out some ways of working around them.

Real-time task slicing relies on imprecise static data-flow analysis to compute program slices. For the sake of soundness, we have to be conservative. That is, in practice we have to assume a data dependence relationship between any two statements, unless we can statically prove that one does not modify variables that the other accesses.

This has the greatest impact when we are confronted with function calls. Note that in this article we do not treat interprocedural slicing [Horwitz et al. 1990], which is another difficult problem. Rather, our algorithm will either include the entire call within a slice or not include it - and its decision will be based on static dependence analysis. When a function modifies pointer-aliased data, contemporary analyzers are often forced to assume that the operation may be on any global variable in the program. This will naturally err on the conservative side and will assume that a function call induces dependence relationships where none exist. And at worst, we will end up with code that appears totally unsliceable - when it may, in fact, be amenable to slicing. Fortunately, dependence analyzers are improving at a fast rate (see Hendren et al. [1992] and Pugh and Wonnacott [1992]), and our algorithm will improve along with them. Also, as a last resort, a developer can obtain better slices by manually disambiguating some of the memory references.

As for output and antidependences, note that they are introduced as a result of variable reuse, and they can often be eliminated by preprocessing the code. For example consider Figure 12, which is a modified version of [[Tau].sub.2]'s source of Figure 6. In the modified code, two copy operations "state_new = state; t5_new = t5" have been added to the beginning of the block, and all corresponding uses of these variables have been changed accordingly. As a result, the antidependence between statements L4 and L7 is broken.

Such a procedure is called a rename transformation and is frequently used in many optimizing compilers [Cytron et al. 1986; 1991]. The key idea is presented in Cytron et al. [1986]: (1) split every assignment of the form "x = f (y)" into "x.n = f (y); x = x.n" and (2) change all uses of "x" to the corresponding "x.n." Since the left-hand side of the assignment "x = f (y)" is stored in a temporary variable "x.n," the antidependence on "x" is broken. (Note that special care must be taken with variables that are defined within a loop.) For additional information on issues associated with renaming, we refer the reader to Cytron et al. [1986; 1991].
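The effect on the L4/L7 antidependence of Figure 6 can be seen in the following C fragment; the declarations are assumptions added only to make it compile, and the "after" version reflects the Figure-12-style copy-in rather than a full renaming pass.

extern double F2(double, double), F5(double, double);
extern double state, data, t1, t2, t5;

void before(void)
{
    t2 = F2(state, t5);       /* L4: reads the old value of t5 */
    t5 = F5(t1, data);        /* L7: redefines t5 -> antidependence with L4 */
}

void after(void)
{
    double t5_new = t5;       /* copy made at the top of the block */
    t5 = F5(t1, data);        /* L7 may now execute first (e.g., in the IO slice) */
    t2 = F2(state, t5_new);   /* L4 reads the copy, free of the antidependence */
}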

If we slice the code in Figure 12, we obtain a better IO slice - one which includes more statements but takes less time to execute - as shown below.

L1:  input(Sensor, &data);
     c_1 = !null(data);
L2:  if (c_1) {
       state_new = state; t5_new = t5;
L3:    t1 = F1(state_new);
L6:    t4 = F4(data);
L7:    t5 = F5(t1, data);
L9:    cmd = F7(t1, t4, t5);
L10:   output(Actuator, cmd);
     }

We also rely on achieving reasonable execution time bounds for the code segments. But in the face of more complicated architectures, obtaining tight, static timing bounds is getting more difficult - due to the interplay between pipelines, hierarchical caches, shared memories, register windows, etc.

Thus one can use a two-tier approach to deal with time predictions. A static timing analyzer can be used for a rough, initial estimate. Then after program slicing is completed, the result can be verified with a more sophisticated profiler which actually runs the program. Performing such retiming is especially important in a cached memory structure, where code scheduling will always change the instruction alignment.

5. SCHEDULING THE SLICED TASKS

Assume that we have sliced and spliced some of the tasks in [Gamma]. How can we schedule [Gamma] to ensure that the event-based semantics is maintained? Any such scheme is dependent on two elements:

(1) An online scheduling policy that can exploit our task model - i.e., a task's [[Tau].sup.IO] component has to finish by the end of each period, while [[Tau].sup.State] can slide over frame boundaries, and the precedence constraints between consecutive instances of [Tau] must be maintained.

(2) An associated offline analyzer for the given scheduling policy.

Note that our new task set [Gamma] relaxes the classical model put forth in Liu and Layland [1973]: now some tasks may finish after their periods are over. This fact implies that a rate-monotonic priority assignment may not be the optimal one. Moreover, the response-time equation (Eq. (1)) in Section 3.2 is no longer the right test for schedulability. (It is overly conservative.)

In Gerber and Hong [1993] we present an RMS-based method, in which each sliced task is considered as two separate tasks during schedulability analysis. Thus each original task receives two priorities, one for [[Tau].sup.IO] and one for [[Tau].sup.State]. Its principal strengths are a simple rate-monotonic priority assignment rule and an efficient analysis test to determine schedulability. Its weakness is that the online component lacks the simplicity found in pure, static-priority scheduling. The dual-priority scheme mandates a dynamic priority-exchange mechanism, which in turn requires additional kernel support.

Burns [1994] pointed out these limitations and presented an alternative in which each task is given only a single static priority - thereby obviating the need for the extra kernel support. The online component uses the same preemptive dispatching mechanisms described earlier in Section 3.2, in which priority "ties" are broken in favor of the task dispatched first. Thus, for example, a task's current iteration will finish before the next one starts.

The offline analyzer is constructive, in that it produces a feasible priority assignment if one exists. If no such assignment exists, perhaps the programmer may have to go back to the system design step and consider more aggressive system tuning.

For the actual priority assignment, Burns uses the top-level algorithm presented in Tindell et al. [1994], along with a schedulability test for our event-based model. The algorithm assumes that all tasks have been transformed so that their IO components come first. For example, it takes as input a set [Gamma] of transformed tasks

[Mathematical Expression Omitted]

[Mathematical Expression Omitted]

[Mathematical Expression Omitted]

and produces as output their appropriate priority levels. It starts by looking for a task that can run at the lowest priority (level n).(4) After such a task, say, [[Tau].sub.k] is found, the algorithm proceeds to search the new task set [Gamma] - {[[Tau].sub.k]} for the second lowest priority task, and so on. There is an important fact that leads to the correctness of this algorithm: while a task is being tested for priority level p, all p - 1 tasks whose priorities have not yet been assigned are assumed to run at higher priorities. Since a lower-priority task can never preempt the higher-priority tasks, selections made for priority levels p or below will not affect those above p.

To determine whether a sliced task [Mathematical Expression Omitted] can run at priority p, the offline analyzer in Burns [1994] basically checks whether the following conditions hold:

(1) There exists some integer k [greater than or equal to] 0, such that the kth complete instance of [[Tau].sub.i] finishes by time (k + 1)[T.sub.i]. In other words, there is one period during which the state-update component does not slide over the frame boundary.

(2) Assuming that (1) is true, let M be the first such instance of [[Tau].sub.i] that completes by time (M + 1)[T.sub.i]. Then for all instances q [less than] M, the IO component of the qth instance completes by time (q + 1)[T.sub.i].

Condition (1) is another way of stating that the system's utilization demand does not exceed 1. (This would render the tasks unschedulable under any circumstances.) To understand why, consider the case when condition (1) is false. This means that no instance of [[Tau].sub.i] ever finishes within its [T.sub.i] frame and that every instance (except the 0th) is delayed by its predecessor. This, in turn, implies that there will never be a time when the CPU is unused (otherwise [[Tau].sub.i] would use it). Moreover, there is always an outstanding request for the CPU. Thus the utilization demand exceeds its capacity, i.e., the utilization demand exceeds 1. On the other hand, assume that condition (1) is true, and let M be the first instance of [[Tau].sub.i] that finishes within its [T.sub.i] time frame. During the interval starting at 0, and ending at the time when instance M ends, all requests that arise for the CPU get satisfied. Recall that the worst-case CPU load is highest within this interval, since all tasks originally get released simultaneously [Liu and Layland 1973]. So after instance M ends, either the utilization demand remains at 1, or it drops off. Either way the utilization does not exceed 1.

Condition (2) is required by the event-based semantics, in which the IO-handler always must complete within its time frame. And since the worst-case scheduling conditions will occur before instance M ends, we need to only check the schedulability of [Mathematical Expression Omitted] up to that point.

The proof that conditions (1) and (2) are both necessary and sufficient for schedulability follows directly from the theory of offline analysis presented in Audsley et al. [1993]. (Interested readers are encouraged to consult that paper for details.)

We combine several of the results from Burns [1994] into a single algorithm - Feasible in Figure 13, which checks both condition (1) and condition (2) at the same time. It also incorporates the scheduling analysis for nonsliced tasks, previously presented in Section 3.2. The function accepts the following two parameters:

* L = [[[Tau].sub.1], [[Tau].sub.2], ..., [[Tau].sub.m]]: A list of tasks in the system.

* [[Tau].sub.i]: Another task, not included in L.

The objective is to determine whether [[Tau].sub.i] is schedulable when the tasks in L are run at higher priority - whether or not [[Tau].sub.i] is sliced. In doing this, the algorithm makes use of the following primitive functions:

* sliced([[Tau].sub.i]) : a predicate denoting whether or not [[Tau].sub.i] is sliced.

* [L.sub.1]@[L.sub.2]: The list append operation, i.e., [[[Tau].sub.1], [[Tau].sub.2]] @ [[[Tau].sub.3], [[Tau].sub.4]] = [[[Tau].sub.1], [[Tau].sub.2], [[Tau].sub.3], [[Tau].sub.4] ].

Note that when sliced([[Tau].sub.i]) holds, [[Tau].sub.i]'s computation time [c.sub.i] is the sum of the worst-case times of its two components, i.e., [c.sub.i] = [c.sup.IO] + [c.sup.State].

Feasible starts out testing whether the composite utilization of L and [[Tau].sub.i] exceeds 1. If so, there is no need to proceed further, and the function returns false. Otherwise it calls the helper function Check, which possesses the additional parameter "q" - representing the current invocation of [[Tau].sub.i] being tested. Note that [f.sub.i,q] denotes the computation demand requested by the tasks in [[[Tau].sub.1], ..., [[Tau].sub.m]] @ [[[Tau].sub.i]] within a time interval of size [f.sub.i,q]. Since line (1) iterates a recurrence equation on [f.sub.i,q] until it converges to a fixed point, it finds the first time point at which the CPU becomes idle. Readers are referred to Tindell et al. [1994] for a more comprehensive treatment of this type of recurrence equation.

Consider the case in which [[Tau].sub.i] is not sliced. Since q is initially set to 0, lines (1)-(3) represent exactly the same scheduling test described in Section 3.2. That is, [[Tau].sub.i] is considered schedulable if and only if [f.sub.i,0] [less than or equal to] [T.sub.i]. On the other hand, consider what happens when [[Tau].sub.i] is sliced. Immediately following line (1), [f.sub.i,q] holds the finish time of [[Tau].sub.i]'s entire qth invocation, i.e., that of both [Mathematical Expression Omitted] and [Mathematical Expression Omitted]. The predicate on line (2) checks condition (1) above, namely whether q is the first instance of [[Tau].sub.i] that completes within its [T.sub.i] time frame. If this is not true, then condition (2) is checked, i.e., whether at least the IO portion completes on time. If it does, then the next instance q + 1 is tested, and so on.
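Since Figure 13 is not reproduced here, the following C sketch is only our reading of the Feasible/Check test described above; the function names, the floating-point utilization check, and the treatment of units are all assumptions rather than the article's actual algorithm.

#include <stdbool.h>

/* Iterate the demand recurrence f = base + sum_j ceil(f/T_j)*c_j to a fixed
 * point; tasks 0..m-1 are the higher-priority tasks.  Convergence is
 * guaranteed when the composite utilization is below 1. */
static long demand_fixpoint(long base, int m, const long c[], const long T[])
{
    long f = base;
    for (;;) {
        long next = base;
        for (int j = 0; j < m; j++)
            next += ((f + T[j] - 1) / T[j]) * c[j];
        if (next == f) return f;                /* first instant the CPU goes idle */
        f = next;
    }
}

static bool feasible(int m, const long c[], const long T[],
                     long ci_io, long ci_state, long Ti, bool sliced)
{
    long   ci = ci_io + ci_state;               /* c_i = c^IO + c^State when sliced */
    double u  = (double)ci / Ti;                /* composite utilization test */
    for (int j = 0; j < m; j++) u += (double)c[j] / T[j];
    if (u > 1.0) return false;

    for (long q = 0; ; q++) {                   /* the qth invocation of tau_i */
        long fq = demand_fixpoint((q + 1) * ci, m, c, T);
        if (fq <= (q + 1) * Ti) return true;    /* condition (1): frame met */
        if (!sliced) return false;              /* code-based test failed */
        long fio = demand_fixpoint(q * ci + ci_io, m, c, T);
        if (fio > (q + 1) * Ti) return false;   /* condition (2) violated */
        /* The IO part met its frame while the state part slid over; test q + 1. */
    }
}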

Since the utilization does not exceed 1, this process is guaranteed to terminate. But what is the maximum number of recursions? Letting H be the least-common multiple of the task periods from L and [[Tau].sub.i], we see that the recursion depth will not exceed H/[T.sub.i] - the number of [[Tau].sub.i]'s invocations before the system comes "full cycle." In practice, such a case would be very rare indeed, signifying a state-update component delayed in every instance but the last. In our experience, overloads will typically last one or two periods at the most.

Now recall the unschedulable task set we showed earlier in Table I. For convenience, we reiterate the timing attributes of the three tasks:

Task            Execution Time      Period
[[Tau].sub.1]   [c.sub.1] = 400     [T.sub.1] = 1000
[[Tau].sub.2]   [c.sub.2] = 400     [T.sub.2] = 1600
[[Tau].sub.3]   [c.sub.3] = 570     [T.sub.3] = 2500

Suppose that only [[Tau].sub.2] was sliced. This requires priority rearrangement among the tasks, since the rate-monotonic ordering is no longer optimal in the transformed task model. The new priority ordering is as follows:

[[Tau].sub.1] [predecessor] [[Tau].sub.3] [predecessor] [[Tau][prime].sub.2]

In the next section we show how this ordering is obtained, and why we decided to slice [[Tau].sub.2]. But given that we have an ordering, we can check it using the above tests.

For [[Tau].sub.1]:

[f.sub.1,0] = 400 [less than] [T.sub.1] = 1000

For [[Tau].sub.3]:

[f.sub.3,0] = 570 + 400 = 970 [less than] [T.sub.3] = 2500

For [[Tau][prime].sub.2]:

[Mathematical Expression Omitted]

[Mathematical Expression Omitted]

[Mathematical Expression Omitted]

As a result, the task set is shown to be schedulable under the new priority assignment.

6. PRIORITY ORDERING WITH TASK SLICING

Now we present the missing link, i.e., the algorithm that determines which tasks to slice and which to leave intact. Whereas Burns' algorithm expects that all tasks in [Gamma] are sliced before they are submitted for priority assignment [Burns 1994], we also need a way to determine whether or not to slice in the first place. It is typically not desirable to blindly slice all tasks, due to the execution time overhead incurred. Since we view slicing as a means of tuning an application, it should be applied selectively to tasks which will realize the greatest benefit.

To address this problem, we present an algorithm that not only finds a feasible task priority ordering, but also picks only a small subset of tasks to slice.

For a given list of tasks [Gamma], we refer to a certain permutation [Gamma][prime] of [Gamma] as a configuration, i.e., [Gamma][prime] denotes a priority ordering(5) of the tasks in [Gamma], and sliced([[Tau].sub.i]) has a determined truth value for all [[Tau].sub.i] [element of] [Gamma][prime]. There are n! different priority orderings and [2.sup.n] possible slicing choices. Thus the algorithm's job is to choose a task configuration among [2.sup.n] [center dot] n! distinct ones in a straightforward manner. Our problem is to slice for schedulability, which is complicated by the fact that the sliced and unsliced versions of a task almost never have the same computation time. We call this the "feasible configuration problem."

Definition 6.1 (Feasible Configuration). For a given [Gamma], a configuration [Gamma][prime] is said to be feasible iff all tasks in [Gamma][prime] meet their deadlines under the priority ordering and slicing choice denoted by [Gamma][prime].

6.1 The Algorithm

We now present the selection algorithm, which uses the following variables:

* [Gamma] = [[[Tau].sub.1], [[Tau].sub.2], ..., [[Tau].sub.n]]: The input task list, initially ordered in nondecreasing order of the deadlines. Such a deadline-monotonic ordering is desirable as a starting configuration, since most tasks, except for the small number of tasks to be sliced, will end up consistent with it.

* [L.sub.1] and [L.sub.2]: [L.sub.1] @ [L.sub.2] holds the current list of tasks. In every invocation, Search([L.sub.1], [L.sub.2]) returns either the priority-ordered list of the tasks or false if it cannot find any feasible ordering among them.

In every invocation, Search attempts to assign the last task in [L.sub.1] (variable "[Tau]" in Figure 14) the priority level [absolute value of [L.sub.1]] + [absolute value of [L.sub.2]]. The condition on line (1) denotes that the algorithm has already generated a complete task configuration of [Gamma].

The condition on line (2) means that the algorithm has checked all tasks in lists [L.sub.1] and [L.sub.2] for priority level [absolute value of [L.sub.1]] + [absolute value of [L.sub.2]], but it can assign none of them that priority. Thus false is returned.

Otherwise, the algorithm attempts to find a feasible priority assignment for tasks up to the current priority (line (4)). If this cannot be done without [Tau], then [Tau] is infeasible at the current priority. In this case the tail recursion is invoked on line (12), which will attempt to find another ordering.

But when the higher-priority tasks are schedulable, [Tau] is checked for feasibility with them. If so, a new ordering L@[[Tau]] is returned. Otherwise, the algorithm slices [Tau] and examines whether the sliced version is feasible; if so, it returns L@[[Tau][prime]], and if not, the tail recursion on line (12) is again invoked.
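Figure 14 itself is likewise not reproduced, so the following C sketch is only our reconstruction of Search's control structure, inferred from the description above and from Example 6.1.1; the list representation, the feasible_at() and slice_task() hooks, and the MAXTASK bound are assumptions.

#include <stdbool.h>
#include <string.h>

#define MAXTASK 16

extern bool feasible_at(const int *hp, int nhp, int tau);  /* Section 5 test */
extern void slice_task(int tau);                           /* marks tau as sliced */

/* L1 and L2 are candidate lists of task indices; on success, out[0..n-1]
 * holds the final ordering (highest priority first) and true is returned. */
static bool search(const int *L1, int n1, const int *L2, int n2, int *out)
{
    if (n1 + n2 == 0) return true;                 /* (1) complete configuration */
    if (n1 == 0) return false;                     /* (2) no candidate fits here */

    int tau = L1[n1 - 1];                          /* (3) try the last task of L1 */
    int rest[MAXTASK];                             /*     rest = L1' @ L2 */
    memcpy(rest, L1, (size_t)(n1 - 1) * sizeof(int));
    memcpy(rest + n1 - 1, L2, (size_t)n2 * sizeof(int));

    if (search(rest, n1 + n2 - 1, rest, 0, out)) { /* (4) order the higher priorities */
        if (feasible_at(out, n1 + n2 - 1, tau)) {  /*     unsliced tau fits here */
            out[n1 + n2 - 1] = tau;
            return true;
        }
        slice_task(tau);                           /* (9) slice on demand */
        if (feasible_at(out, n1 + n2 - 1, tau)) {
            out[n1 + n2 - 1] = tau;
            return true;
        }
    }
    int L2p[MAXTASK];                              /* (12) defer tau and try another task */
    memcpy(L2p, L2, (size_t)n2 * sizeof(int));
    L2p[n2] = tau;
    return search(L1, n1 - 1, L2p, n2 + 1, out);
}

A top-level call would pass the task indices in deadline-monotonic order as L1 and an empty (but non-null) list as L2, mirroring the initial call in Example 6.1.1 below.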

Example 6.1.1. We now trace through the algorithm, using our example task set from Table I. Initially Search([[[Tau].sub.1], [[Tau].sub.2], [[Tau].sub.3]], []) is called. Then, on line (3), [L[prime].sub.1] is assigned [[[Tau].sub.1], [[Tau].sub.2]], after which Search([[[Tau].sub.1], [[Tau].sub.2]], []) is recursively called on line (4). As we showed in Section 3.2, the tasks in [L[prime].sub.1] are schedulable; thus Search([[[Tau].sub.1], [[Tau].sub.2]], []) returns [[[Tau].sub.1], [[Tau].sub.2]]. But we also showed in Section 3.2 that [[Tau].sub.3] is not feasible with [L[prime].sub.1]. Thus it gets sliced on line (9). Function Slice([Tau]) returns the sliced version of [Tau]: if [Tau] has not yet been sliced when Slice([Tau]) is called, then it slices [Tau]; otherwise it simply returns the previously sliced version of [Tau].

For the sake of this example, assume that [Mathematical Expression Omitted] and [Mathematical Expression Omitted]. Then the sliced task [[Tau][prime].sub.3] is also infeasible with [L[prime].sub.1], as we show below.

[Mathematical Expression Omitted]

Thus Search([[[Tau].sub.1], [[Tau].sub.2]], [[[Tau].sub.3]]) is called on line (12). In this new invocation, we have [L.sub.1] = [[[Tau].sub.1], [[Tau].sub.2]] and [L.sub.2] = [[[Tau].sub.3]], and on line (3) we have [L[prime].sub.1] = [[[Tau].sub.1]] and [Tau] = [[Tau].sub.2]. On line (4), Search([[[Tau].sub.1], [[Tau].sub.3]], []) is called, which, since [[Tau].sub.1] and [[Tau].sub.3] are mutually schedulable, returns the list [[[Tau].sub.1], [[Tau].sub.3]]. Now, back in the calling invocation, the algorithm finds that Feasible([[[Tau].sub.1], [[Tau].sub.3]], [[Tau].sub.2]) is false. At this point [[Tau].sub.2] is sliced, and then (as we have seen), Feasible([[[Tau].sub.1], [[Tau].sub.3]], [[Tau][prime].sub.2]) is found to be true. Therefore the top-level invocation returns [[[Tau].sub.1], [[Tau].sub.3], [[Tau][prime].sub.2]].

As in the example above, a feasible configuration is usually found without exhaustively exploring the search space, though the algorithm takes exponential time in the worst case; indeed, in the test cases we observed, the algorithm computed a result very quickly. This is due to several reasons. First, the initial priority assignment is made in rate-monotonic order, which is the optimal one if none of the tasks get sliced. Most of the tasks end up having priorities consistent with this initial order. Second, the scheduling analysis included in the "Feasible" function can quickly rule out cases when a task is not schedulable with a set of higher-priority tasks - which will always be true if the task set is inherently unschedulable. Finally, a task is only sliced on demand, when its unsliced version cannot be scheduled within the current configuration.

7. CONCLUSION

In this article we presented a compiler-based system-tuning approach for fixed-priority, preemptive, real-time systems. The approach consists of two interrelated components, namely, a task-slicing algorithm and a fixed-priority scheduling strategy.

The event-based paradigm helps incorporate a higher level of abstraction into real-time domains. As we have shown, the event-based semantics constrains only those operations that are critical to real-time operation, i.e., the events denoted in the specification or those derived from it. Most importantly, it enables our compiler tools to transform the program.

For the underlying scheduling discipline, we have concentrated on rate-based scheduling, since it is one of the best understood areas in the real-time literature and one of the methods most widely embraced by practitioners. But within this field, the tradition has been to consider a "task" as an uninterpreted block of execution time, perhaps with a period, a start time, and a deadline, but no other semantics to speak of. We have shown that once we "open up" the task to consider its event-based semantics, we can help transform an unschedulable application into a schedulable one. We view this type of approach as a promising tool to use in the tuning process, which is certainly preferable to measures such as assembly-level optimization or reimplementation in silicon - two of the more common remedies.

1 Multiprocessor scheduling is often accomplished by first solving an allocation subproblem and then single-processor scheduling subproblems for each node [Tindell et al. 1992].

2 Variants of this method are presented in Audsley et al. [1993] and Lehoczky [1990]; proofs of its correctness can be found in these sources.

3 An input statement is considered to have an empty expression when it is selected as a slicing criterion.

4 For n tasks, n denotes the lowest priority level, and 1 denotes the highest.

5 The first task in the list has the highest priority.

REFERENCES

AHO, A., SETHI, R., AND ULLMAN, J. 1986. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, Mass.

ARNOLD, R., MUELLER, F., AND WHALLEY, D. 1994. Bounding worst-case instruction cache performance. In Proceedings of the IEEE Real-Time Systems Symposium. IEEE, New York, 172-181.

AUDSLEY, N., BURNS, A., RICHARDSON, M., TINDELL, K., AND WELLINGS, A. 1993. Applying new scheduling theory to static priority preemptive scheduling. IEE/BCS Softw. Eng. J. 8, 5 (Sept.), 284-292.

BURNS, A. 1994. Preemptive priority-based scheduling: An appropriate engineering approach. In Principles of Real-Time Systems, S. Sang, Ed. Prentice-Hall, Englewood Cliffs, N.J., 225-248.

BURNS, A., TINDELL, K., AND WELLINGS, A. 1995. Effective analysis for engineering real-time fixed priority schedulers. IEEE Trans. Softw. Eng. 21, 5 (May), 475-480.

CYTRON, R., FERRANTE, J., ROSEN, B., WEGMAN, M., AND ZADECK, F. 1991. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst. 13, 4 (Oct.), 451-490.

CYTRON, R., LOWRY, A., AND ZADECK, K. 1986. Code motion of control structures in high-level languages. In Conference Record of the 13th Annual ACM Symposium on Principles of Programming Languages. ACM Press, New York, 70-85.

DASARATHY, B. 1985. Timing constraints of real-time systems: Constructs for expressing them, method for validating them. IEEE Trans. Softw. Eng. 11, 1 (Jan.), 80-86.

FERRANTE, J. AND OTTENSTEIN, K. 1987. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst. 9, 3 (July), 319-349.

FISHER, J. 1981. Trace scheduling: A technique for global microcode compaction. IEEE Trans. Comput. 30, 478-490.

GERBER, R. AND HONG, S. 1993. Semantics-based compiler transformations for enhanced schedulability. In Proceedings of the IEEE Real-Time Systems Symposium. IEEE, New York, 232-242.

GERBER, R. AND HONG, S. 1995. Compiling real-time programs with timing constraint refinement and structural code motion. IEEE Trans. Softw. Eng. 21, 5 (May), 389-404.

GERBER, R., HONG, S., AND SAKSENA, M. 1995. Guaranteeing real-time requirements with resource-based calibration of periodic processes. IEEE Trans. Softw. Eng. 21, 7 (July), 579-592.

GOPINATH, P. AND GUPTA, R. 1990. Applying compiler techniques to scheduling in real-time systems. In Proceedings of the IEEE Real-Time Systems Symposium. IEEE, New York, 247-256.

HARMON, M., BAKER, T., AND WHALLEY, D. 1992. A retargetable technique for predicting execution time. In Proceedings of the IEEE Real-Time Systems Symposium. IEEE, New York, 68-77.

HENDREN, L., HUMMEL, J., AND NICOLAU, A. 1992. Abstractions for recursive pointer data structures: Improving the analysis and transformation of imperative programs. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM Press, New York, 249-260.

HONG, S. AND GERBER, R. 1993. Compiling real-time programs into schedulable code. In Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation. SIGPLAN Not. 28, 6, 166-176.

HORWITZ, S., REPS, T., AND BINKLEY, D. 1990. Interprocedural slicing using dependence graphs. ACM Trans. Program. Lang. Syst. 12, 1 (Jan.), 26-60.

JAHANIAN, F. AND MOK, A. 1986. Safety analysis of timing properties in real-time systems. IEEE Trans. Softw. Eng. 12, 9 (Sept.), 890-904.

JEFFAY, K. 1983. The real-time producer/consumer paradigm: A paradigm for the construction of efficient, predictable real-time systems. In the ACM/SIGAPP Symposium on Applied Computing. ACM Press, New York, 796-804.

KATCHER, D., ARAKAWA, H., AND STROSNIDER, J. 1993. Engineering and analysis of fixed priority schedulers. IEEE Trans. Softw. Eng. 19, 9 (Sept.).

KLIGERMAN, E. AND STOYENKO, A. 1986. Real-Time Euclid: A language for reliable real-time systems. IEEE Trans. Softw. Eng. 12, 941-949.

KRAUSE, J. 1991. GN&C domain modeling: Functionality requirements for fixed rate algorithms. Tech. Rep. version 0.2, Honeywell Systems and Research Center, Phoenix, Ariz. Dec.

LEHOCZKY, J. 1990. Fixed-priority scheduling of periodic task sets with arbitrary deadlines. In Proceedings of the IEEE Real-Time Systems Symposium. IEEE, New York, 201-209.

LEHOCZKY, J., SHA, L., AND DING, Y. 1989. The rate monotonic scheduling algorithm: Exact characterization and average case behavior. In Proceedings of the IEEE Real-Time Systems Symposium. IEEE, New York, 166-171.

LEUNG, J. AND MERILL, M. 1980. A note on the preemptive scheduling of periodic, real-time tasks. Inf. Process. Lett. 11, 3 (Nov.), 115-118.

LIM, S., BAE, Y., JANG, C., RHEE, B., MIN, S., PARK, C., SHIN, H., PARK, K., AND KIM, C. 1994. An accurate worst case timing analysis for risc processors. In Proceedings of the IEEE Real-Time Systems Symposium. IEEE, New York, 97-108.

LIN, K. AND NATARAJAN, S. 1988. Expressing and maintaining timing constraints in FLEX. In Proceedings of the IEEE Real-Time Systems Symposium. IEEE, New York.

LIU, C. AND LAYLAND, J. 1973. Scheduling algorithm for multiprogramming in a hard real-time environment. J. ACM 20, 1 (Jan.), 46-61.

NIRKHE, V. 1992. Application of partial evaluation to hard real-time programming. Ph.D. thesis, Dept. of Computer Science, Univ. of Maryland, College Park, Md.

OTTENSTEIN, K. AND OTTENSTEIN, L. 1984. The program dependence graph in a software development environment. In Proceedings of the ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments. ACM Press, New York, 177-184.

PARK, C. AND SHAW, A. 1990. Experimenting with a program timing tool based on source-level timing schema. In Proceedings of the IEEE Real-Time Systems Symposium. IEEE, New York, 72-81.

PUGH, W. AND WONNACOTT, D. 1992. Eliminating false data dependences using the Omega test. In Proceedings of the ACM SIGPLAN '92 Conference on Programming Language Design and Implementation. ACM, New York.

SHA, L., LEHOCZKY, J., AND RAJKUMAR, R. 1986. Solutions for some practical problems in prioritized preemptive scheduling. In Proceedings of the IEEE Real-Time Systems Symposium. IEEE, New York, 181-191.

SHA, L., RAJKUMAR, R., AND LEHOCZKY, J. 1990. Priority inheritance protocols: An approach to real-time synchronization. IEEE Trans. Softw. Eng. 39, 1175-1185.

TINDELL, K., BURNS, A., AND WELLINGS, A. 1992. Allocating real-time tasks (an NP-hard problem made easy). J. Real-Time Syst. 4, 2 (June), 145-165.

TINDELL, K., BURNS, A., AND WELLINGS, A. 1994. An extendible approach for analysing fixed priority hard real-time tasks. J. Real-Time Syst. 6, 2 (Mar.), 133-152.

TIP, F. 1995. A survey of program slicing techniques. J. Program. Lang. 3, 3, 121-189.

VENKATESH, G. 1991. The semantic approach to program slicing. In Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation. ACM, New York.

WEISER, M. 1979. Program slices: Formal, psychological, and practical investigations of an automatic program astraction method. Ph.D. thesis, Univ. of Michigan, Ann Arbor, Mich.

WEISER, M. 1983. Reconstructing sequential behavior from parallel behavior projections. Inf. Process. Lett. 17, 3 (Oct.), 129-135.

WEISER, M. 1984. Program slicing. IEEE Trans. Softw. Eng. 10, 352-357.

WOLFE, V., DAVIDSON, S., AND LEE, I. 1991. RTC: Language support for real-time concurrency. In Proceedings of the IEEE Real-Time Systems Symposium. IEEE, New York, 43-52.

XU, J. AND PARNAS, D. 1991. On satisfying timing constraints in hard real-time systems. In the ACM SIGSOFT '91 Conference on Software for Critical Systems. ACM, New York, 132-146.

YOUNIS, M., MARLOWE, T., AND STOYENKO, A. 1994. Compiler transformations for speculative execution in a real-time system. In Proceedings of the IEEE Real-Time Systems Symposium. IEEE, New York, 109-117.

ZHANG, N., BURNS, A., AND NICHOLSON, M. 1993. Pipelined processors and worst case execution times. J. Real-Time Syst. 5, 4 (Oct.).
COPYRIGHT 1997 Association for Computing Machinery, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.

Author: Ramsey, Norman; Fernandez, Mary F.
Publication: ACM Transactions on Programming Languages & Systems
Date: May 1, 1997