# A treatment of computational precision, number representation, and large integers in an introductory Fortran course.

Computational precision is sometimes given short shrift in a first programming course. Treating this topic requires discussing integer and floating-point number representations and inaccuracies that may result from their use. An example of a moderately simple programming problem from elementary statistics was examined. It forced students to confront problems associated with number representations and investigate appropriate ways to circumvent them. In the example, integers were represented using two's complement form, and floating-point numbers according to the Institute of Electrical and Electronics Engineers (IEEE) standard 754, both using 32 bits. Floating-point number representation (64-bit) was necessary to achieve results accurate to two decimal positions. Finally, Fortran 90/95 subprograms were written to implement arbitrarily precise operations with large integers. Students in an introductory course can write subprograms for addition and subtraction of large integers. More sophisticated subprograms that depend on the fast Fourier transform were written to perform multiplication and division of large integers. These subprograms are more suited to an advanced programming or numerical analysis course. The subprograms written to perform large-integer operations were used to determine exact solutions for the statistical problem.

**********

Programming in Fortran 90/95 is taught predominantly to math and science majors at Francis Marion University, but some nonscience majors enroll in the course to satisfy a general education requirement in computer science. As a teacher of Fortran, I am routinely asked why our mathematics department "still teaches Fortran." Yet Fortran is the most widely used scientific programming language (Decyk, Norton, & Szymanski, 1995). Fortran 90/95, the current incarnation of the first scientific programming language, is the language of choice for scientific computation. Many of the features of Fortran 90/95 that give it this distinction are not generally discussed in an introductory course. However, they become important in the sophisticated number-crunching applications encountered in scientific research. While C/C++ may often be perceived as the language with panache, Fortran 90/95 compiled code commonly executes an order of magnitude faster. Fortran 90/95 alone provides standardized language support for parallel computing (Prentice, 1995). Because Fortran is the oldest scientific language, a vast body of Fortran code is in use, and recoding it in another language is economically impractical and typically resisted (Burley, 1998). See the document by Burley, the position taken by Cambridge University's High Performance Computing Facility (2000), and the letter by Prentice for more detailed discussions of Fortran's strengths in scientific computing.

For students enrolled in an introductory programming course, the demands of learning language syntax and mastering the art of progressing from a stated problem to pseudocode, and on to a well-written computer program, often leave little classroom time for issues like the internal representation of numbers and the impact that representation has on numerical precision. This is particularly true when most students in the class have no prior programming experience. Elements of the Institute of Electrical and Electronics Engineers (IEEE) 754-1985 standard, which governs binary floating-point arithmetic on an overwhelming majority of processors, are often relegated to more advanced programming or computer science courses.

Yet such issues can have a profound effect, which can be seen in an introductory setting. Consider the following adaptation of an example from a Fortran text (Nyhoff & Leestma, 1997, pp. 738-739). The following line of Fortran 90/95 code computes the square of the binomial A+B, subtracts all terms except A², and then divides by A²: C = ((A+B)**2-2*A*B-B**2)/A**2. For nonzero values of A, mathematically C = 1. Variables A, B, and C have been typed REAL (Fortran floating-point, single precision). Table 1 shows the computed value C for different values of A when B has the value 1000.0. Results were determined using a desktop PC (Windows 98) with an AMD Athlon processor (Digital Visual Fortran compiler) and an HP Alpha 2100 machine with the Open VMS operating system and associated compiler.
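The effect is easy to reproduce. The following minimal sketch (program name hypothetical) evaluates the expression for A = 0.00001, assuming default 32-bit REAL and IEEE round-to-nearest arithmetic as on the systems above:

```fortran
program binomial_demo
   implicit none
   real :: a, b, c
   a = 0.00001          ! data row 6 of Table 1
   b = 1000.0
   ! Mathematically c = 1 for any nonzero a, but in single precision
   ! a + b rounds to 1000.0, so the numerator cancels to zero.
   c = ((a + b)**2 - 2.0*a*b - b**2) / a**2
   print *, c           ! Table 1 reports 0.0 for this case, not 1.0
end program binomial_demo
```

On the author's platforms, Table 1 reports 0.0 here: the addend 0.00001 is smaller than half a unit in the last place of 1000.0, so it vanishes from the sum entirely.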

The inevitability of errors such as those shown in Table 1 has been a topic of study since the early days of computing. Interval arithmetic is one response to the problem. See the article by Hayes (2003) for an interesting introduction to this field. Using interval arithmetic, real numbers are replaced with intervals on which operations are performed. For example, let [a,b] represent the interval of real numbers between a and b, inclusive. Suppose x and y are not known exactly and that 1 ≤ x ≤ 3 and 5 ≤ y ≤ 8. The sum x + y can be expressed using interval addition as [1,3] + [5,8] = [6,11]. This interval addition replaces the conventional sum x + y where x and y represent known real numbers. A lower bound and an upper bound for x + y are 6 and 11, respectively. Of course, the sum x + y may be inaccurate, but at least the error is bounded and not hidden from view. Unfortunately, other interval operations are not so simple. Although much work has been done, support for interval algorithms in computing hardware has not been realized. Hardware support (as opposed to exclusive software support) for floating-point arithmetic became widely available only after the IEEE published floating-point format standards. Although drafts of standards for interval arithmetic exist, as yet none has been adopted by any standard-setting body (Hayes, 2003).
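The interval sum above can be mimicked with a derived type. This is a minimal sketch (type and names hypothetical) that implements only addition and omits the directed, outward rounding a genuine interval package must apply to keep its bounds rigorous:

```fortran
program interval_demo
   implicit none
   type :: interval
      real :: lo, hi
   end type interval
   type(interval) :: x, y, s
   x = interval(1.0, 3.0)                  ! 1 <= x <= 3
   y = interval(5.0, 8.0)                  ! 5 <= y <= 8
   ! [a,b] + [c,d] = [a+c, b+d]: endpoints bound the unknown sum x + y
   s = interval(x%lo + y%lo, x%hi + y%hi)
   print *, s%lo, s%hi                     ! 6.0  11.0
end program interval_demo
```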

Unlike the obvious errors in computation like those shown in Table 1, some computational inconsistencies are subtler. One example is the lack of uniformity across different computing platforms when the IEEE standard is followed--at least nominally. That is, some components of the IEEE 754 standard may not be followed in a particular computing environment because of the processor or compiler (or both) in use. For an example, consider the internal representation of extremely small floating-point numbers on various platforms. Called denormalized or subnormal numbers in the IEEE standard, some processors allow users to opt out of using these floating-point numbers because of the associated degradation in performance.

Familiarity with floating-point internal representation is needed to understand subnormal numbers. The IEEE specification for a 32-bit floating-point number can be depicted as

S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF

where S represents the sign bit, EEEEEEEE represents the binary exponent with a built-in bias of 127₁₀, and the remaining 23 bits represent the mantissa or significand. Whenever the exponent is nonzero and less than 255₁₀, the number represented is said to be normalized or normal and is determined using (-1)^S × 2^(E-127) × 1.F, where E represents the decimal equivalent of the 8-bit binary exponent EEEEEEEE and 1.F represents the binary number comprising an implied leading one, a binary point, and the 23-bit significand. Thus the smallest positive normalized number that can be represented is 00800000 expressed in hexadecimal, which is approximately 1.17549435 × 10^-38 expressed in decimal. Whenever the exponent is zero and at least one bit of the significand is one, the number is classified as subnormal and is determined using (-1)^S × 2^-126 × 0.F, where the implied leading one has been replaced with a zero and the exponent modified. The smallest positive subnormal number that can be represented is 00000001 (hexadecimal), which is approximately 1.40129846 × 10^-45 (decimal). Gradual underflow is said to occur when subnormal numbers are used; otherwise, numbers smaller than 1.17549435 × 10^-38 are flushed to zero. So even when two systems claim IEEE compatibility, computational results using extremely small operands can differ in practice if the platforms do not agree in their use of gradual underflow.
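Both boundary values just cited can be inspected directly with the intrinsic TRANSFER, assuming 32-bit default REAL and INTEGER (a safe assumption on the platforms discussed):

```fortran
program ieee_bounds
   implicit none
   integer :: bits
   real :: x
   bits = 8388608          ! Z'00800000': exponent 1, significand 0
   x = transfer(bits, x)   ! reinterpret the bit pattern as a REAL
   print *, x              ! the smallest positive normal, ~1.17549435E-38
   bits = 1                ! Z'00000001': exponent 0, lowest significand bit set
   x = transfer(bits, x)
   print *, x              ! the smallest subnormal, ~1.40129846E-45
end program ieee_bounds
```

The first value printed equals the intrinsic TINY(x); the second lies below it and is representable only where gradual underflow is supported.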

As an illustration, consider a simple Fortran 90/95 program to divide 0.1E-5 by 0.1E35. The program was compiled with the Open VMS operating system's Fortran 90 compiler on the HP Alpha 2100 machine. At first the compilation was accomplished with gradual underflow engaged. The binary result was 0 00000000 0000 0010 0010 1101 1000 010, which exactly equals

Σ_{n ∈ {7,11,13,14,16,17,22}} 2^-n × 2^-126

This subnormal number is approximately equal to 0.99999461011147595815 × 10^-40 in decimal. The second time, the program was compiled with gradual underflow turned off so that subnormal numbers were flushed to zero. The result was 00000000 expressed in hexadecimal, which represents 0.0 in decimal.
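The experiment is easy to repeat on any IEEE platform; whether the quotient below is a subnormal near 10^-40 or exactly 0.0 depends on whether gradual underflow is engaged:

```fortran
program gradual_underflow
   implicit none
   real :: num, den, q
   num = 0.1e-5
   den = 0.1e35
   q = num / den   ! mathematically 1.0e-40, below TINY(q) ~ 1.18e-38
   print *, q      ! a subnormal near 1.0E-40, or 0.0 if flushed to zero
end program gradual_underflow
```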

An introductory programming student should know it is a mistake to consider a result correct simply because it was determined by executing a correctly coded computer program. Generally only six to nine significant decimal digits will be accurate when using single-precision, floating-point numbers (storage type REAL [4 bytes] in Fortran) in computation (Sun Microsystems, Inc., 2004, chap. 2).

A SIMPLE STATISTICAL PROBLEM TO EXPOSE COMPUTATIONAL LIMITATIONS

In Fortran 90/95, what Fortran old-timers refer to as double precision is implemented using the KIND attribute when declaring a variable of type REAL. With most compilers the default is KIND = 4, which calls for a floating-point value to be stored using 32 bits, whereas KIND = 8 causes 64 bits to be allocated for a single floating-point number. Even with 64-bit representation, only 15 to 17 accurate decimal digits can be expected for floating-point operations in general (Sun Microsystems, Inc., 2004, chap. 2). In an introductory Fortran course at Francis Marion University, students were assigned a fairly simple numerical problem whose solution (accurate to the nearest integer value) is impossible to determine without using double precision. Students were asked to write a Fortran 90/95 program to randomly generate 1000 integers, each between 1 and 100,000 inclusive, and write them to a sequential file. Additionally, the program was required to rewind the file and read the first n integers from the file to compute the mean and variance of that subset of n integers, where 1 ≤ n ≤ 1000. For uniformity across the class, each user would provide the value 500 for n at runtime. The students had been instructed previously regarding the generation of such a set of random integers. Furthermore, all students used the same value to seed the random number generator with the Fortran statement: call random_seed (put = (/ (16807, i = 1,10) /)). This ensured that every student's program generated the same sequence of 1000 random integers upon iteratively invoking the intrinsic subroutine random_number. The students were instructed to use the following familiar computational formulas to compute the mean x̄ and variance s² of the 500 integers:

x̄ = (Σ_{i=1}^{n} x_i) / n and s² = [Σ_{i=1}^{n} x_i² - (1/n)(Σ_{i=1}^{n} x_i)²] / (n - 1).
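The accumulation these formulas require can be sketched as follows, shown with a small hypothetical data set in place of the 500 random integers, and with the 64-bit REAL accumulators discussed later in the article:

```fortran
program stats_demo
   implicit none
   integer, parameter :: dp = selected_real_kind(15)   ! 64-bit REAL
   integer, parameter :: n = 5
   integer :: x(n) = (/ 99470, 5, 34712, 67890, 100000 /)  ! stand-in data
   integer :: i
   real(dp) :: sumx, sumsq, mean, var
   sumx  = 0.0_dp
   sumsq = 0.0_dp
   do i = 1, n
      sumx  = sumx  + real(x(i), dp)       ! accumulate the sum...
      sumsq = sumsq + real(x(i), dp)**2    ! ...and the sum of squares
   end do
   mean = sumx / n
   var  = (sumsq - sumx**2 / n) / (n - 1)  ! the computational formula above
   print *, mean, var
end program stats_demo
```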

Assume the intrinsic subroutine random_number provides uniform deviates x in the range 0 ≤ x < 1, variable Rval has been declared REAL, and variables a, b, and Rint have been typed INTEGER. Then the Fortran code

Call random_number(Rval)

Rint = a + int((b - a + 1)*Rval),

where a = 1 and b = 100,000, will store in variable Rint an integer from 1 to 100,000 that can reasonably be expected to belong to a uniform distribution on the integers from 1 to 100,000. If we assume that each random integer x_i comes from such a distribution, then

E(Σ_{i=1}^{500} x_i) = 25,000,250 and E(Σ_{i=1}^{500} x_i²) ≈ 1.67 × 10^12,

where E(·) denotes expectation. Computing the variance of the 500 randomly generated integers will require calculating and possibly storing the sum of the squares of the 500 integers in memory. Most students in the class, if not all, were unaware that the sum of the squares could approach 1.7 trillion.

With 32-bit Integer Representation

In the discussion that follows, values were computed using the AMD processor and Digital Visual Fortran compiler unless otherwise noted. Not surprisingly, after obtaining a collection of 1000 random integers, some students used variables of type INTEGER to accumulate the necessary sums:

Σ_{i=1}^{n} x_i and Σ_{i=1}^{n} x_i², where n = 500. Most computer systems in use today use two's complement format to represent signed integer values. The largest positive 32-bit signed integer that can be represented in two's complement form is 0111 1111 1111 1111 1111 1111 1111 1111₂, or 2,147,483,647₁₀, so a likely value for the sum of the squares of the randomly generated integers cannot be represented using 32-bit two's complement form. For the set of integers randomly generated using the AMD processor, the true sum of the first 500 integers was 25,090,091 and the true sum of squares of the first 500 integers was 1,673,244,076,909. When students' programs were executed using 32-bit integer storage for the sum and sum of squares of the integers, the internal representation of the sum of the 500 integers was accurate (0000 0001 0111 1110 1101 1000 0010 1011₂ = 25,090,091₁₀). The stored binary representation of the sum of squares (1001 0101 0001 1110 0110 1011 0110 1101₂, or -1,793,168,531₁₀) was computed incorrectly due to integer overflow. This incorrect result, in turn, yielded an incorrect, negative value for the variance of the 500 integers (-3,590,757.75). These values are summarized in Table 2. See Table 5 for exact values of the sum, sum of squares, mean, and variance of the 500 integers.
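That the sum of squares cannot fit is easy to confirm with the intrinsic HUGE, assuming the compiler offers an 8-byte integer kind (written here with the byte-count kind numbers used on the compilers discussed; SELECTED_INT_KIND is the portable alternative):

```fortran
program int_range
   implicit none
   integer(8) :: true_ssq
   true_ssq = 1673244076909_8               ! true sum of squares (Table 5)
   print *, huge(1_4)                       ! 2147483647: largest 32-bit signed integer
   print *, true_ssq > int(huge(1_4), 8)    ! T: overflow in 32 bits is inevitable
end program int_range
```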

A negative variance may not signal a problem to students who have not been exposed to elementary statistics, so the standard deviation was required in addition to the variance, forcing students to confront and address the overflow problem. Of course, computing the standard deviation results in a runtime error in this exercise. Experience shows that once this error manifests at runtime, unless warned in advance, many novice programming students will fail to diagnose the true source of the problem. That is, students may recognize the negative radicand in the standard deviation calculation, but not easily determine why it is negative. This is notwithstanding their earlier exposure to internal integer representation and the integer limit 2,147,483,647.

Students should be reminded frequently to make certain that all runtime error-checking options offered by the compiler are engaged. This ensures that underflow and overflow errors (both integer and floating-point) are detected at runtime, as are errors generated by attempts at illegal array access and division by zero.

With 32-bit Floating-Point Representation

What programming strategy do most students pursue to address these difficulties? If students are not aware of 64-bit floating-point representation at this point in their study, then the next step usually involves the default 32-bit floating-point representation. Many Fortran programming environments to which students have access do not support 64-bit representation for integers anyway. (The largest integer that can be represented using 64-bit signed integer format is approximately 9.2 × 10^18, so 64-bit integer representation would suffice for correctly determining the sum of the squares of the 500 integers.) To implement 32-bit floating-point representation, certain program modifications were necessary. Variables used to collect sums had to be declared REAL and random integers converted to type REAL prior to being summed. After the necessary program modifications were made and the results recalculated, the internal floating-point representation of the sum of the 500 integers was no longer accurate (0 10010111 01111110110110000010010 = 25,090,084.0₁₀). The floating-point binary representation of the sum of squares was 0 10100111 10000101100101010010110, or 1,673,244,966,912.0₁₀; likewise, this value was incorrect. After further 32-bit computation, the sums noted above yielded 830,100,864.000 for the variance. These values are shown in Table 3. Note that this value for the variance differs from the true value by over 3000.

These results represented a best-case scenario for students using 32-bit floating-point numbers. This group of students used the code that follows to sum the squares of the integers.

SumSQ = 0.0

Do I = 1, n

SumSQ = SumSQ + Real(Rint)**2

End do

SumSQ was 32-bit floating point (REAL). I and n = 500 were typed INTEGER. Rint (an INTEGER variable) represented one of the 500 random integers. Other students omitted the call to the intrinsic function Real, which converted Rint to floating-point form prior to the exponentiation operation. In that case, without the call to function Real, the square of the random integer (Rint**2) was evaluated using integer multiplication of Rint by itself. For 272 of the 500 random integers, the value of Rint**2 exceeded the largest 32-bit signed integer available. As a result, 272 incorrect values were added to accumulator variable sumSQ--another subtle way to miss the true value of the variance.

Students will be satisfied with these incorrect results, unless they have been given the correct values for the sum and variance of the 500 integers in advance. Even though the standard deviation of the 500 integers is required, students may not be alarmed since the variance is positive. The students' backgrounds and inherent talents determine how demanding the instructor should be. In the author's case, because this topic occurs relatively early in the course where loop constructs are the focus, and since many students have no prior programming experience, the floating-point results previously mentioned (variance = 830,100,864.000) were accepted, whereas the integer results (negative variance) were not. When the programming assignment was returned, the 64-bit floating-point representation was discussed and the correct values revealed. The author had determined values for the statistics that are accurate to two decimal positions using 64-bit floating-point representation. Also Maple was used to check the results. See Table 4 for a summary of these results.

Integer Representation, Floating-Point Addition, and Rounding Error

First a little background. Floating-point arithmetic sometimes enjoys unearned respect among students because floating-point numbers do not suffer from the range limitation inherent in integer representation of numbers. Indeed, 32-bit floating-point numbers that follow IEEE standard 754 range approximately between ±3.4028 × 10^38. This sometimes obscures the fact that only finitely many real number values in that range are representable by any computer. A commonly cited example is 0.1₁₀, which can be expressed in integer powers of 2 as 0.1₁₀ = (3/2) Σ_{n=1}^{∞} (1/2)^{4n}. The number 0.1₁₀ is stored in 32-bit binary floating-point form as 0 01111011 10011001100110011001101. This binary sequence represents +1.10011001100110011001101 × 2^-4, or 0.100000001490116119384765625 in decimal. Of course, the problem is that 0.1₁₀ cannot be expressed as a finite sum of integer powers of two. The alternative, smaller than 0.1₁₀ and nearest 0.1₁₀, is +1.10011001100110011001100 × 2^-4, or 0.0999999940395355224609375 in decimal. The representation actually used is the first since it represents a value closer to 0.1.
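Printing the stored value with more digits than list-directed output normally shows makes the misrepresentation visible, assuming 32-bit default REAL:

```fortran
program point_one
   implicit none
   real :: x
   x = 0.1
   ! The stored single-precision value is exactly 0.100000001490116119384765625,
   ! the nearest representable neighbor of one tenth.
   print '(f30.27)', x
end program point_one
```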

The IEEE 754 standard specifies four rounding modes: to the nearest representable value, toward negative infinity, toward positive infinity, and toward 0 (Sun Microsystems, Inc., 2004, chap. 2). The default rounding mode is rounding to the nearest representable value as evidenced above. In the event that two representable values are equally close to the actual value, the representation whose least-significant bit is zero is used (HP-UX Floating-Point Guide: HP 9000 Computers, 1997).

Integer representations in memory are exact, and integer addition results are exact, provided the addends and sum do not exceed the range of integer representation, which is -2,147,483,648 to 2,147,483,647 for the default 4-byte signed integer. The error in the statistical problem described above, in which integer addition was used, arose because the sum of the squares of the 500 integers exceeded 2,147,483,647. Recall that the sum of the 500 integers, 25,090,091, was accurately represented. However, the computation of the sum of squares of the 500 integers literally did not leave square one. The first randomly generated integer, 99470, when squared is 9,894,280,900. With 499 squares remaining to be added, the sum was already outside the range of integers that can be represented using 32 bits. The two's complement, signed-integer, binary result of squaring 99470 translated into 1,304,346,308 (instead of 9,894,280,900) in decimal form.

Recall that 32-bit floating-point data representation was used next in an attempt to determine accurate values for the statistics. Even though the sum of the 500 integers was correctly determined using integer representation, in this exercise, the attempt to accurately compute that sum using 32-bit floating-point numbers failed due to rounding error. The rounding error surfaced after 337 integers had been accurately accumulated yielding 16,709,895. The 338th randomly generated integer was 79016. When added to the previously accumulated sum of 16,709,895, the result was incorrect due to rounding error. It is easy to see how this can occur. Addition of two floating-point numbers is accomplished by first right-shifting the bits of the significand of the number with the smaller magnitude. This is done until the two numbers have the same exponent after which the addition is performed. Lower-order bits can be lost during this right-shifting of bits, causing an error. However, that did not occur in this case. Instead, the normalization of the resulting sum, in preparation for its storage in memory, caused the lowest-order bit to be lost. Figure 1 shows how the addition is performed and the sum normalized. The value stored in memory for the sum of the first 338 integers was 16,788,912.0. The correct value, 16,788,911.0, could not be stored exactly using 32 bits.
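The failing addition can be reproduced in isolation. Both addends are exactly representable, but the true sum, 16,788,911, is odd and larger than 2^24, so the nearest representable (even) neighbor is stored instead:

```fortran
program lost_bit
   implicit none
   real :: partial, next, total
   partial = 16709895.0      ! exact: below 2**24 = 16777216
   next    = 79016.0         ! exact: the 338th random integer
   total   = partial + next  ! true sum 16788911 needs 25 significand bits
   print *, total            ! 16788912.0, off by one
end program lost_bit
```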

Returning briefly to Table 1, we are now able to examine the case for one value of A and show how such surprising results were computed from an apparently innocuous expression such as C = ((A + B)**2 - 2*A*B - B**2)/A**2. Recall that B = 1000.0 and consider the case where A = 0.00001. (See data row 6 in Table 1.) Figure 2 shows how the value A + B was computed internally, assuming the sum was to be stored in memory. The right-shifting needed to represent the smaller addend (0.00001) with the same exponent as the larger addend (1000.0) in preparation for adding the two numbers forced nonzero bits out of the 23-bit significand of the smaller addend. Thus, using 32-bit addition, 0.00001 essentially became zero prior to being added to 1000.0. Actually, internal floating-point registers in many microprocessors today are 80 bits in length. In this case A + B is computed more accurately (in a CPU register) than the 32-bit representation of the sum A + B indicates. A more accurate intermediate result for A + B may be squared and then added to -2AB and -B² to produce the numerator, (A + B)**2 - 2*A*B - B**2, which is to be divided by A² before the final result is stored in memory. Still, one can see that using 80-bit internal floating-point registers did not ensure an accurate result, since the stored value for variable C was, in the end, 0.0, not 1.0 as it should have been.

[FIGURE 1 OMITTED]

[FIGURE 2 OMITTED]

LARGE-INTEGER ARITHMETIC AND ARBITRARY PRECISION

Recall that using 32-bit floating-point representation in the statistical example resulted in an incorrect value for the sum of the first 338 random integers. The first instance of lost accuracy manifested in the eighth significant digit, since 16,788,912₁₀ was incorrect by one. (Recall the correct value was 16,788,911₁₀.) As stated earlier, when IEEE 32-bit floating-point representation is used, the number of digits that can be expected to be accurate is at least 6, but not more than 9. That range grows to at least 15, but not more than 17, when 64-bit floating-point representation is used (Sun Microsystems, Inc., 2004, chap. 2). Thus, accurate results for both the sum and the sum of squares of the 500 randomly generated integers could be (and were) obtained using 64-bit floating-point representation. The preferred method of data representation for accumulating the sums in this specific problem is 64-bit integer representation. As stated earlier, on another computing platform this would have been possible; Alpha processors, for example, support 64-bit integer representation. However, there is a largest representable integer even with 64-bit integer representation. This limitation is absent when using computer algebra systems (CAS) like Maple because such programs are capable of performing arithmetic without error using integers that contain an arbitrary number of digits. In fact, such arbitrary precision is commonly used, and routines for its implementation are available without charge on the Internet.

The term arbitrary precision arithmetic generally refers to mathematical operations on floating-point numbers that result in accuracy to the nth decimal position. The value of n is arbitrary and limited only by choice of the user or technology constraints (such as computer memory or time). If integer values are involved exclusively, the term arbitrary precision refers to the capacity to achieve the correct integer result regardless of the number of digits in the operands or the operation to be performed. Large-integer arithmetic, which refers to the latter, would be put to good use in the case of the variance computation previously discussed. Indeed, the topic is a very interesting one and can be used, albeit in a small way, in an introductory Fortran programming course.

Large-Integer Addition and Derived Data Types

Implementation of large-integer arithmetic requires a collection of routines to perform the basic arithmetic operations as well as ancillary functions like basic input and output of large integers, bit shifting, initialization and comparison of large integers, and perhaps conversion of large integers from one base to another. Many of these routines are more appropriately studied in an advanced programming course or a course in numerical analysis, but a few can be used in an introductory programming course with minor modifications. That is, the concepts can be engaged, even if the requirement of arbitrary precision is relaxed. For example, the addition of two positive integers is conceptually simple regardless of the number of digits involved. After studying arrays and how to pass them in and out of subroutines and function subprograms, students can be required to write a subprogram that will accept two large integers, each having a specified number of digits, and compute and return their sum.

For example, consider the following assignment that was made in an introductory Fortran class. Students were instructed to write a Fortran 90/95 main program and subprogram pair to compute the sum of the two 34-digit integers 6187435869037950432062599440847266 and 7734294836297438040078249301059024. The calling program had to read the integers from an ASCII file, store them as integer arrays (one digit per array element), and pass the arrays into the subroutine or function, which computed and returned the 34-digit or 35-digit sum for output. Most students did not find this too formidable once they realized the required algorithm was the one for base-ten addition encountered in the third grade. Stronger students could proceed to some of the more sophisticated tasks, such as multiplication of two large integers or performing addition or multiplication in bases other than ten. Unfortunately, time is usually too limited for this.
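A minimal version of such a subprogram follows (module and routine names hypothetical; digits stored least-significant first). It is demonstrated on small operands, since the grade-school algorithm scales unchanged to the 34-digit operands of the assignment:

```fortran
module big_addition
   implicit none
contains
   subroutine add_big(a, b, s, nd)
      ! a, b: base-ten digit arrays, least-significant digit in element 1
      ! s:    sum digits (caller provides room for one extra digit)
      ! nd:   number of digits in the sum
      integer, intent(in)  :: a(:), b(:)
      integer, intent(out) :: s(:), nd
      integer :: i, carry, t
      s = 0
      carry = 0
      do i = 1, max(size(a), size(b))
         t = carry
         if (i <= size(a)) t = t + a(i)
         if (i <= size(b)) t = t + b(i)
         s(i) = mod(t, 10)     ! keep one digit in this column
         carry = t / 10        ! carry into the next column
      end do
      nd = max(size(a), size(b))
      if (carry > 0) then      ! a final carry lengthens the sum by one digit
         nd = nd + 1
         s(nd) = carry
      end if
   end subroutine add_big
end module big_addition

program test_add
   use big_addition
   implicit none
   integer :: a(3) = (/ 8, 5, 9 /)   ! 958, least-significant digit first
   integer :: b(2) = (/ 7, 4 /)      ! 47
   integer :: s(4), nd
   call add_big(a, b, s, nd)
   print *, s(nd:1:-1)               ! 1 0 0 5, i.e., 958 + 47 = 1005
end program test_add
```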

A discussion of derived data types, a topic which often appears later in Fortran texts, is suitable for an introductory class, and large-integer arithmetic offers an interesting opportunity for that discussion. For example, Figure 3 shows a module containing a derived data type used by the author when writing and implementing a collection of routines for large-integer arithmetic. A large integer was stored digit-by-digit in the array named Digit in the data structure named bigInteger. Within that data structure, a single integer variable ILD (index of last digit) contained the location within the array Digit of the most significant digit of the stored integer. The logical variable Neg was set to "true" or "false" depending on the sign of the stored large integer. For example, the base-ten large integer 2,704,397,803,264,985 would be stored as shown in Figure 4. The "arbitrariness" of the precision is implemented by setting the parameter lenDigitArray (Figure 3) to an integer value that is larger than the maximum number of digits needed for an operation to be performed, which depends on the operation and the number of digits in the operands. Alternatively, the number of digits required for an operation could be determined at runtime, after which arrays for storing operands and results could be dynamically allocated. In this case, the base to be used is set as ten, but it can be any integer larger than one. Parameter LP is set to the value of the largest possible index for array Digit after the parameter lenDigitArray is set. Parameter digitKind must be set to the length in bytes of an intrinsic integer data type offered by the computing platform in use; it determines the integer storage type used for elements of array Digit and variable ILD. Here digitKind was set to two bytes. Although using just one byte to store each "digit" would have been more frugal, that would have limited the magnitude of a digit to 127₁₀. Bases larger than 128₁₀ are commonly used for large-integer arithmetic, so two bytes were used to store each digit. Parameter MAXALPHA was used during I/O operations as a ceiling for the number of bytes of ASCII input when reading in a large integer; of course, it can be increased. Finally, parameter maxIntKind must be set to indicate the largest integer type supported by the platform in use. Here that was four, since an eight-byte integer was unavailable.
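Because Figure 3 is not reproduced here, the module below is only a sketch of a derived type with the components just described; the parameter values are illustrative, and the byte-count kind numbers match the compilers discussed (SELECTED_INT_KIND is the portable alternative):

```fortran
module bigIntegerModule
   implicit none
   integer, parameter :: digitKind     = 2    ! bytes of storage per digit
   integer, parameter :: lenDigitArray = 500  ! capacity in digits (illustrative)
   integer, parameter :: LP   = lenDigitArray ! largest index into array Digit
   integer, parameter :: base = 10            ! any integer larger than one would do
   integer, parameter :: MAXALPHA   = LP      ! ceiling on bytes of ASCII input
   integer, parameter :: maxIntKind = 4       ! largest integer kind assumed available
   type :: bigInteger
      integer(digitKind) :: Digit(LP)  ! one digit per element
      integer(digitKind) :: ILD        ! index of the most significant digit
      logical            :: Neg        ! .true. for a negative large integer
   end type bigInteger
end module bigIntegerModule
```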

Multiplication

Using the data structure discussed previously, the author wrote functions to perform addition (routine Add) and comparison (routine Comp) of two positive large integers. A function (Subt) for the subtraction of two positive large integers followed. If negatives were involved in either operation, addition or subtraction, the calling program had to examine the signs of the two large integers and possibly change the operation based on that examination. When a subtraction was to be performed (with two positive operands), function Subt invoked the comparison function Comp to determine which operand was smaller. Function Subt then subtracted the smaller integer from the larger using the familiar algorithm for subtraction.

It is doubtful that enough time can be devoted to large-integer arithmetic in an introductory course for multiplication to be treated. Multiplication, however, presents interesting challenges better suited to an advanced programming course or a course in numerical analysis. Of course, like addition and subtraction, multiplication may be performed using the familiar algorithm taught early in one's formal education. The author wrote function multOneDigit to multiply a positive large integer by a single digit, and function shiftLeft to shift all digits in a large integer left by an arbitrary number of places in the array Digit. To demonstrate, suppose 6452 is to be multiplied by 497. The comparison function Comp was used to determine that the larger of the two factors is 6452. A large-integer variable called Sum was initialized to zero, and multOneDigit was called to multiply 6452 by seven. The product was left-shifted zero times and added to Sum using function Add. Next, 6452 was multiplied by nine using multOneDigit; the product was left-shifted once and added to Sum. Finally, 6452 was multiplied by four, after which the product was left-shifted two positions and added to Sum. After the multiplicand had been multiplied by each digit of the multiplier and the results accumulated with appropriate shifting, variable Sum contained the product. If both large integers comprise N digits, this traditional hand method for multiplication requires O(N^2) operations (Press, Teukolsky, Vetterling, & Flannery, 1992, p. 909). In practice, this is too slow.
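
The hand method just described can be sketched compactly. The Python below is an informal analogue of multOneDigit, shiftLeft, and the shift-and-add accumulation loop, not the author's Fortran; digits are stored least-significant-first:

```python
def add(a, b, base=10):
    """Grammar-school addition of digit lists (least-significant-first)."""
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        s = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry
        out.append(s % base)
        carry = s // base
    if carry:
        out.append(carry)
    return out

def mult_one_digit(a, d, base=10):
    """Multiply a digit list by a single digit (cf. multOneDigit)."""
    out, carry = [], 0
    for x in a:
        p = x * d + carry
        out.append(p % base)
        carry = p // base
    while carry:
        out.append(carry % base)
        carry //= base
    return out

def shift_left(a, k):
    """Prepend k low-order zeros, multiplying by base**k (cf. shiftLeft)."""
    return [0] * k + a

def multiply(a, b, base=10):
    """Hand-method multiplication: O(N^2) for two N-digit integers."""
    total = [0]
    for k, d in enumerate(b):
        total = add(total, shift_left(mult_one_digit(a, d, base), k), base)
    return total
```

Multiplying the digit lists for 6452 and 497 this way yields the digit list for 3,206,644, at a cost of one digit product for every pair of digits.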

In recent years much research has focused on large-integer arithmetic. Indeed, Crandall and Pomerance (2002) stated:
``` The art of large-integer arithmetic has, especially in modern times,
sustained many revisions. Just as with the fast Fourier transform
(FFT) engineering literature itself, there seems to be no end to the
publication of new approaches, new optimizations, and new
applications for computational number theory. (p. 433)
```

Much of the focus has been on a search for faster methods of multiplication. Mention the fast Fourier transform (FFT) and most think of James Cooley and John Tukey, who published a now-famous paper in April 1965 presenting an algorithm that computes a discrete Fourier transform (DFT) more efficiently than was previously thought possible (Cooley & Tukey, 1965). Cooley and Tukey are generally credited with the algorithm now known as the FFT, although others had worked earlier to reduce the computational effort needed for computing a DFT; Gauss described an essentially equivalent algorithm as early as 1805. For an interesting history of the development of the FFT algorithm, see the article written by Cooley (1990) himself. Another excellent article on the subject traces the development of the FFT from the time of Gauss (Heideman, Johnson, & Burrus, 1984). Direct evaluation of a DFT is an O(N^2) calculation, while an FFT calculation is O(N log2 N). To appreciate the difference, consider a calculation involving 1000 components (for example, a large integer comprising 1000 digits). If the FFT calculation required one second, the same DFT calculation would require roughly one minute, 40 seconds. If an FFT calculation involving 10^6 components required ten seconds, the same DFT calculation would require almost six days of computer time!
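
The arithmetic behind these comparisons is simple to verify. The Python sketch below (illustrative only) computes the ratio of the two operation counts, N^2 versus N log2 N, which reduces to N / log2 N:

```python
import math

def dft_over_fft(n):
    """Ratio of O(N^2) DFT operations to O(N log2 N) FFT operations."""
    return (n * n) / (n * math.log2(n))   # simplifies to n / log2(n)

# 1000 components: an FFT taking 1 s implies a DFT taking ~100 s,
# that is, roughly one minute, 40 seconds.
dft_seconds_1000 = 1.0 * dft_over_fft(1000)

# 10**6 components: an FFT taking 10 s implies a DFT taking ~5.8 days.
dft_days_million = 10.0 * dft_over_fft(10**6) / 86400.0
```
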

The FFT can be used to multiply two large integers. Since the familiar hand method for multiplication, like direct DFT evaluation, is an O(N^2) calculation, the FFT offers a dramatic improvement, even for large integers with as few as 100 digits. See Figure 5 for an illustration of the multiplication method to which an FFT calculation can be applied. The conventional hand multiplication of two three-digit integers is shown in panel (A). In panel (B), by contrast, each digit of the multiplicand is multiplied by each digit of the multiplier, but no carries are performed as in the conventional method. The columns are then added, after which the usual carry operation is enforced. The next-to-last line (10 32 85 86 72) in panel (B) is an example of a convolution of the digits of the multiplicand and the digits of the multiplier. One of the many uses of the FFT is computing the convolution of two sequences of components, such as the digits of a multiplicand and multiplier. Afterward, only the simple carry operation must be implemented to determine the resulting product.

[FIGURE 5 OMITTED]
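
Panel (B)'s carry-free scheme is straightforward to express in code. In the Python sketch below, the factors 568 and 249 are simply one pair of three-digit integers consistent with the convolution "10 32 85 86 72" quoted above (the figure itself being omitted here); digits are stored least-significant-first:

```python
def convolve_digits(a, b):
    """Carry-free, digit-by-digit multiplication as in panel (B):
    c[k] collects every product a[i]*b[j] with i + j = k."""
    c = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            c[i + j] += x * y
    return c

def carry(c, base=10):
    """Propagate carries through a convolution to recover the product."""
    out, k = [], 0
    for v in c:
        k += v
        out.append(k % base)
        k //= base
    while k:
        out.append(k % base)
        k //= base
    return out

# 568 * 249, digits least-significant-first
conv = convolve_digits([8, 6, 5], [9, 4, 2])
# Read most-significant-first, conv is 10 32 85 86 72; applying the
# carry step turns it into the digits of the product, 141,432.
```
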

A function subprogram named Convolution was written to perform the multiplication of two large integers using this method. The implementation of the FFT is delicate in ways not discussed here, but the central idea derives from the convolution theorem: the component-wise product of the individual FFTs of the digit sequences of the two large integers equals the FFT of the convolution of those two sequences. The "fast DFT" routine z_fast_dft from the IMSL Fortran 90 MP Library is called by function Convolution to first calculate the FFT of each large integer's sequence of digits. These sequences are the analogues of digital signals in the field of digital signal processing, where the FFT is used in the same way. The two calculated FFTs are then multiplied component-wise, and z_fast_dft is called again with the input required to force computation of the inverse FFT of the product just described. This yields the desired convolution. It only remains for function Convolution to perform the carry operation needed to determine the product of the two large integers. (Referring to Figure 5, the carry operation transforms the convolution represented by "10 32 85 86 72" into the product 141,432.)
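
Readers without access to the IMSL routine z_fast_dft can still experiment with the method. The following self-contained Python sketch substitutes a textbook recursive radix-2 Cooley-Tukey FFT for z_fast_dft; it illustrates the convolution theorem at work and is not the author's implementation:

```python
import cmath

def fft(x, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of
    two. With invert=True the inverse transform is computed, up to a
    final division by len(x) performed once by the caller."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], invert)
    odd = fft(x[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def fft_convolve(a, b):
    """Convolution via the convolution theorem: transform both digit
    sequences, multiply component-wise, inverse-transform, round."""
    n = 1
    while n < len(a) + len(b) - 1:   # pad to a power of two
        n *= 2
    fa = fft([complex(d) for d in a] + [0j] * (n - len(a)))
    fb = fft([complex(d) for d in b] + [0j] * (n - len(b)))
    inv = fft([x * y for x, y in zip(fa, fb)], invert=True)
    return [round((v / n).real) for v in inv[: len(a) + len(b) - 1]]
```

Applying the carry operation to the convolution returned by fft_convolve completes the multiplication exactly as in the hand-worked example.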

If the statistical problem presented to the students (in particular, correctly determining the variance of the 500 randomly generated integers) is to be solved using large-integer arithmetic, a routine for division must be written. Multiplying two large integers using the FFT method is certainly not a programming problem for introductory students; likewise, the implementation of large-integer division is not for the novice programmer. The reader is referred to the texts by Knuth (1981, pp. 295-297) and by Press, Teukolsky, Vetterling, and Flannery (1992, pp. 910-912) for discussions. Division of two large integers required a separate routine, named Reciprocal, for determining a large-integer representation of the reciprocal of the divisor. The divisor was expressed as a large integer with an implied radix point placed before its first digit and sent to function subprogram Reciprocal, which uses Newton's rule to compute the reciprocal of the divisor to a desired number of digits. If V represents the large-integer divisor and $W_i$ the current approximation to its reciprocal, then Newton's rule (Knuth, 1981, pp. 264, 295-297; Press, Teukolsky, Vetterling, & Flannery, 1992, p. 911) is $W_{i+1} = W_i(2 - V W_i)$, where $W_i \rightarrow 1/V$, and it converges quadratically.
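
The iteration is compact enough to demonstrate directly. The Python sketch below uses floating-point for clarity; the author's Reciprocal carries the same iteration out on large-integer digit arrays with an implied radix point:

```python
def reciprocal(v, iters=8):
    """Newton's rule for 1/V: W_{i+1} = W_i * (2 - V * W_i).
    Convergence is quadratic, so each pass roughly doubles the
    number of correct digits."""
    w = 10.0 ** -len(str(v))   # starting guess places v*w in [0.1, 1)
    for _ in range(iters):
        w = w * (2.0 - v * w)
    return w
```

With v = 497, the starting guess 0.001 gives v*w = 0.497, and eight iterations already drive the product v * reciprocal(v) to 1 within machine precision.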

The author wrote function Divide to compute the large-integer quotient and remainder given a large-integer dividend (U) and divisor (V). Using routines Reciprocal and Convolution, function Divide computes both the large-integer quotient Q and the large-integer remainder U - QV. For the case where both dividend and divisor comprise N digits, this division operation is O(N log N x log(log N)), where the N log N factor accounts for the FFT computation and the log(log N) factor for the determination of the reciprocal of the divisor using Newton's algorithm (Press, Teukolsky, Vetterling, & Flannery, 1992, p. 911). Function Divide returns a two-element array of type bigInteger; the first element contains the quotient and the second the remainder.
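
A toy version of the whole division, with Python's built-in integers standing in for type bigInteger, might look as follows. The correction loops guard against rounding in the floating-point reciprocal, a step the exact large-integer version performs with digit arithmetic:

```python
def divide(u, v):
    """Quotient and remainder of non-negative integers via a
    reciprocal, in the spirit of function Divide (an illustrative
    analogue, not the author's Fortran)."""
    w = 10.0 ** -len(str(v))          # initial guess: v*w lies in [0.1, 1)
    for _ in range(8):
        w = w * (2.0 - v * w)         # Newton's rule, W -> 1/V
    q = int(u * w)                    # tentative quotient Q ~ U * (1/V)
    r = u - q * v                     # tentative remainder U - QV
    while r < 0:                      # correct for floating-point rounding
        q -= 1
        r += v
    while r >= v:
        q += 1
        r -= v
    return q, r
```

For example, divide(25090091, 500) returns the quotient 50,180 and remainder 91, matching the mean reported in Table 5.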

Determination of the Variance of the 500 Randomly Generated Integers Using the Large-Integer-Arithmetic Routines

Returning to the variance of the 500 randomly generated integers, the computational formula $s^2 = \left[ n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2 \right] / \left[ n(n-1) \right]$ was used instead of $s^2 = \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right] / (n-1)$ to avoid dealing with two divisions and their associated quotient-remainder pairs. Figure 6 shows a segment of the Fortran 90/95 program used to compute the variance of the 500 randomly generated integers using the large-integer routines. Module LargeIntegerArithmeticDec (see Figure 3) contained the declarations for the derived data type bigInteger, and module LargeIntegerArithmeticLib contained the collection of large-integer-arithmetic routines. The function subprogram named MakeBigInteger was written to handle initialization of large integers and, as shown in Figure 6, two large-integer accumulator variables (Sumint and SumSQ) were initialized to zero. Recall that n = 500. With each iteration of the loop, a random integer was read from the file in which the integers were stored earlier. The random integer was converted to a large integer (BigRint), after which routine Add was called to add BigRint to the accumulator variable Sumint. Routine Convolution was called to compute the large-integer square of BigRint, which was subsequently added to the second accumulator variable SumSQ.
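
Languages with built-in arbitrary-precision integers can check such results directly. The Python sketch below is illustrative only; it replaces the author's large-integer routines with Python's native integers and evaluates the computational formula exactly, returning the variance as a quotient, remainder, and denominator:

```python
def exact_variance(xs):
    """Exact sample variance from the computational formula
    s^2 = [n*sum(x_i^2) - (sum x_i)^2] / [n(n-1)].
    Returns (quotient, remainder, denominator); the exact variance
    is quotient + remainder/denominator."""
    n = len(xs)
    s = sum(xs)                       # sum of the integers
    sq = sum(x * x for x in xs)       # sum of the squares
    numer = n * sq - s * s
    denom = n * (n - 1)
    return numer // denom, numer % denom, denom
```

For the small sample [2, 4, 4, 4, 5, 5, 7, 9] this returns (4, 32, 56), that is, a variance of exactly 4 + 32/56; applied to the 500 stored random integers it would reproduce the exact quotient and remainder reported in Table 5.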

After the sum of the 500 integers (Sumint) and the sum of the squares of the 500 integers (SumSQ) were determined, variable BigN was initialized to 500, the number of random integers used to compute the statistics. Routine Divide was then called to determine the mean of the 500 integers with the result loaded into array xbar. Finally, routines Convolution, Subt, and Divide were summoned to compute the variance. The result was stored in array var.

The results of the computations leading to the determination of the variance are listed in Table 5, where it is seen that $\sum_{i=1}^{n} x_i = 25{,}090{,}091$, $\sum_{i=1}^{n} x_i^2 = 1{,}673{,}244{,}076{,}909$, and $\frac{1}{n} \sum_{i=1}^{n} x_i = 50{,}180 + \frac{91}{500}$. The variance of the 500 random integers was computed as $830{,}097{,}683 + \frac{157{,}719}{249{,}500}$. These values represent exact solutions to the statistics problem.

SUMMARY

It is not necessary to relegate a discussion of integer and floating-point number representations in memory, and the limitations on precision those representations impose, to a second course in programming. At the very least, students should be informed about such inaccuracies as those shown in Table 1. It is possible to study looping and conditional constructs while, at the same time, presenting students with a simple problem that exposes the penalties incurred when data representations are taken for granted. Methods designed to correct these problems can be discussed in an introductory programming course. The solution may be as simple as changing the data type from integer to floating-point, or changing the data type allocation from 32-bit to 64-bit. On the other hand, in extreme cases where integers are involved, large-integer arithmetic may be necessary. Even in this case, an elementary algorithm to implement "grammar school" addition (and possibly multiplication) is not beyond strong introductory programming students. The dramatically faster multiplication and division that rely on the FFT are better suited to an advanced programming or numerical analysis course.

References

Burley, C. (1998). User notes on Fortran programming (An open cooperative practical guide). Comparison between Fortran and C (chap. 1, sect. 2). Retrieved February 19, 2006, from http://www.ibiblio.org/pub/languages/fortran/

Cambridge University High Performance Computing Facility. (2000). The case against C. Retrieved February 19, 2006, from http://www.hpcf.cam.ac.uk/C_rant.html

Cooley, J. W., & Tukey, J. W. (1965). An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90), 297-301.

Cooley, J. W. (1990). How the FFT gained acceptance. In S. G. Nash (Ed.), A history of scientific computing (pp. 133-140). Reading, MA: Addison-Wesley.

Crandall, R., & Pomerance, C. (2002). Prime numbers: A computational perspective. New York: Springer.

Decyk, V., Norton, C., & Szymanski, B. (1995). Introduction to object-oriented concepts using Fortran 90. Retrieved February 19, 2006, from http://www.cs.rpi.edu/~szymansk/OOF90/F90_Objects.html

Hayes, B. (2003). A lucid interval. American Scientist, 91(6), 484-488.

Heideman, M. T., Johnson, D. H., & Burrus, C. S. (1984). Gauss and the history of the fast Fourier transform. IEEE ASSP Magazine, 1, 14-21.

Hewlett-Packard Development Company. (2004). HP-UX floating-point guide: HP 9000 computers. Retrieved February 26, 2005, from http://docs.hp.com/en/B3906-90006/go01.html#d0e21238 and http://docs.hp.com/en/B3906-90006/go01.html#d0e21154

Knuth, D.E. (1981). The art of computer programming, seminumerical algorithms, Vol. 2 (2nd ed.). Reading, MA: Addison-Wesley.

Nyhoff, L.R., & Leestma, S.C. (1997). Fortran 90 for scientists and engineers. Upper Saddle River, NJ: Prentice-Hall.

Prentice, J. (1995). Fortran 90 for science students. Retrieved February 19, 2006, from http://www.lahey.com/PRENTICE.HTM

Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery B. P. (1992). Numerical recipes in Fortran 77 (2nd ed.). Cambridge, UK: Cambridge University Press.

Sun Microsystems, Inc. (2004). Numerical computation guide (chap. 2). Retrieved February 19, 2006, from http://docs.sun.com/source/817-5073/

WILLIAM H. RICHARDSON JR.

Francis Marion University

USA

wrichard@fmarion.edu
```
Table 1 Values computed for C=((A+B)**2-2*A*B-B**2)/A**2 using various
values for A where B = 1000.0.

Value for A Value for C (AMD Athlon) Value for C (Alpha)

1.0 1.000000000000000 1.000000000000000
0.1 1.000000000000000 -6.249999523162841
0.01 0.999999523162841 0.000000000000000
0.001 1.000007510185241 -62499.9921875000
0.0001 1.001171827316284 6250000.500000000
0.00001 0.000000000000000 0.000000000000000
0.000001 0.000000000000000 0.000000000000000
0.0000001 -11641.5322265625 0.000000000000000
0.00000001 0.000000000000000 0.000000000000000

Table 2 Statistical values for the 500 randomly generated integers.
Values were computed using 32-bit integer representation to
accumulate the sum and sum of squares of the 500 integers.

Sum of the 500 random integers 25,090,091
Sum of the squares of the 500 random integers -1,793,168,531
Mean of the 500 random integers 50180.183593750
Variance of the 500 random integers -3590757.750

Note. Unless otherwise noted, values shown represent exactly the result
stored in memory.

Table 3 Statistical values for the 500 randomly generated integers.
Values were computed using 32-bit floating-point representation to
accumulate the sum and sum of squares of the 500 integers.

Sum of the 500 random integers 25,090,084.0
Sum of the squares of the 500 random 1,673,244,966,912.0
integers
Mean of the 500 random integers 50,180.167968750
Variance of the 500 random integers 830,100,864.0

Note. Unless otherwise noted, values shown represent exactly the result
stored in memory.

Table 4 Statistical values for the 500 randomly generated integers.
Values were computed using 64-bit floating-point representation to
accumulate the sum and sum of squares of the 500 integers.

Sum of the 500 random integers 25,090,091.0
Sum of the squares of the 500 1,673,244,076,909.0
random integers
Mean of the 500 random integers 50,180.183593750
Variance of the 500 random 830,097,683.632 (approximate)
integers

Note. Unless otherwise noted, values shown represent exactly the result
stored in memory.

module LargeIntegerArithmeticDec
implicit none

integer, parameter :: lenDigitArray=300, BASE=10, &
                      LP=lenDigitArray-1, digitKind=2, &
                      MAXALPHA=600, maxIntKind=4

type bigInteger
integer(kind=digitKind),dimension(0:LP)::Digit
integer(kind=digitKind) :: ILD
logical :: Neg
end type bigInteger

end module LargeIntegerArithmeticDec

Figure 3. Fortran 90 module containing a derived data type declaration
for storing large integers

Base-ten Large Integer: 2,704,397,803,264,985

array Digit

index: LP LP-1 ... 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
value:  0    0 ...  0  2  7  0  4  3  9  7  8  0  3  2  6  4  9  8  5

ILD = 15
Neg = .false.

Figure 4. An instance of the derived data type bigInteger used to store
the base-ten large integer 2,704,397,803,264,985

program StatBigInt
use LargeIntegerArithmeticDec
use LargeIntegerArithmeticLib
implicit none

integer :: i, Rint, n
integer, parameter :: a = 1, b = 100000
real :: Rval
type(bigInteger)::SumSQ,Sumint,BigRint,BigRintSQ
type(bigInteger)::nSumSQ,sumintSumint,nn_1,BigN
type(bigInteger)::BigNumer
type(bigInteger),dimension(2) :: xbar,var
.
.
.
Sumint = MakeBigInteger(0,BASE) ! initialize big integer
SumSQ = MakeBigInteger(0,BASE) ! (accumulators) to zero

do i = 1,n
! (read the next random integer into Rint from the file)
BigRint = MakeBigInteger(Rint,BASE) !ran int to big int
Sumint = Add(Sumint,BigRint) !accumulate the sum
BigRintSQ = Convolution(BigRint,BigRint,BASE)
SumSQ = Add(SumSQ,BigRintSQ) !accumulate the sum of squares
end do

BigN = MakeBigInteger(n,BASE) !Convert integer n to big int
xbar = Divide(Sumint,BigN) !Compute the mean
nSumSQ = Convolution(BigN,SumSQ,BASE)!n*(sum of squares)
SumintSumint = Convolution(Sumint,Sumint,BASE)! Sum^2
BigNumer = Subt(nSumSQ,SumintSumint) !Compute numerator
nn_1 = Convolution &
(BigN,Subt(BigN,MakeBigInteger(1,BASE)),BASE) !n(n-1)
var = Divide(BigNumer,nn_1) !Compute the variance
.
.
.

Figure 6. Program segment showing computation of the variance of the 500
randomly generated integers using the large-integer routines

Table 5 Large-integer statistical values for the 500 randomly generated
integers. Values were determined using the large-integer subprograms for
all computations.

Sum of the 500 random integers 25,090,091
Sum of the squares of the 500 random 1,673,244,076,909
integers
Mean of the 500 random integers 50,180 + [91/500]
Variance of the 500 random integers 830,097,683 + [157,719/249,500]

Note. Unless otherwise noted, values shown represent exactly the
result stored in memory.
```