
The RISC BLAS: A Blocked Implementation of Level 3 BLAS for RISC Processors.

1. INTRODUCTION

The Level 3 BLAS are a set of computational kernels targeted at matrix-matrix operations with the aim of providing efficient and portable implementations of algorithms on high-performance computers. The linear algebra package LAPACK [Anderson et al. 1995], for example, makes extensive use of the Level 3 BLAS.

This article describes a version of single- and double-precision Level 3 BLAS computational kernels [Dongarra et al. 1990a; 1990b] called the RISC BLAS, designed to be efficient on RISC processors. It is based on the use of the matrix-matrix multiplication kernel GEMM. We show that this implementation is portable and efficient on a range of RISC-based computers.

This version of the Level 3 BLAS is an evolution of the one described by Dayde et al. [1994] for MIMD vector processors. They report on experiments on a range of computers (ALLIANT, CONVEX, IBM, and CRAY) and demonstrate the efficiency of their approach whenever a tuned version of the matrix-matrix multiplication is available. They conclude by saying that similar ideas could be used to design a tuned uniprocessor Level 3 BLAS for computers where the processor accesses data through a cache, since blocking would also be beneficial.

The availability of powerful RISC processors is of major importance in today's market, since they are used both in workstations and in the most recent parallel computers. Because of the success of RISC-based architectures, we have decided to study the design of a version of the Level 3 BLAS that is efficient on RISC processors. This tuned version of the Level 3 BLAS uses the same blocking ideas as in Dayde et al. [1994], except that the ordering of loops is designed for efficient reuse of data held in cache. Thus, all the codes are specifically tuned for RISC processors, and the software includes a tuned version of GEMM.

Our basic idea in the design of the Level 3 BLAS is to partition the computations across submatrices so that the calculations can be expressed in terms of calls to GEMM and operations involving triangular matrices. All the codes we are using are written in Fortran and are tuned using blocking, copying, and loop-unrolling. We believe these codes provide an efficient implementation of the Level 3 BLAS on computers where a highly tuned version is not available. In this article, the timings for the non-GEMM blocked kernels are for versions using our own blocked GEMM code. We note that, in cases where the vendor supplies a more efficient version of GEMM, it is trivial for us to use this in these other kernels (see Section 9). By doing so, we can often do far better than the vendor-supplied versions of these other kernels. At this time, we are very concerned with portability and so have only included a few specific tuning techniques that are crucial on some computers. Additionally, our experiments often use nonideal (critical) leading dimensions (e.g., powers of two) for the matrices involved in the calculations. On some machines the times would be better for other values. We would be happy to discuss with users and vendors the possibility of designing more highly tuned, but less portable, kernels for specific machines. We also hope to receive input and comments from users to improve this software.

The implementation of the kernels using both blocking and loop unrolling is described in Sections 3 to 8 (examples of codes are included); more details on this implementation are reported by Qrichi Aniba [1994]. We only consider the implementation of the real and the double-precision Level 3 BLAS kernels.

This implementation of the Level 3 BLAS is available on anonymous FTP, and we welcome input from users to improve and extend our BLAS implementation. More details and more experiments can be found in Dayde and Duff [1996] and Dayde and Duff [1997].

2. BLOCKED IMPLEMENTATION OF LEVEL 3 BLAS FOR RISC PROCESSORS

2.1 RISC Processors

Vector processors are commonly used in supercomputers. Recently, very fast RISC processors, which can also process vectors efficiently, have come on to the market. They are usually more efficient than vector processors on scalar applications. The main reason for their success in the marketplace is their very good cost-to-performance ratio. They are used as a CPU both in workstations and in most of the current MPPs (DEC Alpha on CRAY T3E, SPARC on CM5 and PCI CS2, HP PA on CONVEX EXEMPLAR, and RS/6000 on IBM SP1 and SP2).

We report results from uniprocessor executions on a range of RISC-based computers (in practice, we have performed experiments on a larger set of machines):

(1) CRAY T3D (1 node) located at IDRIS

(2) DEC 8400 5/300 located at RAL

(3) HP 715/64 located at ENSEEIHT

(4) IBM SP2 (1 thin node) located at CNUSC

(5) MEIKO CS2-HA (using a HyperSparc processor) located at CERFACS

(6) SGI Power Challenge 10000 using a MIPS R10000 processor located at CERFACS

(7) SUN UltraSPARC-1 model 140 located at ENSEEIHT

2.2 Efficient Exploitation of the Memory Hierarchy

The ability of the memory to supply data to the processors at a sufficient rate is crucial on most modern computers. This necessitates complex memory organizations, where the memory is usually arranged in a hierarchical manner. The minimization of data transfers between the levels of the memory hierarchy is a key issue for performance [Gallivan et al. 1987; 1988].

Most of the RISC-based architectures have a memory hierarchy involving a cache. The cache memory is used to mask the memory latency (typically the cache latency is around 1-2 clocks, while it is often 10 times higher for the memory). The code performance is high so long as the cache hit ratio is close to 100%. This may happen if the data involved in the calculations can fit in cache or if the calculations can be organized so that data can be kept in cache and efficiently reused. One of the most commonly used techniques for that purpose is called blocking, and examples of this are reported in the following sections. Blocking enhances spatial and temporal locality in computations. Unfortunately, blocking is not always sufficient, since the cache miss ratio can be dramatically increased in quite an unpredictable way by memory accesses using a stride greater than 1 [Bodin and Seznec 1994].

Some strides are often called critical because they generate a very high cache miss ratio (i.e., they cause references to cache lines that are mapped into the same physical location of the cache). These critical strides obviously depend on the cache management strategy. For example, assuming that a(i) occupies one word, that the cache line length is four words, and that the cache is initially empty, when executing the loop
do i=1,n,4
  temp = temp + a(i)
enddo


then each read of a(i) causes a cache miss.

Copying blocks of data (e.g., submatrices) that are heavily reused may help to improve memory and cache accesses (e.g., by avoiding critical strides). Since it may induce a large overhead, however, it is not always a viable technique (e.g., when the number of memory references required by the copy is of the same order as the number of flops involved in the calculation to be performed). We illustrate copying in our blocked implementation of the BLAS. Note that blocking and copying are also very useful in limiting the effect of TLB (Translation Lookaside Buffer) misses or memory paging.
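As an illustration of the copying technique, the following sketch (the routine name and interface are ours, not taken from the RISC BLAS source) gathers an NB-by-NB submatrix into a contiguous work array; subsequent sweeps over AA then run at unit stride regardless of the leading dimension of A.

      SUBROUTINE COPBLK( NB, A, LDA, I, L, AA )
*     Copy the NB-by-NB submatrix A(I:I+NB-1,L:L+NB-1) into the
*     contiguous work array AA so that later references to it are
*     unit stride and free of the (possibly critical) stride LDA.
      INTEGER          NB, LDA, I, L
      DOUBLE PRECISION A( LDA, * ), AA( NB, NB )
      INTEGER          II, LL
      DO 20 LL = 1, NB
         DO 10 II = 1, NB
            AA( II, LL ) = A( I+II-1, L+LL-1 )
   10    CONTINUE
   20 CONTINUE
      RETURN
      END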

2.3 Motivations and Design of the RISC BLAS

We have previously implemented full and sparse linear solvers on computers where a tuned version of the BLAS was not available (or was not available without cost), or where the tuning of some of the BLAS kernels was not done efficiently. Because the performance of the Level 3 BLAS is crucial to most of our work, we decided to invest some time in the tuning of the Level 3 BLAS kernels.

Our main goal when designing the RISC BLAS was to provide reasonable performance with very simple software, on a range of computers. Our interest in RISC processors arose from the fact that they are used in workstations and parallel computers where a well-tuned version of the BLAS was often not available.

We considered the blocking of the triangular solver from the Level 3 BLAS--TRSM--in Dayde and Duff [1989]. Then, we developed a blocked version of the Level 3 BLAS for MIMD vector multiprocessors [Amestoy and Dayde 1993; Dayde et al. 1994]. At the same time, we also studied the development of a parallel version of the Level 3 BLAS for Transputers [Berger et al. 1991]. Some of the ideas in the Transputer parallel version and in the blocked version for MIMD vector processors were used to design the serial and parallel versions of the Level 3 BLAS for the BBN TC2000 [Amestoy et al. 1995]. Note that the BBN TC2000 used a RISC processor: the Motorola 88100. The serial version developed for the BBN was extended and modified to be portable and efficient on a wide range of RISC processors [Dayde and Duff 1996; Qrichi Aniba 1994]. The corresponding software--Version 0--was made available in 1995 and was installed in various places. The version we refer to in the present article is Version 1.0.

The RISC BLAS differs from the blocked version of the Level 3 BLAS for MIMD vector multiprocessors in the following ways:

--The loop ordering is dictated by consideration of efficient cache reuse rather than parallel implementation.

--The codes are now tuned for RISC processors rather than for vector processors, and, additionally, we provide a tuned GEMM code.

--SYMM and SYR2K are blocked in a different way and only make use of GEMM (as described in Ling [1993] and suggested implicitly in Sheikh and Liu [1989]).

Our basic idea for efficient implementation of the BLAS on RISC processors is to express all the Level 3 BLAS kernels in terms of subkernels that either perform GEMM operations on square submatrices of order NB or perform operations involving triangular submatrices. Additionally, all the calculations in these subkernels are performed using tuned Fortran codes with loop-unrolling. Copying is occasionally used. Of course, the relative efficiency of this approach depends on the availability of a highly tuned GEMM kernel. Our approach is relatively independent of the computer: only the NB parameter, corresponding to the block size, and in some cases the loop-unrolling depth need to be set according to the characteristics of the target machine. NB is determined by the size of the cache (see Section 3.1) and the loop-unrolling depth by the number of scalar registers. The value of NB is set within the installation makefile by selecting an architecture name.

Note that Kagstrom et al. [1998a; 1998b] use similar ideas. Using their terminology, the RISC BLAS is a GEMM-based BLAS. In fact, most of the manufacturers develop BLAS in that way [Sheikh and Liu 1989].

The main differences between the RISC BLAS and the work by Kagstrom et al. are the following:

--The RISC BLAS only makes use of Level 3 BLAS operations. These operations are effected using the tuned Fortran codes included in our software. A lot of effort has been made to provide tuned building blocks (using blocking, copying, and loop-unrolling) for all the kernels including GEMM.

--The GEMM-based BLAS does not provide a tuned implementation of GEMM and relies on the one available on the computer. It is based on the use of GEMM and Level 1 and Level 2-based operations.

--The RISC BLAS is still an ongoing effort. Version 2.0 will be delivered soon: some improvements in the GEMM design (e.g., multilevel blocking) have been implemented, and complex versions of GEMM will be included.

--We incorporate some specific optimizations in our software that appear to be crucial on some processors (e.g., we have a version tuned for the SGI Power Challenge 10000), but we have not included all possible optimizations, in order to keep the code simple (our intention is not to provide the best possible implementation on a particular processor but a good one over a range of processors). We will certainly distribute separately tuned versions of the RISC BLAS for specific processors in the future.

The installation makefile provided in our software offers default options for a wide range of computers (including those used in our experiments). Basically, the user has only to select an architecture name (e.g., RS6K64 for an IBM RS6000 Power with a 64KB cache or SPARC10 for SUN workstations using a SPARC 10 processor), the corresponding organization of operations (TRIADIC or NOTRIADIC for the IBM RS6000 and the SUN, respectively) according to the recommendations in the makefile, and the compiler and linker options (we provide default options). Examples when using GEMM are included in Section 3.1. We use the C preprocessor as the main mechanism to generate the version of the RISC BLAS for a particular processor. Our software has been tested on a wide range of RISC processors running the UNIX operating system. We have not yet studied extensions for non-UNIX systems, but generating a Fortran version (without C preprocessor directives) before porting the code is straightforward.
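To illustrate how the C preprocessor selection works, the fragment below sketches one possible way the TRIADIC option could guard the innermost update; it is an illustration in the spirit of Figure 2, not an extract from the distributed software, and the routine name is ours.

      SUBROUTINE DOT2( K, A1, A2, B, T11, T21 )
*     Illustration only (not an extract from the distributed code):
*     the same two running sums written with triadic (multiply-and-
*     add) operations when TRIADIC is defined, or with the products
*     formed in temporaries first otherwise.
      INTEGER          K
      DOUBLE PRECISION A1( * ), A2( * ), B( * ), T11, T21
      INTEGER          L
      DOUBLE PRECISION B1, T1, T2
      DO 10 L = 1, K
         B1 = B( L )
#ifdef TRIADIC
         T11 = T11 + B1*A1( L )
         T21 = T21 + B1*A2( L )
#else
         T1  = B1*A1( L )
         T2  = B1*A2( L )
         T11 = T11 + T1
         T21 = T21 + T2
#endif
   10 CONTINUE
      RETURN
      END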

Modifying the software to add a new processor is extremely simple: only the block size (which depends on the cache size) has to be set using a very simple calculation rule as explained in Section 3.1. The TRIADIC organization of operations is almost always the best choice.

In the following sections, we describe the blocked implementation of the real and double-precision Level 3 BLAS: GEMM, SYMM, TRSM, TRMM, SYRK, SYR2K (all these names are prefixed by S or D depending on whether the routine is single or double precision).

For each kernel there are a number of options, e.g., whether the matrix is transposed or not. For the sake of clarity, we comment only on one of these variants of the kernels, and we illustrate our blocking strategy on matrices that are only partitioned into four blocks. In practice, the matrices are partitioned into square blocks of order NB where NB is chosen according to the machine characteristics.

3. BLOCKED IMPLEMENTATION OF GEMM

3.1 Description of the Blocked GEMM

GEMM performs one of the matrix-matrix operations

$C = \alpha\,\mathrm{op}(A)\,\mathrm{op}(B) + \beta C,$

where $\alpha$ and $\beta$ are scalars; $\mathrm{op}(A)$ and $\mathrm{op}(B)$ are rectangular matrices of dimensions $m \times k$ and $k \times n$, respectively; $C$ is an $m \times n$ matrix; and $\mathrm{op}(A)$ is $A$ or $A^T$.

We consider the following case (corresponding to op equal to "No transpose" in both cases):

\[ \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} = \alpha \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} + \beta \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} \]

DGEMM can obviously be organized as a succession of matrix-matrix multiplications on submatrices as follows:

(1) $C_{11} \leftarrow \beta C_{11} + \alpha A_{11} B_{11}$ (GEMM)

(2) $C_{11} \leftarrow C_{11} + \alpha A_{12} B_{21}$ (GEMM)

(3) $C_{12} \leftarrow \beta C_{12} + \alpha A_{11} B_{12}$ (GEMM)

(4) $C_{12} \leftarrow C_{12} + \alpha A_{12} B_{22}$ (GEMM)

(5) $C_{21} \leftarrow \beta C_{21} + \alpha A_{21} B_{11}$ (GEMM)

(6) $C_{21} \leftarrow C_{21} + \alpha A_{22} B_{21}$ (GEMM)

(7) $C_{22} \leftarrow \beta C_{22} + \alpha A_{21} B_{12}$ (GEMM)

(8) $C_{22} \leftarrow C_{22} + \alpha A_{22} B_{22}$ (GEMM)

The ordering of these eight computational steps is determined by considerations of efficient reuse of data held in cache. Because the submatrix of A is accessed by rows, it would lead to nonunit strides in the innermost loops of the calculations; it is therefore multiplied by $\alpha$ and transposed into a working array AA. AA is kept in cache as long as it is required. This is why we organize the calculations to reuse the submatrices of A as much as possible (thus amortizing the cost of this copy), and we perform all operations involving a submatrix of A before moving to another one (see Figure 1). For our simple example, this means that we perform the calculations in the order Step 1, Step 3, Step 5, Step 7, Step 2, Step 4, Step 6, and Step 8. This approach is similar to that used by Dongarra et al. [1991].
*
*     Form C := beta*C
*
      IF( BETA.EQ.ZERO )THEN
         DO 20 J = 1, N
            DO 10 I = 1, M
               C( I, J ) = ZERO
   10       CONTINUE
   20    CONTINUE
      ELSE
         DO 40 J = 1, N
            DO 30 I = 1, M
               C( I, J ) = BETA*C( I, J )
   30       CONTINUE
   40    CONTINUE
      END IF
*
*     Form C := alpha*A*B + beta*C.
*
      DO 70 L = 1, K, NB
         LB = MIN( K - L + 1, NB )
         DO 60 I = 1, M, NB
            IB = MIN( M - I + 1, NB )
*
*           Copy alpha times the transpose of the current block of A
*           into the work array AA
*
            DO 62 II = I, I + IB - 1
               DO 61 LL = L, L + LB - 1
                  AA( LL-L+1, II-I+1 ) = ALPHA*A( II, LL )
   61          CONTINUE
   62       CONTINUE
            DO 50 J = 1, N, NB
               JB = MIN( N - J + 1, NB )
*
*              Perform multiplication on submatrices
*
               IF( (MOD(IB,2).EQ.0).AND.(MOD(JB,2).EQ.0) )THEN
                  CALL DGEMML2X2_NN( IB, JB, LB, AA, NB, B(L,J), LDB,
     $                               C(I,J), LDC )
               ELSE
                  CALL DGEMML_NN( IB, JB, LB, AA, NB, B(L,J), LDB,
     $                            C(I,J), LDC )
               END IF
   50       CONTINUE
   60    CONTINUE
   70 CONTINUE


Fig. 1. Blocked code for GEMM.

The blocked code is reported in Figure 1. We use two tuned Fortran codes to perform calculations on submatrices (see Figure 2):

--DGEMML2X2_NN is a tuned code for performing matrix-matrix multiplication on square matrices of even order.

--DGEMML_NN is a tuned code that includes additional tests over DGEMML2X2_NN to handle matrices of odd order. It is occasionally slightly less efficient than DGEMML2X2_NN.
*
*     C := alpha*A*B + C.
*
      DO 70 J = 1, N, 2
         DO 60 I = 1, M, 2
            T11 = C(I,J)
            T21 = C(I+1,J)
            T12 = C(I,J+1)
            T22 = C(I+1,J+1)
            DO 50 L = 1, K
               B1 = B(L,J)
               B2 = B(L,J+1)
               A1 = A(L,I)
               A2 = A(L,I+1)
               T11 = T11 + B1*A1
               T21 = T21 + B1*A2
               T12 = T12 + B2*A1
               T22 = T22 + B2*A2
   50       CONTINUE
            C(I,J) = T11
            C(I+1,J) = T21
            C(I,J+1) = T12
            C(I+1,J+1) = T22
   60    CONTINUE
   70 CONTINUE

*
*     C := alpha*A*B + C.
*
      DO 70 J = 1, N, 2
         DO 60 I = 1, M, 2
            T11 = C(I,J)
            T21 = C(I+1,J)
            T12 = C(I,J+1)
            T22 = C(I+1,J+1)
            DO 50 L = 1, K
               B1 = B(L,J)
               B2 = B(L,J+1)
               A1 = A(L,I)
               A2 = A(L,I+1)
               T1 = B1*A1
               T2 = B1*A2
               U1 = B2*A1
               U2 = B2*A2
               T11 = T11 + T1
               T21 = T21 + T2
               T12 = T12 + U1
               T22 = T22 + U2
   50       CONTINUE
            C(I,J) = T11
            C(I+1,J) = T21
            C(I,J+1) = T12
            C(I+1,J+1) = T22
   60    CONTINUE
   70 CONTINUE


Fig. 2. Tuned code for GEMM (left: TRIADIC option, right: NOTRIADIC option).

We have used two versions for all the tuned codes:

--the TRIADIC option for computers where triadic operations are either supported in the hardware (e.g., the floating-point multiply-and-add on IBM SP2) or are efficiently compiled

--the NOTRIADIC option for other computers.

The use of triadic operations should not normally degrade the performance severely on processors that do not support these operations, since efficient code generation can transform them into dyadic operations. However, in early versions of SPARC compilers, we saw that there was sometimes such a degradation. Thus we prefer to offer both options.

The tuned code DGEMML2X2_NN using the TRIADIC option is shown on the left side of Figure 2, while the code corresponding to the NOTRIADIC option is shown on the right side. The selection between the options is effected using the C preprocessor that generates one of the two codes. All the tuned codes described in the rest of this article offer both options. In Version 1.0 of the RISC BLAS, we use a 2-by-2 and a 4-by-2 unrolling in double and single precision, respectively. In Version 2.0, the same loop-unrolling depth will be used in both precisions on 64-bit processors such as the DEC, the IBM, and the MIPS processors.

There is some freedom for selecting an appropriate block size. Since the elements of the working array AA that are used to store the transpose of the submatrix of A, and the submatrix of B, are referenced several times in the innermost computational loops (see Figure 2), NB should be chosen to guarantee that these subarrays fit in cache. Additionally, the elements of C that are updated should also fit in cache [Bodin and Seznec 1994; Gallivan et al. 1988; Hennessy and Patterson 1996; Ling 1993].

If the leading dimension of an array is critical--e.g., a multiple of a power of two--the effective space in the cache to hold a submatrix of that array may be drastically reduced (typically, several elements of the submatrix are to be stored at the same cache location, increasing the number of cache misses). It is then useful to detect these critical leading dimensions [Ling 1993; Kagstrom et al. 1998a; 1998b]. One possibility is to decrease NB in order to decrease the number of cache misses in an attempt to fit the submatrix into the cache. Copying these blocks into a working array with a favorable leading dimension is also very useful (as used in the RISC BLAS). Note that avoiding powers of 2 as dimensions of a 2D array is not always sufficient to prevent a catastrophic behavior of caches [Bodin and Seznec 1994].
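As a simple illustration (the routine and the threshold are ours, not the rule actually coded in the RISC BLAS), a leading dimension could be flagged as potentially critical when the corresponding column stride in bytes is a multiple of a sizeable power of two:

      LOGICAL FUNCTION CRITLD( LD, PREC )
*     Illustrative heuristic: flag the leading dimension LD (in
*     elements of PREC bytes each) as potentially critical when the
*     column stride in bytes is a multiple of 1024, since such strides
*     tend to map successive columns onto the same cache locations.
      INTEGER LD, PREC
      CRITLD = MOD( LD*PREC, 1024 ).EQ.0
      RETURN
      END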

Similarly to Bell [1991] and Dongarra et al. [1991], NB is selected for the RISC BLAS in a simple way: it is chosen so that all the submatrices of A, B, and C required for each submultiplication fit in the largest on-chip cache, except for the MEIKO CS2-HA because the HyperSparc only possesses an external cache on that computer. On some machines, access to off-chip caches has such low latency that we can improve performance by using a larger block size. This is true, for example, on the SGI Power Challenge. The most efficient use of multilevel cache machines is outside the scope of this article, but some multilevel blocking is implemented in Version 2.0 of our software. Additionally, in Version 2.0, the multiplication of C by $\beta$ is effected within the sectioning loops so that the submatrix of C can be reused more efficiently. The way we manage critical leading dimensions will be improved in future releases of the RISC BLAS.

Since all the computational kernels call GEMM, the block size NB is always determined as the most appropriate block size for GEMM. That is, we choose the largest even integer (even, to enhance loop-unrolling by using tuned codes such as DGEMML2X2_NN without needing additional tests to handle matrices of odd order) such that

$3\,NB^2\,prec < CS,$

where prec is the number of bytes corresponding to the precision used (four bytes for single precision and eight bytes for double precision in IEEE format), while CS is the cache size in bytes. For example, with a 64KB cache, NB is set to 52 when using 64-bit arithmetic.
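The rule can be transcribed directly; the sketch below is a literal reading of the inequality (the routine name is ours), and the installation makefile simply records the resulting value per architecture, with the exceptions noted above for machines where a larger off-chip cache is exploited.

      INTEGER FUNCTION NBSIZE( CS, PREC )
*     Largest even NB with 3*NB*NB*PREC < CS, where CS is the cache
*     size in bytes and PREC the number of bytes per word.  For
*     CS = 65536 and PREC = 8 (64KB cache, double precision) this
*     gives NB = 52.
      INTEGER CS, PREC
      INTEGER NB
      NB = INT( SQRT( REAL( CS ) / REAL( 3*PREC ) ) )
      IF( MOD( NB, 2 ).NE.0 ) NB = NB - 1
      IF( 3*NB*NB*PREC.GE.CS ) NB = NB - 2
      NBSIZE = NB
      RETURN
      END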

We show the block sizes used in our experiments in Table I. We also include the cache organization (direct-mapped or set-associative). Note that the DEC processor (DEC 21164) used on the DEC 8400 5/300 has two levels of internal cache of size equal to 8KB and 96KB, respectively, and an external cache of between 1MB and 64MB. We have tuned our codes with respect to the second level of internal cache, since our experiments show this is the most efficient. The SGI Power Challenge using the R10000 processor and the UltraSPARC-1 have external caches of size 1MB and 512KB, respectively.

Table I. Block Size Used in the Blocked BLAS on the Target Computers
                       Cache Characteristics      Block Size      Organization of    Clock    Peak Perf.
Computer               Size         Org.        Single   Double     Operations       (MHz)    (Mflop/s)

CRAY T3D                8KB        Direct         24       16        TRIADIC          150        150
DEC 8400 5/300         96KB        3-way          88       60        TRIADIC          300        600
HP 715/64              64KB        Direct         72       52        TRIADIC           64        128
IBM SP2                64KB        4-way          72       52        TRIADIC           66        264
MEIKO CS2-HA          256KB        4-way         140      100        NOTRIADIC        100        100
SGI Power 10000        32KB        Direct         42       36        TRIADIC          195        390
SUN Ultra-1 140        16KB        Direct         36       24        NOTRIADIC        143        286


Single-Precision Implementation on the IBM RS/6000 and IBM SP2. Slight modifications [Dongarra et al. 1991] allow further improvement in performance on the IBM Power and Power2 processors. The IBM FPU performs its arithmetic using 64-bit operands. As a consequence, these processors perform single-precision operations in the following way:

(1) Convert operands from single to double precision.

(2) Perform double-precision computation.

(3) Convert the double-precision result into single precision.

These conversions can be very costly and explain why the IBM is slower in single precision than in double precision. Therefore, we have slightly modified the tuned code SGEMML4X2 to convert operands within the innermost loop once only. The matrix A is copied into a double-precision working array in the blocked code, and the elements of arrays C and B are stored in double-precision temporary scalars.

We have used this data conversion only in SGEMM on the IBM, but it should be used everywhere else. Since IBM provides a tuned BLAS implementation in its scientific library, we have decided not to spend too much effort on tuning our code for the IBM.
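A simplified sketch of this device is given below; the routine name and interface are ours, and the actual SGEMML4X2 code is unrolled more aggressively. The block of A is assumed to be held already in the double-precision work array AA, and each element of B and C is converted once and accumulated in 64-bit scalars.

      SUBROUTINE SCONVK( M, K, AA, LDAA, B, C, LDC, J )
*     Simplified sketch of the single-precision device on the IBM:
*     AA already holds, in double precision, the scaled transpose of
*     a block of A; B and C elements are promoted once to 64-bit
*     scalars, the accumulation runs in double precision, and the
*     result is converted back to single precision on the store.
*     M is assumed even.
      INTEGER          M, K, LDAA, LDC, J
      DOUBLE PRECISION AA( LDAA, * )
      REAL             B( * ), C( LDC, * )
      INTEGER          I, L
      DOUBLE PRECISION T1, T2, B1
      DO 20 I = 1, M, 2
         T1 = DBLE( C( I, J ) )
         T2 = DBLE( C( I+1, J ) )
         DO 10 L = 1, K
            B1 = DBLE( B( L ) )
            T1 = T1 + B1*AA( L, I )
            T2 = T2 + B1*AA( L, I+1 )
   10    CONTINUE
         C( I, J )   = REAL( T1 )
         C( I+1, J ) = REAL( T2 )
   20 CONTINUE
      RETURN
      END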

3.2 Numerical Experiments

We show in Table II the performance achieved on CRAY T3D, DEC 8400 5/300, IBM SP2, HP 715, MEIKO CS2-HA, SGI Power Challenge, and SUN UltraSPARC-1. We also include the performance of the manufacturer-supplied library version when available (we use -lblas on the SGI and the HP, ESSL on the IBM SP2, SCILIB on the CRAY T3D, and -ldxml on the DEC 8400). We only include the single-precision experiments for the CRAY T3D, since single precision corresponds to 64-bit arithmetic on that machine (we proceed similarly for the other kernels). "Standard" in column 2 of Tables II, III, V, and VI refers to the standard Fortran version. The performance reported is the average performance achieved on a set of 4 matrix-matrix multiplications where matrices are square of order 32, 64, 96, and 128. We also report, on the SGI Power Challenge, the performance achieved using an increased block size corresponding to a cache size equal to 64KB instead of 32KB (line: blocked (64KB)).

Table II. Average Performance in Mflop/s of the Blocked Implementation of DGEMM and SGEMM on RISC Workstations (using square matrices of order 32, 64, 96, and 128)
                                          op(A), op(B)

Processor              DGEMM/SGEMM     `N', `N'   `N', `T'   `T', `N'   `T', `T'

CRAY T3D               standard            /12        /12        /15         /8
                       blocked             /49        /40        /49        /49
                       library             /90        /66        /87        /71

DEC 8400 5/300         standard          95/105     98/108      76/77      61/65
                       blocked           216/246    208/267    215/239    211/258
                       library           335/412    327/388    345/416    318/395

HP 715/64              standard           15/18      16/18      20/22      22/24
                       blocked            29/55      30/59      30/55      34/56
                       library            52/81      47/63      46/71      51/81

IBM SP2 (thin node)    standard           31/32      32/32      51/27      51/27
                       blocked           132/153    111/167    146/171    135/191
                       library           189/206    182/197    197/220    161/202

MEIKO CS2-HA           standard           39/28      36/33      30/33      27/36
                       blocked            38/60      45/58      39/69      43/78

SGI Power 10000        standard           79/91      77/95     110/12      90/108
                       blocked           152/152    197/174    206/247    225/249
                       blocked (64KB)    206/271    194/261    200/264    186/266
                       library           136/152    168/173    198/247    209/249

SUN Ultra-1 140        standard           28/42      26/42      33/54      26/45
                       blocked            67/112     64/116     67/112     63/118


Table III. Average Performance in Mflop/s of the Blocked Implementation of DGEMM (SGEMM on CRAY T3D) on RISC Workstations (where C is a square matrix of order 32, 64, 96, and 128 and inner dimension of the product, k, equal to 8 and 16)
                                                 k

Processor               GEMM (64-bit)      8          16

CRAY T3D                standard          15          15
                        blocked           37          24
                        library           73          44

HP 715/64               standard          17          16
                        blocked           23          25
                        library           38          42

IBM SP2 (thin node)     standard          33          17
                        blocked           87          62
                        library           85          78

MEIKO CS2-HA            standard          33          37
                        blocked           32          35

SGI Power 10000         standard          91          49
                        blocked           18         106
                        library           18         108

SUN Ultra-1 140         standard          32          17
                        blocked           51          31


Table V. Average Performance in Mflop/s of the Blocked Implementations of DTRSM and DTRMM (STRSM and STRMM on the CRAY T3D) on RISC Processors (using square matrices of order 32, 64, 96, and 128)
                                  TRSM: "Left"

                                    `U'      `L'      `U'      `L'
Processor              Version      `N'      `N'      `T'      `T'

CRAY T3D               standard      12       12       16       16
                       blocked       38       38       35       34
                       library       87       86       82       81

DEC 8400 5/300         standard      40       73       70       72
                       blocked      204      187      199      231
                       library      184      182      176      183

HP 715/64              standard      12       12       20       19
                       blocked       30       28       31       31
                       library       33       34       34       31

IBM SP2 (thin node)    standard      41       45       53       53
                       blocked      113      140      109      123
                       library      155      157      146      147

MEIKO CS2-HA           standard      13       12       31       30
                       blocked       50       47       43       42

SGI Power 10000        standard      67       82      118      116
                       blocked      212      212      261      254
                       library      211      211      259      252

SUN Ultra-1 140        standard      22       22       34       33
                       blocked       57       57       57       57

                                  TRMM: "Left"

                                    `U'      `L'      `U'      `L'
Processor              Version      `N'      `N'      `T'      `T'

CRAY T3D               standard      12       12       13       13
                       blocked       47       48       36       36
                       library       12       12       16       16

DEC 8400 5/300         standard      84       81       74       70
                       blocked      194      180      183      184
                       library      175      174      178      175

HP 715/64              standard      16       15       21       20
                       blocked       30       29       33       31
                       library       41       41       41       39

IBM SP2 (thin node)    standard      31       29       54       55
                       blocked      106      121      116      109
                       library      154      174      155      166

MEIKO CS2-HA           standard      13       12       31       30
                       blocked       50       47       43       42

SGI Power 10000        standard      77       74      117      116
                       blocked      228      224      242      178
                       library      228      224      242      178

SUN Ultra-1 140        standard      24       25       34       34
                       blocked       64       64       64       57


Table VI. Average Performance in Mflop/s of the Blocked Implementations of DSYMM, DSYRK, and DSYR2K (SSYMM, SSYRK, and SSYR2K on the CRAY T3D) for RISC Processors (using square matrices of order 32, 64, 96, and 128)
                                      SYMM

                                    `L'      `L'      `R'      `R'
Processor              Version      `U'      `L'      `U'      `L'

CRAY T3D               standard       4        7       11       11
                       blocked       45       45       48       48
                       library        4        7       11       11

DEC 8400 5/300         standard      87       96       91       90
                       blocked      188      183      180      183
                       library      315      310      300      300

HP 715/64              standard       4        4       16       16
                       blocked       28       28       30       31
                       library       44       45       45       46

IBM SP2 (thin node)    standard      52       51       34       34
                       blocked      116      114      113      121
                       library      156      165      141      168

MEIKO CS2-HA           standard      15       15       40       38
                       blocked       37       36       40       42

SGI Power 10000        standard      44       37       98       97
                       blocked      241      237      226      220
                       library      241      237      220      214

SUN Ultra-1 140        standard      26       25       26       26
                       blocked       61       61       62       62

                                      SYRK

                                    `U'      `L'      `U'      `L'
Processor              Version      `N'      `N'      `T'      `T'

CRAY T3D               standard      11       11       16       16
                       blocked       39       38       44       42
                       library       11       11       16       16

DEC 8400 5/300         standard      81       82       74       78
                       blocked      172      162      182      189
                       library       82       86      100       85

HP 715/64              standard      17       15       24       24
                       blocked       26       23       32       31
                       library       16       16       28       27

IBM SP2 (thin node)    standard      31       30       60       60
                       blocked      106       99      115      109
                       library      169      179      177      174

MEIKO CS2-HA           standard      20       20       37       36
                       blocked       44       44       40       40

SGI Power 10000        standard      73       70      139      137
                       blocked      192      182      192      184
                       library      198      198      199      197

SUN Ultra-1 140        standard      24       24       36       36
                       blocked       61       63       69       70

                                      SYR2K

                                    `U'      `L'      `U'      `L'
Processor              Version      `N'      `N'      `T'      `T'

CRAY T3D               standard       7        7        5        5
                       blocked       42       42       47       48
                       library        7        7        5        5

DEC 8400 5/300         standard      83       89       82       82
                       blocked      187      181      191      198
                       library      106      103      130      127

HP 715/64              standard       5        5        2        2
                       blocked       33       31       34       35
                       library        5        5        3        3

IBM SP2 (thin node)    standard      42       41       81       89
                       blocked      113      103      123      128
                       library      146      161      169      177

MEIKO CS2-HA           standard      15       17       10       10
                       blocked       44       44       46       46

SGI Power 10000        standard      40       37       40       39
                       blocked      208      204      219      215
                       library      210      216      254      251

SUN Ultra-1 140        standard      29       29       26       26
                       blocked       62       61       64       65


The blocked implementation of GEMM usually provides a gain of more than a factor of 2 over the standard Fortran code when the matrices exceed the cache size. Note that better performance can be achieved if the matrices are already located (preloaded) in the cache, which is not the case in our experiments. On the MEIKO CS2-HA, the KAP preprocessor that we use performs extremely efficient optimizations (using loop-unrolling), and, since the matrices are relatively small and fit in cache (the size of the external cache is 256KB), the standard version of DGEMM when the arrays are not transposed performs the same as our tuned version (the optimizations performed by KAP and by hand are equivalent, since blocking has no effect). On the DEC 8400, the vendor-supplied library routines perform significantly better than our blocked code, in both single and double precision, probably because better use is made of the multilevel cache. The vendor-supplied library GEMM routines on the SGI perform similarly in single precision and are slightly worse in double precision than our blocked code. However, if we increase the block size, we can improve the performance of the blocked codes by more than 60% in some cases even though the submatrices do not then fit in the on-chip caches (see SGI Power results).

We also report in Table III the average performance of DGEMM (SGEMM on the CRAY T3D) when the inner dimension of the matrix-matrix product is small (k equals 8 and 16), since it is of special interest for sparse matrix calculations [Amestoy and Duff 1989; Amestoy et al. 1995; Puglisi 1993]. We only consider the case where A and B are not transposed.

We show in Table IV the performance of the best version of DGEMM and SGEMM available to us, i.e., we use either our implementation or a tuned manufacturer-supplied version. We choose the option when both A and B are not transposed and run on square matrices of order 500 and 1000 in order to study whether we can get close to the theoretical peak performance.

Table IV. Performance in Mflop/s of the Best Available Implementation of DGEMM and SGEMM on RISC Workstations (A and B are not transposed)
                                                 Size

Processor              Version      Kernel    500    1000     Peak

CRAY T3D               library      SGEMM     102     103      150

DEC 8400 5/300         library      DGEMM     334     313      600
                                    SGEMM     431     418

HP 715/64              library      DGEMM      35      35      128
                                    SGEMM      71      68

IBM SP2 (thin node)    library      DGEMM     211     212      266
                                    SGEMM     232     234

MEIKO CS2-HA           blocked      DGEMM      49      49      100
                                    SGEMM      88      88

SGI Power 10000        blocked      DGEMM     233     231      388
                                    SGEMM     305     296

SUN Ultra-1 140        blocked      DGEMM      66      64      286
                                    SGEMM     124     122


The performance achieved by the tuned versions of GEMM is relatively far from the peak performance for all the RISC processors except the IBM SP2. On the IBM SP2, it is possible to reach peak performance by changing the leading dimensions of the matrices [Agarwal et al. 1994].

4. BLOCKED IMPLEMENTATION OF TRSM

TRSM solves one of the matrix equations

$AX = \alpha B$, $A^T X = \alpha B$, $XA = \alpha B$, or $XA^T = \alpha B$,

where $\alpha$ is a scalar; $X$ and $B$ are $m \times n$ matrices; and $A$ is a unit or nonunit, upper or lower triangular matrix. $B$ is overwritten by $X$.

We consider the following case (corresponding to the parameters "Left," "No transpose," and "Upper," i.e., we solve for AX = [Alpha] B where A is not transposed, and upper triangular):

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(1) Solution of [A.sub.22][X.sub.21] = [Alpha] [B.sub.21] and [B.sub.21], is overwritten by [X.sub.21] (TRSM)

(2) Solution of [A.sub.22][X.sub.22] = [Alpha] [B.sub.22] and [B.sub.22] is overwritten by [X.sub.22] (TRSM)

(3) [B.sub.11] [left arrow] [Alpha] [B.sub.11] - [A.sub.12][B.sub.21] (GEMM)

(4) [B.sub.12] [left arrow] [Alpha] [B.sub.12] - [A.sub.12][B.sub.22] (GEMM)

(5) Solution of [A.sub.11][X.sub.11] = [B.sub.11] and [B.sub.11], is overwritten by [X.sub.11] (TRSM)

(6) Solution of [A.sub.11][X.sub.12] = [B.sub.12] and [B.sub.12] is overwritten by [X.sub.12] (TRSM)

Therefore, TRSM can be computed as a sequence of triangular solutions (TRSM) and matrix-matrix multiplications (GEMM). The ordering of computational steps is chosen so that each submatrix $A_{ii}$ on the diagonal of A, involved in each solution step, is kept in the cache for as long as it can be used. As for GEMM, we use two distinct versions of the tuned Fortran code for the solution step: TRSML2X2 when the order of B is even and TRSML otherwise.
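For more than two blocks, the same scheme can be written as a short loop. The sketch below (for the "Left," "Upper," "No transpose," nonunit case) is only meant to show the structure: TRSMK stands for an unblocked triangular-solve kernel such as TRSML2X2/TRSML and its interface is ours, and, unlike the steps above, the scaling of B by alpha is done here in a separate initial pass rather than being folded into the first operation touching each block of B.

      SUBROUTINE BTRSM( M, N, NB, ALPHA, A, LDA, B, LDB )
*     Structural sketch of the blocked "Left", "Upper", "No transpose"
*     solve A*X = alpha*B, overwriting B by X.  TRSMK is a placeholder
*     for an unblocked triangular-solve kernel.
      INTEGER          M, N, NB, LDA, LDB
      DOUBLE PRECISION ALPHA, A( LDA, * ), B( LDB, * )
      DOUBLE PRECISION ONE
      PARAMETER        ( ONE = 1.0D+0 )
      INTEGER          I, J, IB
*     Scale B by alpha once.
      DO 20 J = 1, N
         DO 10 I = 1, M
            B( I, J ) = ALPHA*B( I, J )
   10    CONTINUE
   20 CONTINUE
*     Work backward over the block rows of B.
      DO 30 I = ( (M-1)/NB )*NB + 1, 1, -NB
         IB = MIN( M-I+1, NB )
*        Solve A(I:I+IB-1,I:I+IB-1) * X = B(I:I+IB-1,1:N)      (TRSM)
         CALL TRSMK( IB, N, A( I, I ), LDA, B( I, 1 ), LDB )
*        B(1:I-1,1:N) := B(1:I-1,1:N) - A(1:I-1,I:I+IB-1) * X  (GEMM)
         IF( I.GT.1 )
     $      CALL DGEMM( 'N', 'N', I-1, N, IB, -ONE, A( 1, I ), LDA,
     $                  B( I, 1 ), LDB, ONE, B, LDB )
   30 CONTINUE
      RETURN
      END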

As soon as the matrices are large enough compared with the block size, GEMM represents a high percentage of the total number of floating-point operations required by the blocked version of TRSM. For example, when the triangular system is "Left," assuming that NB divides m and n exactly (so that the number of blocks is $m/NB \times n/NB$), the percentage of operations spent in GEMM is equal to $1 - NB/m$. Note that the blocked versions of the other kernels behave similarly.
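This figure follows from a simple operation count (a sketch, assuming NB divides m exactly): the complete solve costs about $m^2 n$ flops, while the $m/NB$ block triangular solves cost about $(m/NB)\,NB^2\,n = m\,NB\,n$ flops, so

\[ \text{fraction in GEMM} = \frac{m^2 n - m\,NB\,n}{m^2 n} = 1 - \frac{NB}{m}. \]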

We report, in Table V, the performance achieved on our range of RISC workstations for all variants when A is unit and the system is left (the performance is similar when A is nonunit and when the system is right). The performance gain provided by the blocked implementation of DTRSM compared with the standard Fortran version is close to a factor of 3 and is more impressive than that obtained for DGEMM. In both single and double precision, our blocked code outperforms the vendor code on the DEC 8400, and would be even faster if we used calls to the vendor-supplied GEMM routines from within our blocked code.

5. BLOCKED IMPLEMENTATION OF TRMM

TRMM performs one of the matrix-matrix operations

$B = \alpha AB$, $B = \alpha A^T B$

or

$B = \alpha BA$, $B = \alpha BA^T$,

where $\alpha$ is a scalar, $B$ is an $m \times n$ matrix, and $A$ is a unit, or nonunit, upper or lower triangular matrix.

We consider the following case (corresponding to the parameters "Left," "No transpose," and "Upper," i.e., B = [Alpha] AB where A is upper triangular):

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(1) [B.sub.11] [left arrow] [Alpha] [A.sub.11][B.sub.11] (TRMM)

(2) [B.sub.11] [left arrow] [B.sub.11] + [Alpha] [A.sub.12][B.sub.21 (GEMM)

(3) [B.sub.12] [left arrow] [Alpha] [A.sub.11][B.sub.12] (TRMM)

(4) [B.sub.12] [left arrow] [B.sub.12] + [Alpha] [A.sub.12][B.sub.22] (GEMM)

(5) [B.sub.21] [left arrow] [Alpha] [A.sub.22][B.sub.21] (TRMM)

(6) [B.sub.22] [left arrow] [Alpha] [A.sub.22][B.sub.22] (TRMM)

TRMM is expressed as a sequence of GEMM and TRMM operations. The computations of the submatrices of B within the same block row are independent. The GEMM operations can be combined. We use a tuned Fortran code called TRMML2X2 for performing the multiplication of diagonal blocks of A.
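For the two-block illustration, combining the GEMM operations means that steps (2) and (4) above, which share the multiplier $A_{12}$ and read the not-yet-modified second block row of $B$, amount to a single call:

\[ \begin{bmatrix} B_{11} & B_{12} \end{bmatrix} \leftarrow \begin{bmatrix} B_{11} & B_{12} \end{bmatrix} + \alpha\, A_{12} \begin{bmatrix} B_{21} & B_{22} \end{bmatrix}. \]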

We report in Table V the performance of the blocked version of TRMM in the case where A is unit and when the system is left. Performance is similar when the system is right. Our blocked code can be seen to be usually more than twice as efficient as standard BLAS. Larger blocking on the SGI does not help much on this kernel except in single precision. On the DEC 8400, our blocked code performs consistently better than the vendor code, particularly in single precision. The blocked code would be even faster if we used the vendor-supplied GEMM routines. TRMM is obviously not tuned at all in the SCILIB library that we have access to on the CRAY T3D.

6. BLOCKED IMPLEMENTATION OF SYMM

SYMM performs one of the following matrix-matrix operations

$C = \alpha AB + \beta C$

or

$C = \alpha BA + \beta C,$

where $\alpha$ and $\beta$ are scalars; $A$ is a symmetric matrix of order $m$ in the first case and of order $n$ in the second case (only its upper or lower triangular part is referenced); and $B$ and $C$ are $m \times n$ matrices.

We consider the following case (corresponding to the parameters "Left," "Upper," i.e., C = [Alpha] AB + [Beta] C where only the upper part of A is referenced):

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(1) [C.sub.11] [left arrow] [Beta] [C.sub.11] + [Alpha] [A.sub.11][B.sub.11] (SYMM)

(2) [C.sub.12] [left arrow] [Beta] [C.sub.12] + [Alpha] [A.sub.11][B.sub.12] (SYMM)

(3) [C.sub.11] [left arrow] [C.sub.11] + [Alpha] [A.sub.12][B.sub.21] (GEMM)

(4) [C.sub.12] [left arrow] [C.sub.12] + [Alpha] [A.sub.12][B.sub.22] (GEMM)

(5) [C.sub.21] [left arrow] [Beta] [C.sub.21] + [Alpha] [A.sub.22][B.sub.21] (SYMM)

(6) [C.sub.22] [left arrow] [Beta] [C.sub.22] + [Alpha] [A.sub.22][B.sub.22] (SYMM)

(7) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(8) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Therefore, SYMM can be expressed as a sequence of SYMM and GEMM operations. The SYMM operations are used for the matrix-matrix multiplication involving the blocks $A_{ii}$ (only the upper triangular part is stored, since the submatrices are symmetric). A straightforward way of avoiding the multiplication step involving these symmetric submatrices consists of copying the submatrices $A_{ii}$ into a working array AA where both the upper and the lower triangular parts are stored, as described in Ling [1993]. Therefore, instead of using a SYMM operation for multiplications using the submatrices $A_{ii}$, we can use a GEMM operation involving AA. The additional operations are compensated for by the performance gain due to the use of GEMM.
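A sketch of this copy is shown below (the routine name is ours): the stored upper triangle of the diagonal block is expanded into a full square work array, after which an ordinary GEMM can be applied to it.

      SUBROUTINE SYCOPY( NB, A, LDA, I, AA )
*     Expand the NB-by-NB diagonal block A(I:I+NB-1,I:I+NB-1), of
*     which only the upper triangle is stored, into the full square
*     work array AA so that it can be used by GEMM.
      INTEGER          NB, LDA, I
      DOUBLE PRECISION A( LDA, * ), AA( NB, NB )
      INTEGER          II, JJ
      DO 20 JJ = 1, NB
         DO 10 II = 1, JJ
            AA( II, JJ ) = A( I+II-1, I+JJ-1 )
            AA( JJ, II ) = A( I+II-1, I+JJ-1 )
   10    CONTINUE
   20 CONTINUE
      RETURN
      END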

Using this strategy, SYMM is expressed as a sequence of GEMM operations:

(1) Copy $A_{11}$ into AA

(2) $C_{11} \leftarrow \beta C_{11} + \alpha\, AA\, B_{11}$ (GEMM)

(3) $C_{12} \leftarrow \beta C_{12} + \alpha\, AA\, B_{12}$ (GEMM)

(4) $C_{11} \leftarrow C_{11} + \alpha A_{12}B_{21}$ (GEMM)

(5) $C_{12} \leftarrow C_{12} + \alpha A_{12}B_{22}$ (GEMM)

(6) Copy $A_{22}$ into AA

(7) $C_{21} \leftarrow \beta C_{21} + \alpha\, AA\, B_{21}$ (GEMM)

(8) $C_{22} \leftarrow \beta C_{22} + \alpha\, AA\, B_{22}$ (GEMM)

(9) $C_{21} \leftarrow C_{21} + \alpha A_{12}^T B_{11}$ (GEMM)

(10) $C_{22} \leftarrow C_{22} + \alpha A_{12}^T B_{12}$ (GEMM)

The GEMM operations on block rows of C can be combined. This allows us to perform GEMM operations on longer vectors and decreases the overhead due to subroutine calls.

We present in Table VI the performance of the blocked version of SYMM, and we compare it with the performance of the standard Fortran version. We see a big improvement over standard Fortran BLAS when using our blocked version, usually a factor of 2 but occasionally nearly a factor of 8. On the SGI, our blocked code and the manufacturer-supplied version perform similarly. It seems clear that DEC has used a similar device to ours on the DEC 8400, since the performance of their library code for SYMM is close to their GEMM performance. SYMM is obviously not tuned at all in the SCILIB library that we have access to on the CRAY T3D.

7. BLOCKED IMPLEMENTATION OF SYRK

SYRK performs one of the following symmetric rank-k operations

$C = \alpha AA^T + \beta C$

or

$C = \alpha A^T A + \beta C,$

where $\alpha$ and $\beta$ are scalars; $C$ is an $n \times n$ symmetric matrix (only the upper or lower triangular part is updated); and $A$ is an $n \times k$ matrix in the first case and a $k \times n$ matrix in the second case.

We consider the following case (corresponding to "Upper," and "No transpose," i.e., we perform C = [Alpha] [AA.sup.T] + [Beta] C where only the upper triangular part of C is updated):

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(2) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(3) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(4) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(5) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(6) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

The symmetric rank-k update is expressed as a sequence of SYRK for updating the submatrices [C.sub.ii] and GEMM for the other blocks. The updates of the submatrices of C can be performed independently. The GEMM updates of off-diagonal blocks can be combined. Note that we could perform the update of the diagonal blocks of C using GEMM instead of SYRK, at the price of extra operations.

Note that it is more efficient to perform the multiplication of matrix C by $\beta$ before calling GEMM rather than performing this multiplication within GEMM. In Table VI, the performance of the standard Fortran code and of the blocked implementation is compared for all the variants. For this kernel, our gains over using standard BLAS are significant, usually by a factor of close to 2. On the SGI, the vendor code and the blocked version perform similarly. Using a larger block size on the SGI improves performance by up to 40% in double precision. Our blocked code is substantially better than the vendor kernel on the DEC 8400 and would be even faster if we used the vendor-supplied GEMM. The CRAY SCILIB code is obviously not tuned.

8. BLOCKED IMPLEMENTATION OF SYR2K

SYR2K performs one of the following symmetric rank-2k operations

$C = \alpha AB^T + \alpha BA^T + \beta C$

or

$C = \alpha A^T B + \alpha B^T A + \beta C,$

where $\alpha$ and $\beta$ are scalars; $C$ is an $n \times n$ symmetric matrix (only the upper or lower triangular part is updated); and $A$ and $B$ are $n \times k$ matrices in the first case and $k \times n$ matrices in the second case.

We consider the following case (corresponding to "Upper," and "No transpose," i.e., C = [[Alpha]AB.sup.T] + [[Alpha]BA.sup.T] + [Beta]C where only the upper triangular part of C is updated):

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(2) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(3) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(4) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(5) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(6) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(7) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(8) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

SYR2K is expressed as a sequence of SYR2K for updating the triangular submatrices $C_{ii}$, and GEMM on the other blocks. The updates of the submatrices of C can be effected simultaneously. There is no need to compute both $\alpha AB^T$ and $\alpha BA^T$ (since one is the transpose of the other). Thus, only one of the two operations is performed, and the result is stored into a working array called CC. This can be done using GEMM as described in Ling [1993].
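In other words, with $W = \alpha AB^T$ computed once by GEMM (and stored in the working array CC), the two rank-$k$ contributions are recovered as

\[ \alpha AB^T + \alpha BA^T = W + W^T, \qquad W = \alpha AB^T, \]

so only $W$ is formed explicitly.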

Using these remarks SYR2K is computed in the following way:

(1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(2) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(3) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(4) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(5) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(6) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(7) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(8) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(9) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(10) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

As for SYRK, with a larger number of blocks, the GEMM updates of the off-diagonal blocks can be combined. We report, in Table VI, the results obtained using our blocked implementation of SYR2K. Our blocked code is substantially better than the vendor kernel on the DEC 8400 and would be even faster if we used the vendor-supplied GEMM. SYR2K is not tuned on the CRAY T3D.

9. USE OF THE MANUFACTURER-SUPPLIED GEMM

In Table VII, we show the effect of using a tuned version of GEMM--the manufacturer-supplied version--within our RISC BLAS. We only show the performance of one of the variants for each Level 3 BLAS kernel. It is compared with the manufacturer-supplied library kernel.

The performance of the RISC BLAS is substantially increased when using the tuned vendor code for GEMM within our blocked version. We achieve better performance than the manufacturer-supplied library on the HP except for SYMM and TRMM. On the IBM SP2, we are still far from the ESSL performance on TRSM, TRMM, and SYRK, while we outperform the vendor code for SYMM and the single-precision SYR2K. On the CRAY T3D, except for TRSM which is tuned in the manufacturer-supplied library, we outperform the vendor codes.

10. CONCLUSION

We have described an efficient and portable implementation of the Level 3 BLAS for RISC processors.

The Level 3 BLAS are expressed as a sequence of matrix-matrix multiplications (GEMM) and operations involving triangular blocks. The combination of blocking, copying, and loop unrolling allows efficient exploitation of the memory hierarchy, and only the blocking parameter and the loop unrolling depth are machine dependent. Both the performance of GEMM and the performance of the kernel dealing with triangular matrices are crucial.

We have shown here that significant Megaflop rates can be achieved using only tuned Fortran kernels. Although our primary aim is not to outperform the vendor-supplied libraries, our portable implementation compares well with the manufacturer-supplied libraries on the IBM SP2, the HP 715/64, the DEC 8400 5/300, and the SGI Power Challenge. It is interesting that, although the vendor-supplied GEMM routines are better than our blocked version of GEMM on the DEC 8400 5/300 and the CRAY T3D, many of our blocked versions of the other kernels are better than the vendor-supplied equivalents, sometimes by a large margin. We have also demonstrated that the availability of a highly tuned version of the matrix-matrix multiplication kernel GEMM improves the performance figures of our blocked code substantially. For example, when using the manufacturer-supplied version of DGEMM within our blocked version of DTRSM, we achieve performance close to, or marginally better than, that of the DTRSM kernel available in the vendor-supplied library on the HP 715/64. The same holds for DSYMM on the IBM SP2. We suggest that some vendors could easily increase the performance of their non-GEMM Level 3 BLAS kernels by using the techniques described in this article. Finally, for some machines, performance could be enhanced by judiciously selecting appropriate leading dimensions of the matrices (e.g., avoiding powers of 2), although we do not consider this because it is dependent on the machine architecture and cache management strategy.

We demonstrated in Dayde et al. [1994] how this blocked version could be used to parallelize the Level 3 BLAS. A preliminary version was successfully used for developing both serial and parallel tuned versions of the Level 3 BLAS for a 30-node BBN-TC2000 [Amestoy et al. 1995; Dayde and Duff 1995]. We are currently experimenting on other shared and virtual shared memory machines in order to develop tuned serial and parallel implementations for them.

11. AVAILABILITY OF CODES

The codes described in this article are available by anonymous FTP from ftp.enseeiht.fr, in the directory pub/numerique/BLAS/RISC. A compressed tarfile called blas_risc.tar.Z contains the following codes:

--A set of test routines that check the correct execution and compute the Megaflop rates of the blocked implementation compared with the standard version of the Level 3 BLAS.

--The blocked implementation of the Level 3 BLAS.

We advise users to check the availability of tuned serial codes (a manufacturer-supplied library) before using our blocked implementation.

ACKNOWLEDGMENTS

We are grateful to Nick Hill of the Rutherford Appleton Laboratory for his advice on the DEC 8400, and to Andrew Cliffe of AEA Technology, Harwell, for performing the runs on the SGI Power Challenge.

We acknowledge the support of CNUSC and IDRIS for providing access to the IBM SP2 and the CRAY T3D, respectively.

REFERENCES

AGARWAL, R. C., GUSTAVSON, F. G., AND ZUBAIR, M. 1994. Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms. IBM J. Res. Dev. 38, 5 (Sept. 1994), 563-576.

AMESTOY, P. R. AND DAYDE, M. J. 1993. Tuned block implementation of Level 3 BLAS for the CONVEX C220 and RISC processors. Distributed implementation of LU factorization using PVM. Tech. Rep. RT/APO/93/3, ENSEEIHT-IRIT, Toulouse, France.

AMESTOY, P. R. AND DUFF, I. S. 1989. Vectorization of a multiprocessor multifrontal code. Int. J. Supercomput. Appl. High Perform. Eng. 3, 3, 41-59.

AMESTOY, P. R., DAYDE, M. J., DUFF, I. S., AND MORERE, P. 1995. Linear algebra calculations on a virtual shared memory computer. Int. J. High Speed Comput. 7, 1, 21-43.

ANDERSON, E., BAI, Z., BISCHOF, C., DEMMEL, J., DONGARRA, J., DU CROZ, J., GREENBAUM, A., HAMMARLING, S., MCKENNEY, A., OSTROUCHOV, S., AND SORENSEN, D. 1995. LAPACK Users' Guide. 2nd ed. SIAM, Philadelphia, PA.

BELL, R. 1991. IBM RISC System/6000 NIC tuning guide for Fortran and C. Tech. Rep. GG24-3611-01, IBM International Technical Support Centers.

BERGER, P. AND DAYDE, M.J. 1991. Implementation and use of Level 3 BLAS kernels on a Transputer T800 ring network. Tech. Rep. TR/PA/91/54, CERFACS.

BODIN, F. AND SEZNEC, A. 1994. Cache organization influence on loop blocking. Tech. Rep. 803, IRISA, Rennes, France.

DAYDE, M. J. AND DUFF, I. S. 1989. Use of Level 3 BLAS in LU factorization on the CRAY-2, the ETA 10-P, and the IBM 3090 VF. Int. J. Supercomput. Appl. High Perform. Eng. 3, 2, 40-70.

DAYDE, M. J. AND DUFF, I. S. 1995. Porting industrial codes and developing sparse linear solvers on parallel computers. Comput. Syst. Eng. 4, 5, 295-305.

DAYDE, M. J. AND DUFF, I. S. 1996. A blocked implementation of Level 3 BLAS for RISC processors. Tech. Rep. RAL-TR-96-014. Rutherford Appleton Lab., Didcot, Oxon, United Kingdom. Also ENSEEIHT-IRIT Tech. Rep. RT/APO/96/1 and CERFACS Rep. TR/PA/96/06.

DAYDE, M. J. AND DUFF, I. S. 1997. A block implementation of Level 3 BLAS for RISC processors, revised version. Tech. Rep. RT/APO/97/2, ENSEEIHT-IRIT.

DAYDE, M. J., DUFF, I. S., AND PETITET, A. 1994. A parallel block implementation of Level-3 BLAS for MIMD vector processors. ACM Trans. Math. Softw. 20, 2 (June 1994), 178-193.

DONGARRA, J. J., DU CROZ, J., HAMMARLING, S., AND DUFF, I. 1990a. Algorithm 679: A set of level 3 basic linear algebra subprograms: Model implementation and test programs. ACM Trans. Math. Softw. 16, 1 (Mar. 1990), 18-28.

DONGARRA, J. J., DU CROZ, J. J., HAMMARLING, S., AND DUFF, I. S. 1990b. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16, 1 (Mar. 1990), 1-17.

DONGARRA, J. J., MAYES, P., AND RADICATI DI BROZOLO, G. 1991. LAPACK Working Note 28: The IBM RISC System/6000 and linear algebra operations. Tech. Rep. CS-91-130. Department of Computer Science, University of Tennessee, Knoxville, TN.

GALLIVAN, K., JALBY, W., AND MEIER, U. 1987. The use of BLAS3 in linear algebra on a parallel processor with a hierarchical memory. SIAM J. Sci. Stat. Comput. 8, 6 (Nov. 1, 1987), 1079-1084.

GALLIVAN, K., JALBY, W., MEIER, U., AND SAMEH, A. H. 1988. Impact of hierarchical memory systems on linear algebra algorithm design. Int. J. Supercomput. Appl. High Perform. Eng. 2, 1, 12-48.

HENNESSY, J. L. AND PATTERSON, D.A. 1990. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA.

KAGSTROM, B., LING, P., AND VAN LOAN, C. 1998. Algorithm 784: GEMM-based level 3 BLAS: portability and optimization issues. ACM Trans. Math. Softw. 24, 3, 303-316.

KAGSTROM, B., LING, P., AND VAN LOAN, C. 1998. GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark. ACM Trans. Math. Softw. 24, 3, 268-302.

LING, P. 1993. A set of high-performance Level 3 BLAS structured and tuned for the IBM 3090 VF and implemented in Fortran 77. J. Supercomput. 7, 3 (Sept. 1993), 323-355.

PUGLISI, C. 1993. QR Factorization of large sparse overdetermined and square matrices using the multifrontal method in a multiprocessor environment. Ph.D. Dissertation.

QRICHI ANIBA, A. 1994. Implementation performante du BLAS de niveau 3 pour les processeurs RISC. Tech. Rep. Rapport 3eme Annee. Departement Informatique et Mathematiques Appliquees, ENSEEIHT, Toulouse, France.

SHEIKH, Q. AND LIU, J. 1989. Basic linear algebra subprogram optimization on the CRAY-2 system. CRAY Channels (Spring 1989).

Received: December 1996; revised: March 1998, February 1999, and April 1999; accepted: April 1999

Michel J. Dayde, ENSEEIHT-IRIT, and Iain S. Duff, CERFACS and Rutherford Appleton Laboratory

Part of this study was funded by the Conseil Régional Midi-Pyrénées under project DAE1/RECH/9308020.

Authors' addresses: M. J. Dayde, ENSEEIHT-IRIT, 2 rue Camichel, 31071 Toulouse Cedex, France; I. S. Duff, Rutherford Appleton Laboratory, Oxon OX11 0QX, England. Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
COPYRIGHT 1999 Association for Computing Machinery, Inc.
