# Blocked Algorithms and Software for Reduction of a Regular Matrix Pair to Generalized Schur Form.

1. INTRODUCTION

In this contribution, we present blocked variants of algorithms used for solving the generalized nonsymmetric eigenvalue problem (GNEP)

Ax = [Lambda] Bx,

where A and B are real n x n matrices. The methods considered start by transforming (A, B) to more condensed forms using orthogonal equivalence transformations:

(1) The reduction of (A, B) to (H, T) form, where H is upper Hessenberg and T is upper triangular.

(2) The reduction of the (H, T) pair to generalized Schur (GS) form (S, T), where S is upper quasi triangular and T is upper triangular.

A quasi triangular matrix is block upper triangular with 1 x 1 and 2 x 2 blocks on the diagonal. Given a matrix pair in GS form we can easily extract the eigenvalues of (A, B). The 2 x 2 blocks on the diagonal of S - [Lambda] T correspond to pairs of complex conjugate eigenvalues. The real eigenvalues are given by ([s.sub.ii], [t.sub.ii]) [is not equal to] (0,0), i.e., the 1 x 1 diagonal blocks of (S, T). The finite eigenvalues are [s.sub.ii]/[t.sub.ii], where [t.sub.ii] [is not equal to] 0. An infinite eigenvalue is represented as ([s.sub.ii], 0), where [s.sub.ii] [is not equal to] 0. If ([s.sub.ii], [t.sub.ii]) [is not equal to] (0,0) for all i, then (A, B) is a regular matrix pair.

The Hessenberg-triangular form is also important in its own right, for example, when solving matrix pencil systems of type (A - sB)x = b for many different values of the scalar s and possibly different right-hand sides [Enright and Serbin 1978].

The standard GNEP method for computing all eigenvalues and eigenvectors of a regular matrix pair (A, B) is based on the QZ algorithm [Moler and Stewart 1973], which is a generalization of the Francis implicit double-shift QR algorithm. As implemented in the LAPACK driver routine DGEGV, it proceeds in five steps [Anderson et al. 1995]. In the first two steps, (A, B) is reduced to upper Hessenberg-triangular form. Step 1 corresponds to reducing B to triangular form using a blocked QR factorization routine (DGEQRF). Step 2 transforms A to upper Hessenberg form, while keeping B upper triangular (DGGHRD). In the third step the GS form of the upper Hessenberg-triangular pair is computed. In the fourth step the eigenvalues of (S, T) in GS form are extracted. Steps 3 and 4 are implemented in the LAPACK routine DHGEQZ. Finally, in the fifth step the eigenvectors are computed (DTGEVC).

In this article, we focus on steps 2 and 3, since they constitute the major operations in terms of execution time.

The elementwise algorithm implemented in DGGHRD for reducing a matrix pair to (H, T) form using Givens rotations is described in Golub and Van Loan [1989]. The blocked variant of DGGHRD presented here is implemented as a two-stage algorithm. Stage 1 reduces the matrix pair to block Hessenberg-triangular form, and in stage 2 the desired Hessenberg-triangular form is computed.

In Dackland and Kagstrom [1995], several different level 3 algorithms that reduce a regular matrix pair to block upper Hessenberg-triangular form are described, implemented, and evaluated. These blocked algorithms permit reuse of data in the higher levels of a memory hierarchy, which make them run closer to peak machine speed.

A parallel version of one of the blocked variants based on Householder reflections and compact WY [Schreiber and Van Loan 1989] representation of the Householder matrices is presented in Dackland and Kagstrom [1998] and Dackland [1998]. A parallel algorithm to reduce just one matrix to block upper Hessenberg from is presented in Berry et al. [1995].

In the second stage of the reduction algorithm all but one of the subdiagonals of the block Hessenberg A-part are set to zero using Givens rotations. The algorithm proceeds in a sequence of supersweeps, each reducing m columns of H. The updates with respect to row and column rotations are organized to reference consecutive columns of H and T. To further improve the data locality, all rotations produced in a supersweep are stored to enable a left-looking reference pattern, i.e., all updates are delayed until they are required for the continuation of the supersweep.

This two-stage approach has earlier been used successfully in the reduction of a symmetric matrix to tridiagonal form [Bischof et al. 1996; Lang 1993; 1998a]. In the two-stage reduction to tridiagonal form the matrix is first reduced to a symmetric banded matrix [A.sup.b] with some semibandwidth b [is greater than] 1, and in the second stage [A.sup.b] is tridiagonalized. The reduction to banded form is a generalization of the standard elementwise algorithm [Golub and Van Loan 1989], and instead of reducing a single column a block column is reduced using QR factorizations. The second stage is mainly performed using level 2 BLAS.

The LAPACK routine DHGEQZ implements a single-/double-shift version of the QZ method [Moler and Stewart 1973]. The QZ iteration is not as well studied as the QR iteration for the standard eigenvalue problem, but the novel ideas applicable to the QR algorithm are often possible to extend to the generalized problem.

The original version of the implicit QR iteration proceeds in a sequence of sweeps along the subdiagonal of the condensed (usually Hessenberg) matrix H. To accelerate the convergence, various shifting techniques are used. The most common choice of shifts are the eigenvalues of the 2 x 2 bottom-right submatrix of H (Wilkinson shift, see Golub and Van Loan [1989]). The implicit shifting strategy creates a 2 x 2 bulge in the condensed matrix, which is chased from one end of the matrix to the other. Typically, the bulge is created at the top of the matrix and moved down the subdiagonal. It is also possible to introduce the bulge at the bottom and chase it upward. Creating bulges at both ends implies chasing in both directions, but we also need a procedure for passing bulges through each other [Watkins 1998].

A generalization of the standard technique is the multishift QR algorithm [Bai and Demmel 1989]. This algorithm computes k [is greater than] 2 simultaneous shifts and forms a bulge of size k x k. Although, increasing k entails more floating-point operations it might improve the performance, since

it enables implementations expressed in terms of matrix-matrix and matrix-vector operations (level 2 and 3 BLAS) instead of vector-vector operations (level 1 BLAS). However, it turns out that the multishift QR algorithm, due to round-off errors, requires an increasing number of iterations before the first eigenvalues deflate. The number of required iterations increases in concert with the number k; therefore, in LAPACK, k is limited to 6 [Anderson et al. 1995]. A multishift QR iteration without computing the shifts explicitly is presented in Dubrulle and Golub [1994].

A similar generalization of the QR algorithm is to apply k [is greater than] 2 shifts as k/2, pipelined, bulges of size 2 x 2 [Dubrulle 1991; Lang 1998a]. This strategy does not suffer from the convergence problem for large k as observed in the multishift algorithm [Dubrulle 1991]. The pipelined approach is well suited for parallel implementation by spacing bulges apart and, after startup of the bulge train, keep multiple processors in work, updating rows and columns. A SIMD implementation targeting a 16K processor MASPAR is discussed in Watkins [1994], and an MIMD implementation for distributed-memory architectures is presented in Henry et al. [1997].

In Lang [1998b], a related technique is described for solving the symmetric standard eigenvalue problem. The idea is to apply [n.sub.b] standard (2 x 2) implicit-shift QR iterations to the condensed tridiagonal matrix, and collect the Givens rotations in a vector before application to the transformation matrix U. Since the standard algorithm is used in each sweep the convergence properties are not affected. The gain in performance is obtained by use of level 3 BLAS in the updates of U. This is effected by first accumulating the rotations into a small matrix Q and then postmultiplying Q to consecutive columns of U.

Our implementation of the QZ iteration moves the bulge a number nb columns down the diagonals of (H, T) in Hessenberg-triangular form and stores the Householder reflections of length 3, before they are applied to (H, T) and the transformation matrices Q and Z. By restricting the updates to the number of columns of (H, T) required to proceed the bulge chasing, we improve the data locality and thereby the overall performance of the QZ iteration.

Before we proceed any further, we outline the rest of the article. Section 2 describes the two-stage reduction of a regular matrix pair (A, B) to Hessenberg-triangular form (H, T) via a block Hessenberg-triangular form ([H.sub.r], T). The algorithms are analyzed in terms of cost in floating-point operations and data reuse. In Section 3, our blocked variant of the single/double-shift QZ method for computing the generalized Schur form of a regular matrix pair in (H, T) form is presented. The blocking is done by restructuring the reference pattern of the updates associated with the bulge chasing in the QZ step. Section 4 presents performance and testing results of the implemented LAPACK-style software of our blocked algorithms and LAPACK driver routines using our blocked implementations. The results show that our new blocked variants outperform the current LAPACK routines by a factor 2-5 for sufficiently large problems. Already, for 128 x 128 matrix pairs we see a factor-of-two performance improvement for our blocked variants of DGGHRD and DHGEQZ, running on the IBM P2SC processor.

2. REDUCTION TO HESSENBERG-TRIANGULAR FORM The reduction of a regular n x n matrix pair (A, B) to Hessenberg-triangular form is performed as two major operations:

--Determine an orthogonal matrix U [element of] [R.sup.n x n] such that [U.sup.T ]B is upper triangular. To preserve the generalized eigenvalues of the matrix pair, U is also applied to A: A [left arrow] [U.sup.T]A. Now A is dense, and B is triangular.

--Determine orthogonal matrices U, V [element of] [R.sup.n x n] such that A [left arrow] [U.sup.T]AV is upper Hessenberg and such that B [left arrow] [U.sup.T]BV is upper triangular.

Blocked reductions of B to upper triangular form by premultiplication by, for example, Householder reflections are well known (e.g., see Bischof and Van Loan [1987], Schreiber and Van Loan [1989], and Anderson et al. [1995]). In the following, we assume that B is already in triangular form and focus on the second operation in the reduction procedure.

2.1 Elementwise Reduction to Hessenberg-Triangular Form The elementwise algorithm for reducing A to upper Hessenberg form while preserving B upper triangular can be expressed in terms of Givens rotations or 2 x 2 Householder reflections [Golub and Van Loan 1989]. Orthogonal transformations applied from the left annihilate elements in A but destroy the structure of B. The triangular form of B is restored by right-hand-side orthogonal transformations. We illustrate one step of the reduction using Givens rotations (A and B are of size 4 x 4):

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],

The rotation [U.sub.34] eliminates [a.sub.41] and when applied to B introduces a nonzero element [b.sub.34]. This fill-in is to set to zero by a right rotation [V.sub.34]. Since [V.sub.34] only operates on columns 3 and 4, no new nonzero elements are introduced in A. To annihilate the remaining elements below the subdiagonal, similar two-sided rotations are used. This elementwise algorithm is implemented in the LAPACK routine DGGHRD.

2.2 Two-Stage Blocked Reduction to Hessenberg-Triangular Form Our blocked variant of DGGHRD is designed as a two-stage algorithm. First A is reduced to upper r-Hessenberg form while preserving the triangularity of B. An upper r-Hessenberg matrix [H.sub.r] has r nonzero subdiagonals. If r = 1, then [H.sub.r] = H is an upper Hessenberg matrix. By an appropriate partitioning, [H.sub.r] can be viewed as an N x N block upper-Hessenberg matrix

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

where N = [n/r]; each [H.sub.ij] (i [is less than or equal to] j) is a dense r x r matrix; and the subdiagonal blocks [H.sub.i+1i] are square upper triangular. Notice that the use of the ceiling function in the definition of N only effects its value when r is not a factor of n. In this case, the last block row and block column of [H.sub.r] will have fewer than r rows and columns, respectively.

The second stage is to further reduce the matrix pair ([H.sub.r], T) to the desired Hessenberg-triangular form (H, T), i.e., to annihilate r - 1 subdiagonals of [H.sub.r] while preserving T upper triangular.

2.3 Stage 1: Blocked Reduction to Upper r-Hessenberg-Triangular Form A straightforward way to transform the described elementwise reduction algorithm to a blocked algorithm producing an upper r-Hessenberg matrix is to introduce matrix-matrix operations in the innermost loop, thereby allowing us to efficiently use level 3 BLAS in the updates from the left- and right-hand sides of A and B, and if requested also on U and V. In the annihilation phase of the elementwise algorithm, two consecutive elements in a column are referenced, and the bottom one is set to zero while the other one is updated. In the blocked variant the operations on elements are replaced by operations on r x r blocks, and the lower block is annihilated while the upper is triangularized. This is performed by a QR factorization of a rectangular 2r x r submatrix in the current block column. When the transformations are applied to B, an r x r block instead of a single element will be filled in. This fill-in is annihilated by an RQ factorization of a square block of size 2r x 2r.

The data reference pattern of the reduction of the first block column for A, B [element of] [R.sup.14x14] and a block size r = 3, is shown in Figure 1. Picture 1 illustrates the QR factorization of the black A-block (UR = [A.sub.9:14,1:3]) and the associated left updates ([A.sub.9:14,4:14] [left arrow] [U.sup.T][A.sub.9:14,4:14] and [B.sub.9:14,9:14] [left arrow] [U.sup.T][B.sub.19:14, 9:14]). Picture 2 shows the operations required to restore B to triangular form (RV = [B.sub.9:14,9:14], [A.sub.:, 9:14] [left arrow] [A.sub.:, 9:14V] and [B.sub.1:8,9:14] [left arrow] [B.sub.1:8,9:14V]). Pictures 3 and 4 illustrate in a similar way the computations required to transform [A.sub.6:11,1:3] to upper trapezoidal form and to restore [B.sub.6:11,6:11] to triangular form. Finally, in pictures 5 and 6 we illustrate the last operations performed to reduce the first block column of A. To complete the reduction, similar two-sided operations are applied to the remaining submatrix pair ([A.sub.:, 4:14], [B.sub.:, 4:14]).

[Figure 1 ILLUSTRATION OMITTED]

In the blocked variants discussed so far, the number of reduced A-blocks is limited to two per block iteration. This choice minimizes the fill-in in each block iteration. A possible generalization is to let p [is greater than or equal to] 2 blocks be involved in each step; p - 1 are annihilated, and one is triangularized. This will cause a fill-in in B of size pr x pr in each iteration, which in turn impose more computations for restoring B to triangular form. The upper limit for p is the number of blocks below the diagonal in the current block column of [H.sub.r]. We propose the use of a fixed p [is greater than or equal to] 2 for the complete reduction.

In Figure 2, we present this generalized blocked algorithm, where on input A, B [element of] [R.sup.nxn], B is upper triangular, r is the block size, and p [is greater than or equal to] 2 is the number of r x r blocks to be reduced in each block iteration. On output A is on upper r-Hessenberg form, B is upper triangular, and U, V are the orthogonal transformation matrices. Throughout this article, algorithms are presented using a pseudo-Matlab notation. Notice that in loop constructions the control variable is specified as start:step:end, which differs from Fortran notation (start:end:step).

Fig. 2. Blocked reduction to upper r-Hessenberg-triangular form (stage 1).
function [A, B, U, V] = blkHT (A, B, r, p)
k = ceil(n/r);          # blocks in the first block column.
for j = 1:r:n-r
k = max (k - 1, 2);   # blocks to reduce in current block
column j.
l = ceil((k-1)/(p-1));# steps required for the reduction.
i = n ;
for step = 1:l
nb = min(p*r, i-j-r+1);
Phase 1: Annihilation of p r x r blocks in block column
j of A.
[q, A(i-nb+1:i,j:j+r-1)] = qr(A(i-nb+1:i,j:j+r-1));
A(i-nb+1:i,j+r:n) = q'*A(i-nb+1:i,j+r:n);
B(i-nb+1:i,i-nb+1:n) = q'*B(i-nb+1:i,i-nb+1:n);
U(:,i-nb+1:i) = U(:,i-nb+1:i)*q; U = [I.sub.n] initially.
Phase 2: Restore B - annihilation of fill-in.
[z, B(i-nb+1:i,i-nb+1:i)] = rq(B(i-nb+1:i,i-nb+1:i));
A(1:n,i-nb+1:i) = A(1:n,i-nb+1:i)*z;
B(1:i-nb,i-nb+1:i) = B(1:i-nb,i-nb+1:i)*z;
V(: ,i-nb+1:i) = V(: ,i-nb+1:i)*z; V = [I.sub.n]initially.
i = i - nb + r;       Pointer for next block annihilation.
end
end

The A matrix is reduced by QR factorizations of rectangular pr x r blocks, and B is restored by RQ factorizations of square pr x pr blocks. Since the fill-ins overlap for consecutive iterations, it is possible to apply the RQ factorization to blocks of size (p - 1)r x pr in all iterations except the last one in each block column.

A parallel implementation of the first stage, including scalability analysis based on a hierarchical performance model [Dackland and Kagstrom 1996] and real experiments, is presented in Dackland and Kagstrom [1998] and Dackland [1998].

2.4 Stage 2: Blocked Reduction to Hessenberg-Triangular Form

The second stage is to annihilate the remaining r - 1 subdiagonals of [H.sub.r] to get an upper Hessenberg matrix while keeping the B-part upper triangular. We solve this problem by modifying the elementwise algorithm described in Section 2.1 to handle the reduction of a ([H.sup.r], T) pair to (H, T) form.

An unblocked algorithm for the generalized reduction is presented in Figure 3, where on input A, B [element of] [R.sup.nxn], A is upper r-Hessenberg, and B is upper triangular. On output A is upper-Hessenberg, and B is upper triangular.

Fig. 3. Unblocked reduction to upper Hessenberg-triangular form (stage 2).
function [A,B] = HrT2HT(A, B, r)
[m,n] =size(A)
for k = 1:n-2
for l = min(k+r-1, n-1):-1:k+1
[c, s] =givens (A (1:1+1 ,k))
A(1:1+1, k:n) = row_rot(A(1:1+1, k:n),c,s)
for i = 1:r:n-1
B(i:i+1,i:n) = row_rot(B(i:i+1,i:n),c,s)
[c,s]=givens(B(i+1,i:i+1))
B(1:i+1, i:i+1) = col_rot(B(1:i+1, i:i+1),c,s)
m = min(i+r+1,n)
A(1:m,i:i+1) = col_rot(A(1:m,i:i+1),c,s)
if (i+r+1 <= n)
[c,s]=givens(A(i+r:i+r+1,i))
A(i+r:i+r+1, i:n) = row_rot(1(i+r:i+r+1, i:n),c,s)
end
end
end

Algorithm HrT2HT annihilates [A.sub.l+1, k] using a Givens rotation and applies the rotation to rows l, l + 1 of A. To preserve the eigenvalues of(A, B) the rotation is applied to rows l, l + 1 of B as well. This application introduces a nonzero element [B.sub.l+1, l]. We zero this fill-in by a column rotation (applied from right), which in turn, when applied to A, introduces a new nonzero element [A.sub.l+r+1, l]. The i-loop chases the unwanted nonzero elements down the (r + 1)th subdiagonal of A and the subdiagonal of B. To complete the reduction of column k this procedure is repeated r - 1 times. Similar operations are applied to the remaining columns (k + 1, ..., n - 2) to produce the desired (H, T) form.

The main drawback of the unblocked algorithm is the memory reference pattern. Specifically, for each annihilated element in column k every rth row pair and column pair of A and B are updated, which causes extensive data traffic. In our blocked algorithm we increase the data locality in each sweep (one iteration of the k loop in HrT2HT) as follows:

(1) All r - 1 subdiagonal elements in column k are reduced before the chasing of unwanted nonzero fill-in elements starts.

(2) A supersweep that reduces m columns of A per iteration in the outer k loop is introduced.

(3) The updates of A and B are restricted to r consecutive columns at a time. We store all rotations (in vectors) to enable delayed updates with respect to Givens rotations generated for previous updated columns in the current supersweep.

In the rest of this section, we describe the blocked variant of HrT2HT more in detail.

When reducing the kth column of A, the submatrix pair (A.sub.:, k+1:n], [B.sub.:, k+1:n]) is partitioned in s = [(n - k)/r] column blocks of size n x r (the last one of size n x mod((n - k), r) when r is not a factor of n - k). Each block column pair i is further divided into square r x r upper triangular blocks denoted [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], and rectangular blocks denoted [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]:

(1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

where [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is of size (i [multiplied by] r + k) x r, and [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII) is ((i - 1) [multiplied by] r + k) x r. Notice that the zero-block is not present in block column s - 1 of [A.sub.:, k+1:n] and in block column s of [B.sub.:,k+1:n]. Moreover, the last block column of [A.sub.:, k+1:n] has neither a zero nor a triangular block, i.e., it consists of [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] only. We also remark that [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is upper trapezoidal when it has fewer than r rows.

In Figure 4, we illustrate this block partitioning when reducing the first column (k = 1) of a matrix pair of size 12 x 12, where A has r = 4 subdiagonals. The blocks labeled 5 and 9 are [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], and the blocks labeled 4, 8, and 12 correspond to [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], respectively. Similarly, the diagonal blocks 2, 6, and 10 are [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], and finally, the blocks 3, 7, and 11 are [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], respectively.

[Figure 4 ILLUSTRATION OMITTED]

We define [row.sub.i] as the set of all row eliminations (rotations) required to annihilate r - 1 elements of A, and similarly [col.sub.i] is the set of all column rotations needed to annihilate r - 1 elements of B. The [row.sub.1]-set reduces r - 1 subdiagonal elements of the kth column of A, while [row.sub.i] for i [is greater than or equal to] 2 annihilates fill-in introduced in the subdiagonal of [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. The sets [col.sub.i] zero fill-in introduced in the subdiagonals of [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. By [row.sub.1:i] and [col.sub.1:i] we denote the row and column sets 1 to i, respectively. Notice when [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] have r' [is less than] r rows that we annihilate r' - 1 elements only, and the associated rotation sets contain r' - 1 rotations.

A sweep reducing column k of A proceeds as follows:

Reduce:

The set [row.sub.1] is generated (1) and applied to [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (2), and by generating [col.sub.1] the resulting fill-in is annihilated (2). [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is updated with respect to [col.sub.1] (3). Chase:
for i= 1:s
Apply [col.sub.1] to [MATHEMATICAL  EXPRESSION
NOT REPRODUCIBLE IN ASCII]).                       (4, 8, 12)
Apply [row.sub.1:i] to [MATHEMATICAL  EXPRESSION
NOT REPRODUCIBLE IN ASCII]) (all previous).        (4, 8, 12)
if i [is less than] s
Apply [col.sub.i] to [MATHEMATICAL EXPRESSION
NOT REPRODUCIBLE IN ASCII]).                    (5, 9)
Zero fill-in of [MATHEMATICAL EXPRESSION NOT
REPRODUCIBLE IN ASCII]), i.e., generate
[row.sub.i]+1.                                  (5, 9)
Apply [row.sub.i+1] to [MATHEMATICAL EXPRESSION
NOT REPRODUCIBLE IN ASCII].                     (6, 10)
Zero fill-in of [MATHEMATICAL EXPRESSION NOT
REPRODUCIBLE IN ASCII], i.e., generate
[col.sub.i+1].                                  (6, 10)
Apply [col.sub.i+1] to [MATHEMATICAL EXPRESSION
NOT REPRODUCIBLE IN ASCII]                      (7, 11)
Apply [row.sub.1:I] to [MATHEMATICAL EXPRESSION
NOT REPRODUCIBLE IN ASCII] (all previous).      (7, 11)
end

In case of empty row or column rotation sets, no action is taken in the updates.

The block labels 1-12 in Figure 4 correspond to the order in which the blocks are accessed. The labels are also marked in the algorithm description above to identify operations performed on these blocks. For example, labels 4-7 correspond to operations performed on [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] for i = 1. The corresponding blocks are labeled 8-11 for i = 2, and so on.

In our blocked implementation, we extend the above-described procedure to allow m columns to be reduced and chased in each sweep (called supersweep). To distinguish row-sets belonging to different reduced columns of A, we use a superscript j = 1, ..., m. For example, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], denote the first row-set of each of the m columns reduced in a supersweep. Column sets belonging to a supersweep are denoted analogously.

In the reduce-part of a supersweep, the sets [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] are generated. When the m columns of A are reduced, we only perform the necessary block column updates of A and B.

The chase-part of the supersweep iteratively advances the sweeps one block column ahead in a pipelined fashion, starting with the leading block.

This procedure is illustrated in Figure 5 for a matrix pair of size 14 x 14, r = 3, and m = 2. Blocks updated with respect to [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] are colored black or gray, and blocks updated with respect to [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] are striped.

[Figure 5 ILLUSTRATION OMITTED]

Pictures 1 and 2 illustrate the reduce part of the supersweep, where [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], are generated and applied to a minimum number of block columns of A and B.

Pictures 3-9 illustrate the reference pattern of the pipelined, block, column-oriented chase-part of the algorithm, where [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], are generated and applied to appropriate block columns of A and B.

The number of updated consecutive columns of A and B in each step of the supersweep increases with m ([is greater than] 1). To find the optimal value of m several parameters must be considered, including the number of nonzero subdiagonals r, the matrix size n, and the memory hierarchy of the target architecture. We have found that m = 2 gives the best performance for the architectures considered in our computational experiments presented in Section 4.

2.5 Cost in Floating-Point Operations and Data Reuse

The number of floating-point operations (flops) required to reduce a matrix pair (A, T) to (H, T) form using an unblocked algorithm based on Givens or 2 x 2 Householder transformations is

8[n.sup.3] - 10[n.sup.2] - 26n + 28

and

32/3 [n.sup.3] - 22[n.sup.2] - 26/3n + 20,

respectively [Golub and Van Loan 1989].

Since the blocked two-stage algorithm presented here introduces more nonzero elements than its unblocked equivalent, the number of flops is also increased.

Algorithm blkHT in Figure 2 has two tuning parameters, the block size r and the number of blocks reduced per block iteration p. Notice that the resulting condensed form is different for different block sizes but is the same for different values of p and a fixed r. The total number of flops performed on A and B in blkHT is

(2) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

The cost for accumulation of the transformations in U and V is not included in this expression.

The corresponding cost in flops for an unblocked algorithm (using 2 x 2 Householder reflections) to compute the ([H.sub.r], T) form is

32/3 [n.sup.3] + (- 20r - 2)[n.sup.2] + ([8r.sup.2] - 8r - 26/3)n + 4/3 [r.sup.3] + 10[r.sup.2] + 26/3 r.

See Dackland and Kagstrom [1998] and Dackland [1998] for details about the cost in flops of the different suboperations.

In Table I we show the cost-ratios in flops between blkHT and the elementwise algorithm for r = 32 and p = 2,4,8,16, and 32. The last four columns show the flop ratios for some practical values of n and the asymptotic ratio for n = [infinity].

Table I. Extra flops for the blocked reduction to ([H.sub.r], T) form, block size r = 32
p      Flop Ratio blkHT/Elementwise

2     18[n.sup.2] - 776n - 24064
16[n.sup.2] - 451n - 2541

4    123[n.sup.3] - 9344[n.sup.2] - 75776n + 10878976
96[n.sup.3] - 5778[n.sup.2] + 71346n + 487872

8    417[n.sup.3] - 39136[n.sup.2] - 1526784n + 148701184
224[n.sup.3] - 13482[n.sup.2] + 166474n + 1138368

16   1485[n.sup.3] - 207776[n.sup.2] - 28193792n + 2266169344
480[n.sup.3] -  28890[n.sup.2] + 356730n + 2439360

32   5541[n.sup.3] - 1325344[n.sup.2]- 490374144n + 35360014336
992[n.sup.3] - 59706[n.sup.2] + 737242n + 504184

Matrix Size (n)

p    512   1024   4096   [infinity]

2   1.08   1.11   1.12      1.13

4   1.23   1.26   1.28      1.28

8   1.69   1.79   1.85      1.86

16   2.33   2.78   3.03      3.09

32    --    4.08   5.31      5.59

In addition, the cost for the reduction from ([H.sup.r], T) to (H, T) form, omitting some low-order terms, is

flp(H, T) = 6r - 6/r [n.sup.3] + 6r[n.sup.2] - 9[r.sup.2]n + 5[r.sup.3].

By adding flp([H.sub.r], T) and flp(H, T) we get the total cost for the two-stage blocked algorithm with p = 2 as

108r - 36/6r [n.sup.3] - 133r/6 [n.sup.2] - 17[r.sup.2]/2 n + 62[r.sup.3]/3.

The choice p = 2 means that besides minimizing the fill-in, the number of r x r blocks reduced per iteration in stage 1 is limited to two. The cost ratio in flops between the two-stage blocked reduction algorithm (with p = 2) and the unblocked Hessenberg-based reduction is listed in Table II. Notice that r = 1 corresponds to all operations being done in stage 1. Similarly, r = n corresponds to all operations being done in stage 2, i.e., only Givens rotations are used in the reduction to (H, T) form. The implementation of the first stage of the blocked algorithm uses the compact WY representation of the Householder vectors [Schreiber and Van Loan 1989]. The updates of the matrix pair with respect to the Householder vectors involve an extra triangular matrix multiply in each iteration. The accumulated cost in flops for these operations is 4/3[n.sup.3]- 47/6[n.sup.2] + 85/6n - 23/3, explaining an asymptotic ratio larger than one for r = 1 in Table II. The benefit is that these updates are performed with higher execution rate using level 3 BLAS [Dongarra et al. 1990; Kagstrom et al. 1998a; 1998b]. The asymptotic ratio ([is less than] 1) for r = n in Table II is due to the fact that Givens-rotation-based reductions require fewer flops than Householder-based equivalents.

From Table II we are able to derive the break-even point for r. For [is less than] 0.32n, which typically is the normal case, the blocked version of DGGHRD requires more floating-point operations than its unblocked equivalent. Despite the increase in flops for [is less than] 0.32n, the performance can be improved due to a more efficient utilization of the memory system. An optimal choice of r is dependent on several parameters, including the matrix size n and the memory system of the target architecture. However, results for different values of r presented in Section 4 show that the performance is quite insensitive to the choice of r.

Table II. Extra Flops for the Two-Stage Blocked Reduction to (H, T) Form
r      Flop Ratio Two-Stage Blocked/Elementwise

r   (108r - 36)[n.sup.3] - 133[r.sup.2][n.sup.2]
- 51[r.sup.3]n + 124[r.sup.4]
4r(16[n.sup.3] - 33[n.sup.2] - 13n + 30)

1       72[n.sup.3] - 133[n.sup.2] - 51n + 124
64[n.sup.3] - 132[n.sup.2] - 152n + 120

n                  12[n.sup.3] - 9
16[n.sup.3] - 33[n.sup.2] - 13n + 30

r   [approximately equals]
Ratio

r      (108r - 36)/64r

1          1.13

n          0.75

The unblocked implementation updates all rows and columns of A and B that are affected as soon as the transformations have been produced. For each left-hand-side update, three complete rows of A and B must be loaded into the top-level memories. Since the memory system moves data between levels in the hierarchy in entities of fixed-size blocks of consecutive column elements, there will be loads of unwanted elements. If the matrix is large these unwanted elements will migrate down the hierarchy before they are requested for computation.

In the blocked implementation, the computations are organized to operate on all elements loaded to the top-level memories and if possible reuse cached data. In stage 1, our blocked algorithm perform, mainly, level 3 (matrix-matrix) operations. Conceptually, this means, for example, that O([r.sup.3]) operations are performed on O([r.sup.2]) data, which give us the volume-to-surface effect that enables reuse of cached data. Stage 2 is organized to operate on column blocks. The effect of the reorganization of the updates is a reduction of loads and an increase in reused data.

2.6 Software Implementation

The described two-stage reduction to (H, T) form has been implemented in Fortran 77. The two stages are collected in a single routine with the same functionality as the LAPACK double-precision routine DGGHRD. The parameter list is, however, extended with an extra work array. The extra storage required in the stage I reduction to ([H.sub.r], T) form for a fixed p = 2 (used in the computational experiments discussed later) is n + [r.sup.2] + rn double-precision elements (words). Storing the scalar factors of the elementary reflectors ([Tau]) requires n double-precision elements; the triangular matrix T requires [r.sup.2]; and a workspace of size nr is required in the updates. The stage 2 reduction requires an additional workspace of size 2mn double-precision elements for storing the sines and cosines from m sweeps in a supersweep.

3. REDUCTION TO GENERALIZED SCHUR FORM

In this section, we present a blocked variant of the single-/double-shift QZ method for computing the generalized Schur form of a regular matrix pair in (H, T) form [Moler and Stewart 1973]. The blocking is done by restructuring the reference pattern of the updates associated with the bulge chasing in the QZ step.

3.1 Unblocked Reduction to Generalized Schur Form

An unblocked variant of the QZ method is implemented in the LAPACK double-precision routine DHGEQZ [Anderson et al. 1995]. The algorithm proceeds in a sequence of single or double QZ steps applied to (A, B) in (H, T) form. To accelerate the convergence, the generalized eigenvalues of the bottom right 2 x 2 block of A and B are used as (implicit) shifts. If the eigenvalues are real a single step is executed; otherwise, the double step branch of the algorithm is entered. The double steps are performed using Householder transformations while the single steps are effected using Givens rotations.

When a subdiagonal element of A or a diagonal element of B gets negligible, it is set to zero, and the problem deflates into a smaller subproblem. The monitoring and decoupling associated with deflation surrounds the so-called QZ step, which performs an orthogonal equivalence transformation [Q.sup.T](A, B)Z that includes bulge chasing of unwanted elements along the subdiagonals of A and B [Moler and Stewart 1973]. Figure 6 outlines the unblocked double-shift QZ step as described in Golub and Van Loan [1989], where on input A, B [element of] [R.sup.nxn] are in unreduced (H, T) form, i.e., A is unreduced upper Hessenberg, while B is nonsingular upper triangular.
Fig. 6. Unblocked QZ step with double implicit shift strategy.

function [A, B, Q, Z] = QZ2(A, B)
Let M = [AB.sup.-1] and compute (M - aI)(M - bI)[e.sub.1]
= [(x, y, z, 0, ..., 0).sup.T],
where a and b are the eigenvalues of M's lower
right 2 x 2 submatrix.
for k = 1 : n - 2
Find [Q.sub.k]: [Q.sub.k][[x y z].sup.T] = [[* 0 0].sup.T]
A = diag([I.sub.k-1], [Q.sub.k], [I.sub.n-k-2])A
B = diag([I.sub.k-1], [Q.sub.k], [I.sub.n-k-2])B
Find [Z.sub.k1]: [[b.sub.k+2,k] [b.sub.k+2,k+1]
[b.sub.k+2,k+3]][Z.sub.k1] = [0 0 *]
B = Bdiag([I.sub.k-1], [Z.sub.k1], [I.sub.n-k-2])
Find [Z.sub.k2]: [[b.sub.k+1,k] [b.sub.k+1,k+1]]
[Z.sub.k2] = [0 *]
B = Bdiag([I.sub.k-1], [Z.sub.k2], [I.sub.n-k-2])
x = [a.sub.k+1,k]; y = [a.sub.k+2,k];
if k < n - 2
z = [a.sub.k+3,k]
end
end
Find [Q.sub.n-1]: [Q.sub.n-1][[x y].sup.T] = [[* 0].sup.T]
and update A and B from left.
Find [Z.sub.n-1]: [b.sub.n,n-1] [b.sub.n,n]][Z.sub.n-1] = [0 *]
and update A and B from right.

All [Q.sub.k] and [Z.sub.k] matrices of size 3 x 3 (or 2 x 2) in the outlined algorithm are applied to the three (or two) associated rows and columns, respectively, of A and B as soon as they are produced. This means that when [Q.sub.k] is applied from the left-hand side, both the A and B matrices are referenced with stride n (if the algorithm is implemented in Fortran and if the matrices are allocated with leading dimension equal to matrix size). It is well known that this reference pattern leads to poor performance. However, when [Z.sub.k] is applied from the right-hand side, both A and B are referenced with stride 1 and the performance is improved due to fewer data transfers in the memory hierarchy.

The number of required floating-point operations for the outlined QZ step is 22[n.sup.2]. The additional cost for accumulating the left- and right-hand-side transformations are 8[n.sup.2] and 13[n.sup.2], respectively [Golub and Van Loan 1989].

3.2 A Blocked QZ Iteration

The blocked variant of DHGEQZ described here focuses on the QZ step. The infrastructure for computing the shifts, splitting the problem (deflation for converged eigenvalues) and the computation of the eigenvalues from the produced generalized Schur form are identical to the original LAPACK implementation [Anderson et al. 1995].

The performance gain is obtained by restructuring the reference pattern of the updates associated with the bulge chasing in the QZ step and thereby reducing the data traffic in the memory system. In the blocked variant, we collect nb - 4 Householder vectors of length 3 and delay the application to A and B, and if requested also to Q and Z, as long as possible. A similar blocking is also performed for a single QZ sweep, i.e., nb - 3 Givens rotations are collected before application. Here, we only describe the 2 x 2 case corresponding to a QZ step with double implicit shift strategy.

The procedure is illustrated in Figure 7 for A, B [element of] [R.sup.14x14] and a block size nb = 7. After the construction and application of the transformation associated with the implicit column of (M - aI)(M - bI)[e.sub.1] = [(x, y, z, 0, ..., 0).sup.T], where M = [AB.sup.-1] (see Figure 6), the blocked bulge chasing is entered. The input structure of A and B is depicted in picture 0. First the 2 x 2 bulge in A is moved from column 1 to column 4 while keeping B upper triangular. This is performed by applying a variant of the unblocked bulge-chasing algorithm (as described in Figure 6) to a submatrix pair on the block diagonal of A and B (the gray shaded blocks in pictures 0 and 1). The transformations are stored in two vectors (one for the left transformations and one for the right transformations) and applied to the striped blocks of A and B (in picture 1) to prepare for the next application of the elementwise algorithm. Next the bulge is moved to column 7, and the striped blocks are updated with respect to the transformations produced in previous steps. The antidiagonaly striped (\) blocks denote submatrices updated by the left transformations. Here we have six left transformations of size 3 x 3. Each of the updates of A and B overlap by two rows. Similarly, the diagonally striped (/) blocks denote submatrices updated by the right transformations. The bulge chasing continues in pictures 3 and 4, and the sweep is completed in picture 5, which is performed outside the block-update loop.

[Figure 7 ILLUSTRATION OMITTED]

The standard way is to annihilate the fill-in of B in each step of the chasing. We illustrate the situation in the following example which corresponds to the first chasing within the gray shaded blocks in picture 0 of Figure 7:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

where [B.sub.11] and [B.sub.33] are upper triangular, while [B.sub.22] is a full matrix. The three B-elements below the diagonal of [B.sub.22] are set to zero via two right-hand-side Householder transformations (of length 3 and 2, respectively). The result is an upper triangular B-part and a new fill-in appears in the A-part, which in turn is chased along the subdiagonal in the next step. We remark that the fill-ins in B overlap between iterations in the chasing sweep.

This fact is utilized in the current LAPACK implementation, and only the capital B-elements are set to zero. This can be done by computing z as the first column of [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], i.e., [B.sub.22]z = [e.sub.1], which is done by an LU factorization and scaling. Then an orthogonal transformation [Z.sub.2] (a 3 x 3 Householder transformation) is determined such that [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. By applying [Z.sub.2] from the right, [B.sub.22][Z.sub.2] = [[Beta]e.sub.1], i.e., the first column of [B.sub.22][Z.sub.2] is a multiple of the first unit vector. The gain in each step of the sweep is one less application of a 2 x 2 transformation.

This procedure was originally introduced in a generic bulge-chasing algorithm for the generalized eigenvalue problem [Watkins and Elsner 1994]. In a multishift strategy the gain is, of course, increased. It is, however, not clear if this approach gives the same accuracy in all situations as the more costly standard way as outlined in Figure 6. Since we want to be able to compare our blocked chasing scheme with the standard scheme, we keep the approach used in LAPACK for annihilating fill-ins in the B-part. In Figure 7, the subdiagonal B-element which is not annihilated (in the blocked 2 x 2 chasing) is marked with a black box.

Figure 8 shows a high-level description of our blocked version of the chasing in the QZ step, where on input A, B [element of] [R.sup.nxn], A is unreduced upper Hessenberg, B is nonsingular upper triangular, and nb is the block size.
Fig. 8. Blocked updating in the QZ chasing.

function [A, B] = blkQZchase(A, B, nb)
j = 1
done = false
repeat
nnb = min(nb, n-j+1)
je = j+nnb-1
if je == n
done = true
if j > 1
Update(A(2:j+2,j+4:je), ltransf) Apply left transformations
Update(B(2:j+2,j+4:je), ltransf) to A and B.
QZ2chase(A(j:je,j:je), B(j:je,j:je), ltransf, rtransf)
if j > 1
Update(A(1:j-1,j+1:je-1), rtransf) Apply right
transformations
Update(B(1:j-1,j+1:je-1), rtransf) to A and B.
j = j + nb - 4
until done

The routine denoted QZ2chase chases a 2 x 2 bulge from the top to the bottom of the input nnb x nnb diagonal submatrices. The vectors of length 3 associated with the left and right Householder transformations used to perform the chasing within the diagonal blocks of A and B are returned in the arrays ltransf and rtransf, respectively. Before the bulge chasing proceeds, the next nnb - 4 columns of A and B are updated with respect to ltransf by calling the Update routine. The right transformations in rtransf are applied to rows above the current diagonal submatrices after the bulge chasing. The Householder vectors from upcoming chasings within diagonal blocks are appended to the arrays ltransf and rtransf for delayed updates of (A, B), as well as Q and Z (if requested).

Notice that in the implementation we may apply blkQZchase to a submatrix pair of the original (A, B). If so, the left and right transformations are applied to all row and columns required to produce the requested results.

The number of floating-point operations required by the blocked QZ step is the same as for the corresponding unblocked algorithm.

3.3 Software Implementation

The described blocked QZ-chasing algorithm is implemented in Fortran 77. The routine has the same functionality and calling sequence as the LAPACK routine DHGEQZ. However, the work array is increased to 6n double-precision elements in order to hold the left- and right-hand-side transformations. For the application of the collected Householder transformations to A and B, we use the ScaLAPACK auxiliary routine DLAREF [Blackford et al. 1997; Henry et al. 1997].

4. PERFORMANCE AND RELIABILITY

The numerical tests were carried out on the IBM Scalable POWERParallel system (IBM SP) and the SGI Onyx2 system at the High Performance Computing Center North (HPC2N), Umea University, Sweden. The IBM SP system is an MIMD machine with distributed local memory (128Mb) attached to each of the 64 processors (P2SC, 120MHz). The peak machine speed per node is 480 Mflops/s. The SGI Onyx2 system is a shared-memory MIMD machine including 10 processors (R10000, 195MHz). The peak machine speed per processor is 390 Mflops/s.

All routines are implemented in Fortran 77 and compiled using optimization flags -03 -qstrict -qarch = pwr2 and -03 -mips4 for the IBM and SGI, respectively. For the LAPACK routines that are called by all timed routines (blocked and unblocked versions) we use the vendor-supplied libraries. On the IBM SP system, the Power2 ESSL library is used and on the Onyx2 system we use the lscs library.

4.1 Timing Results

The timings were performed using a modified version of the timing software for eigensystems provided by LAPACK [Anderson et al. 1994]. The modifications made extend the program to enable timing of the driver routines DGEGV and DGEGS and adjust the parameter lists to include the workspace required by the blocked routines. Moreover, the original LAPACK timing program times DGGHRD along with the routines that perform the initial reduction of B to upper triangular form, to allow comparison with the EISPACK routine QZHES. Since we are interested in the performance of the blocked version of DGGHRD compared to the unblocked LAPACK version, we exclude the routines implementing the initial reduction (QR factorization, application of transformation to A) from these timings. However, they are included in the timings for the driver routines DGEGV and DGEGS.

The timing program provides four different matrix types, which all have at least one zero, one infinite and one singular generalized eigenvalue, i.e., one pair of the diagonal elements in the generalized Schur form is (0,0). The remaining eigenvalues might be both real and complex, distributed in magnitude as follows:

(1) clustered entries 1, [Epsilon] ..., [Epsilon] with random signs;

(2) evenly spaced entries (difference between adjacent eigenvalues is constant) from 1 down to [Epsilon] with random signs;

(3) geometrically spaced entries (ratio of adjacent eigenvalues is constant) from 1 down to [Epsilon] with random signs;

(4) eigenvalues randomly chosen from the interval ([Epsilon], 1).

Here [Epsilon] is the relative the machine precision. For a comprehensive description of the timing program and matrix types see Anderson et al. [1994].

Since the timing results for the different matrix types are similar, we only present the results for one matrix type (Type 1).

Tables III-X present results for four different blocked routines and compare the execution times to their LAPACK equivalents (DGGHRD, DHGEQZ, DGEGV, and DGEGS). All the functionality of the LAPACK routines are included in the blocked variants. For a more detailed description of the computations performed, etc., see Anderson et al. [1995].

Table III. Performance Results in Seconds for the Routine DGGHRD on a Single R1000 Processor. Leading dimension (LDA) = matrix size n.
DGGHRD          U = N, V = N          U = Y, V = N

n       nb    Old    New   Ratio    Old    New   Ratio

64     8   0.03   0.05    0.60   0.04   0.06    0.67
64    16          0.04    0.75          0.05    0.80
64    32          0.04    0.75          0.04    1.00
64    64          0.03    1.00          0.04    1.00

128     8   0.22   0.30    0.73   0.27   0.35    0.77
128    16          0.25    0.88          0.30    0.90
128    32          0.22    1.00          0.27    1.00
128    64          0.23    0.96          0.26    1.04
128   128          0.21    1.05          0.24    1.13

512    16     13     15    0.87     19     20    0.95
512    32            14    0.93            18    1.06
512    64            13    1.00            17    1.12
512    80            13    1.00            17    1.12
512   128            12    1.08            15    1.27

1024    32    327    138    2.37    384    172    2.23
1024    48           138    2.37           172    2.23
1024    64           138    2.37           171    2.25
1024    80           139    2.35           180    2.13
1024   128           138    2.37           172    2.23

DGGHRD          U = Y, V = Y

n       nb    Old    New   Ratio

64     8   0.05   0.07    0.71
64    16          0.05    1.00
64    32          0.05    1.00
64    64          0.04    1.25

128     8   0.31   0.41    0.76
128    16          0.35    0.89
128    32          0.31    1.00
128    64          0.30    1.03
128   128          0.27    1.15

512    16     24     24    1.00
512    32            22    1.09
512    64            20    1.20
512    80            20    1.20
512   128            19    1.26

1024    32    433    218    1.99
1024    48           218    1.99
1024    64           218    1.99
1024    80           220    1.97
1024   128           218    1.99

Table IV. Performance Results in Seconds for the Routine DGGHRD on a Single P2SC Processor. Leading dimension (LDA) = matrix size n.
DGGHRD              U = N, V = N          U = Y, V = N

n       nb    Old    New   Ratio    Old    New   Ratio

64     8   0.02   0.04    0.50   0.02   0.05    0.40
64    16          0.03    0.67          0.04    0.50
64    32          0.03    0.67          0.03    0.67
64    64          0.02    1.00          0.02    1.00

128    16   0.37   0.24    1.54   0.41   0.27    1.52
128    32          0.21    1.76          0.25    1.64
128    64          0.18    2.06          0.22    1.86
128   128          0.13    2.85          0.17    2.41

512    16     35     15    2.33     38     18    2.11
512    32            15    2.33            18    2.11
512    64            15    2.33            18    2.11
512    80            15    2.33            18    2.11
512   128            17    2.06            19    2.00

1024    16    341    136    2.51    371    166    2.23
1024    32           118    2.89           145    2.56
1024    64           112    3.04           137    2.71
1024    80           113    3.02           137    2.71
1024   128           119    2.87           141    2.63

DGGHRD              U = Y, V = Y

n       nb    Old    New   Ratio

64     8   0.03   0.06    0.50
64    16          0.04    0.75
64    32          0.04    0.75
64    64          0.03    1.00

128    16   0.46   0.33    1.39
128    32          0.30    1.53
128    64          0.25    1.84
128   128          0.20    2.30

512    16     42     22    1.91
512    32            22    1.91
512    64            22    1.91
512    80            22    1.91
512   128            22    1.91

1024    16    400    197    2.03
1024    32           172    2.32
1024    64           162    2.47
1024    80           161    2.48
1024   128           165    2.42

Table V. Performance Results in Seconds for the Routine DHGEQZ on a Single R10000 Processor. Leading dimension (LDA) = matrix size n.
DHGEQZ           (E)                   (S)

n       nb   Old    New    Ratio   Old    New    Ratio

64     8   0.06   0.06    1.00   0.07   0.07    1.00
64    16          0.06    1.00          0.07    1.00
64    32          0.05    1.20          0.06    1.17
64    64          0.06    1.00          0.07    1.00

128     8   0.31   0.25    1.24   0.42   0.32    1.31
128    16          0.23    1.35          0.30    1.40
128    32          0.25    1.24          0.31    1.35
128    64          0.28    1.11          0.35    1.20
128   128          0.31    1.00          0.38    1.11

512    16     14    7.6    1.84     22     10    2.20
512    32           7.3    1.92            10    2.20
512    64           8.2    1.71            11    2.00
512    80           7.8    1.79            10    2.20
512   128           7.8    1.79            11    2.00

1024    16    309     69    4.48    514    103    4.99
1024    32            58    5.33            88    5.84
1024    64            68    4.54            99    5.19
1024    80            64    4.83            94    5.47
1024    80            74    4.18           105    4.90
1024   128            84    3.68           120    4.28

DHGEQZ       (S, Q, Z)

n       nb    Old    New   Ratio

64     8    0.1    0.1    1.00
64    16          0.09    1.11
64    32          0.09    1.11
64    64          0.09    1.11

128     8   0.64   0.50    1.28
128    16          0.45    1.42
128    32          0.46    1.39
128    64          0.50    1.28
128   128          0.53    1.21

512    16     38     22    1.73
512    32            21    1.81
512    64            23    1.65
512    80            21    1.81
512   128            21    1.81

1024    16    646    206    3.14
1024    32           175    3.69
1024    64           197    3.28
1024    80           182    3.55
1024    80           202    3.20
1024   128           222    2.91

Table VI. Performance Results in Seconds for the Routine DHGEQZ on a single P2SC Processor. Leading dimension (LDA) = matrix size n.
DHGEQZ            (E)                   (S)

n    nb    Old    New    Ratio   Old    New    Ratio

64     8    0.04   0.05   0.80    0.05   0.06    0.83
64    16           0.04   1.00           0.05    1.00
64    32           0.05   0.80           0.06    0.83
64    64           0.05   0.80           0.05    1.00

128     8    0.46   0.28   1.64    0.91   0.41    2.22
128    16           0.28   1.64           0.39    2.33
128    32           0.25   1.84           0.38    2.39
128    64           0.27   1.70           0.39    2.33
128   128           0.47   0.98           0.60    1.52

512     8      36    9.9   3.64      57     14    4.07
512    16             11   3.27             15    3.80
512    32             11   3.27             16    3.56
512    64             14   2.57             18    3.17
512   128             17   2.12             21    2.71

1024   16     285     69   4.13     450     99    4.54
1024   32             68   4.19             95    4.73
1024   64             76   3.75            101    4.46
1024   80             87   3.28            118    3.81
1024  128             94   3.03            123    3.66

DHGEQZ         (S, Q, Z)

n    nb    Old    New    Ratio

64     8    0.07   0.09   0.78
64    16           0.09   0.78
64    32           0.08   0.88
64    64           0.08   0.88

128     8    1.2    0.62   1.94
128    16           0.59   2.03
128    32           0.59   2.03
128    64           0.60   2.00
128   128           0.80   1.50

512     8     74      24   3.08
512    16             26   2.85
512    32             25   2.96
512    64             27   2.74
512   128             30   2.47

1024   16    588     168   3.50
1024   32            158   3.72
1024   64            163   3.61
1024   80            187   3.14
1024  128            189   3.11

Table VII. Performance Results in Seconds for the Routine DGEGV on a Single R10000 Processor. Leading dimension (LDA) = matrix size n.
DGEGV             (E)                (Q, Z)

n     nb     Old    Ne    Ratio    Old   New    Ratio

64     8    0.10   0.13   0.77    0.17   0.20   0.85
64    16    0.10   0.10   1.00    0.17   0.16   1.06
64    32    0.10   0.10   1.00    0.17   0.16   1.06
64    64    0.10   0.10   1.00    0.18   0.16   1.12

128     8    0.56   0.58   0.97     1.0   1.0    1.00
128    16    0.57   0.53   1.08     1.1   0.93   1.18
128    32    0.57   0.51   1.12     1.1   0.89   1.24
128    64    0.58   0.55   1.05     1.1   0.91   1.21
128   128    0.58   0.59   0.98     1.1   0.97   1.13

512    16      31     26   1.19      73     57   1.28
512    32      30     23   1.30      71     53   1.34
512    64      29     22   1.32      71     49   1.45
512    80      31     23   1.35      74     49   1.51
512   128      31     23   1.35      75     49   1.53

1024    16     673    200   3.37    1050    374   2.81
1024    32     666    190   3.51     998    375   2.66
1024    64     654    171   3.82     991    350   2.83
1024    80     644    165   3.90     975    334   2.92
1024   128     654    166   3.94     997    331   3.01

Table VIII. Performance Results in Seconds for the Routine DGEGV on a Single P2SC Processor. Leading dimension (LDA) = matrix size n.
DGEGV            (E)              (Q, Z)

n    nb    Old    Ne    Ratio    Old   New    Ratio

64     8   0.08   0.12   0.67    0.12   0.19   0.63
64    16   0.08   0.1    0.80    0.15   0.17   0.88
64    32   0.08   0.09   0.89    0.14   0.15   0.93
64    64   0.08   0.08   1.00    0.15   0.14   1.07

128     8   0.92   0.66   1.39     1.8    1.2   1.50
128    16   0.90   0.58   1.55     1.7    1.1   1.55
128    32   0.93   0.56   1.66     1.8    1.0   1.80
128    64   0.92   0.57   1.61     1.8    1.0   1.80
128   128   0.89   0.69   1.29     1.7    1.2   1.42

512    16     77     28   2.75     126     53   2.38
512    32     74     28   2.64     122     52   2.35
512    64     72     30   2.40     116     54   2.15
512    80     72     32   2.25     119     56   2.12
512   128     71     37   1.92     114     60   1.90

1024    16    609    222   2.74     972    413   2.35
1024    32    621    202   3.07     988    381   2.59
1024    64    637    208   3.06    1020    387   2.64
1024    80    626    209   3.00     997    381   2.62
1024   128    639    228   2.80    1020    403   2.53
1500    16   2150    655   3.28    3170   1220   2.60
1500    80   2150    487   4.41    3170    978   3.24

Table IX. Performance Results in Seconds for the Routine DGEGS on a Single R10000 Processor. Leading dimension (LDA) = matrix size n.
DGEGS            (S)                (Q, Z)

n     nb    Old    New   Ratio    Old   New    Ratio

64     8   0.11   0.14   0.61    0.16   0.18   0.89
64    16   0.11   0.11   1.00    0.15   0.15   1.00
64    32   0.11   0.11   1.00    0.16   0.14   1.14
64    64   0.11   0.11   1.00    0.16   0.14   1.14

128     8   0.67   0.63   1.06    1.0    0.93   1.08
128    16   0.65   0.55   1.18    0.97   0.81   1.20
128    32   0.64   0.55   1.16    0.95   0.79   1.20
128    64   0.68   0.61   1.11    1.0    0.85   1.18
128   128   0.66   0.61   1.08    0.99   0.83   1.19

512    16     38     28   1.36      68     51   1.33
512    32     38     27   1.41      68     49   1.39
512    64     39     25   1.56      71     46   1.54
512    80     38     26   1.46      69     47   1.47
512   128     38     26   1.46      69     46   1.50

1024    16    970    327   2.97    1250    542   2.31
1024    32    919    271   3.39    1180    460   2.57
1024    64    950    261   3.64    1220    439   2.78
1024    80    900    275   3.27    1160    462   2.51
1024   128    909    276   3.29    1170    461   2.54

Table X. Performance Results in Seconds for the Routine DGEGS on a Single P2SC Processor. Leading dimension (LDA) -- matrix size n.
DGEGS           (S)                   (Q, Z)

n     nb    Old    New   Ratio    Old   New    Ratio

64     8   0.08   0.09   0.89    0.11   0.16   0.69
64    16   0.07   0.11   0.64    0.11   0.14   0.79
64    32   0.07   0.08   0.88    0.11   0.13   0.85
64    64   0.08   0.08   1.00    0.11   0.12   0.92

128     8    1.2   0.74   1.62     1.6   1.1    1.45
128    16    1.2   0.66   1.82     1.6   0.98   1.63
128    32    1.3   0.61   2.13     1.7   0.93   1.83
128    64    1.3   0.59   2.20     1.6   0.90   1.78
128   128    1.4   0.79   1.77     1.8   1.1    1.64

512    16     97     32   3.03     120     50   2.40
512    32     96     30   3.20     118     47   2.51
512    64     88     35   2.51     108     52   2.08
512    80     93     37   2.51     114     55   2.07
512   128     94     40   2.35     116     55   2.11

1024    16    776    242   3.21     965    375   2.57
1024    32    805    221   3.64    1000    340   2.94
1024    64    777    223   3.48     966    336   2.88
1024    80    807    239   3.38    1000    361   2.77
1024   128    797    251   3.18     993    366   2.71

All the tables presented here come in pairs, namely III and IV; V and VI; VII and VIII; and IX and X. The first table in a pair shows the results from the SGI Onyx2, and the second table shows the results for the IBM SP. In both cases, only single-processor performance is presented.

Tables III and IV show the results for DGGHRD. Columns 3-5, with heading U = N, V = N, show the timings when only the (H, T) form is computed for the LAPACK and blocked implementations, denoted old and new, respectively, and their ratios. Columns 6-8, with heading U = Y, V = N, show the similar results when also the left-hand-side transformations are accumulated. The last three columns present the timings when both the right- and left-hand-side transformations are accumulated. All results presented correspond to fixed values of p and m (p = 2 in stage 1, m = 2 in stage 2), and the block size r = nb in stage 1 of the reduction. Notice that the LAPACK implementation is unblocked, and therefore only one result per matrix size is displayed. The ratios between the timings for the LAPACK and the blocked implementations are computed with respect to this number for different block sizes.

Tables V and VI show the results for DHGEQZ. Columns 3-5, denoted (E), show the timings when only the eigenvalues are computed for the LAPACK and blocked implementations, respectively. Columns 6-8, denoted (S), show the results when also the generalized Schur form of A and B is computed. The last three columns present the timings when the generalized Schur form as well as the right- and left-hand-side transformations are accumulated. Since the LAPACK implementation is unblocked only one result per matrix size is displayed, and the timing ratios between the unblocked and blocked variants are computed with respect to this number for different block sizes nb.

Tables VII and VIII show the results for DGEGV. Columns 3-5, denoted (E), show the timings when only the generalized eigenvalues are computed for the LAPACK and blocked implementations, respectively. Columns 6-8, with heading (Q, Z), display the results when also the left and right eigenvectors are computed.

Tables IX and X show the results for DGEGS. Columns 3-5, denoted (S), show the timings when the generalized eigenvalues and the real generalized Schur form are computed for the LAPACK and blocked implementations, respectively. Columns 6-8, denoted (Q, Z), display the results when also the left and right generalized Schur vectors are computed.

Since the nonsymmetric generalized eigenvalue LAPACK driver routines DGEGV and DGEGS include the initial blocked reduction of B to upper triangular form we get timings for different block sizes for the original LAPACK routines as well. All routines called by the drivers use the same block size nb.

The results show large improvements in performance of the blocked variants compared to the current unblocked routines implemented in LAPACK. Typically, the performance improvements increase with increasing problem sizes. For example, when computing the transformation matrices involved in the reductions we see for matrices of size 1024 x 1024 up to 250%, 375%, 325%, and 300% improvements for our blocked routines DHGEQZ, DGGHRD, DGEGV, and DGEGS, respectively. Moreover, we see an additional performance improvement if only the reductions of (A, B) are computed. The best performance improvements of the four routines are 223%, 584%, 394%, and 364% (running on an R10000 processor), and 271%, 473%, 441%, and 364% (running on a P2SC processor). As expected, the improvements are not so dramatic for small-sized problems, but we observe similar performance or performance gains even for 64 x 64 matrices for some block sizes in all cases except for one routine (DHGEQZ running on an IBM P2SC processor). Notably, we see over 200% performance improvement already for 128 x 128 matrices for the blocked variants of DGGHRD and DHGEQZ, running on the P2SC processor.

We use the same block size (r = nb) in all routines. The timing results of the reported block sizes differ up to around 25%, but in some cases the performance is quite insensitive to the choice of block size, e.g., see Tables III and IV. This is partly explained by the performance of the two stages in the Hessenberg-triangular reduction. A large block size guarantees high performance of the level 3 BLAS operations in stage 1. On the other hand, the fraction of the second stage is enlarged and performed using lower-level kernels. For a smaller block size we reduce the sizes of the level 3 operations as well as the amount of the slower lower-level operations. Moreover, we use an inner-level blocking in the updates of A, B and the transformation matrices that reduces the impact of the choice of the outer-level block size nb.

In general, the uniprocessor performance results on the SGI Onyx2 system for matrix sizes 128 and 512 are better than the corresponding IBM SP numbers. At a first glance, this may look counterintuitive, since the theoretical peak performance of the P2SC processor is higher. However, these results can be explained by the difference in the memory hierarchy design of the two systems. Each node on the SP system has an on-chip data cache of size 128KB. The Onyx2 system has a smaller on-chip data cache (32KB) and is equipped with a unified second-level off-chip cache of size 4MB. For small matrices (n = 64), the SP system is able to hold the required matrices in the on-chip cache and thereby gain from the higher peak performance. For moderately sized matrices (n = 128-512), the Onyx2 system benefits from the second-level cache. As expected, the SP system show the best performance for large-enough matrices (n [is greater than or equal to] 1024).

4.2 Testing

The blocked variants of the routines DGGHRD and DHGEQZ have been tested using a modified version of the test program for generalized nonsymmetric eigenvalue routines provided by LAPACK. The modifications made concern the extra work arrays required by the blocked algorithms.

The test program generates up to 26 different types of test matrix pairs, where the problem sizes are user-specified. The matrix types include dense and sparse matrices with entries of different magnitude. Except for matrices with random entries, all the matrix pairs include at least one zero, one infinite and one singular generalized eigenvalue.

To verify the correctness of the computations of the generalized Schur form, eigenvectors, and eigenvalues, 12 test ratios are computed. These ratios represent different relative residuals of computed quantities. For a comprehensive description of the testing program, matrix types, and computed ratios see Anderson et al. [1994].

Tests have been performed for all 26 different matrix types for different matrix and block sizes. The results for the unblocked LAPACK and new blocked routines are qualitatively the same.

ACKNOWLEDGMENTS

We would like to thank Zhaojun Bai and Per Ling for valuable discussions about the LAPACK implementation of the QZ method and matrix blocking techniques for memory hierarchies, respectively. We are also grateful to John Reid for several constructive comments. Moreover, we acknowledge the High Performance Computing Center North (HPC2N) for making the IBM SP and SGI Onyx2 systems available for our computational experiments.

REFERENCES

ANDERSON, E., BAI, Z., BISCHOF, C. H., DEMMEL, J., DONGARRA, J. J., Du CROZ, J., GREENBAUM, A., HAMMARLING, S., MCKENNEY, A., OSTROUCHOV, S., AND SORENSEN, D. C. 1995. LAPACK user's guide. SIAM, Philadelphia, PA.

ANDERSON, E., DONGARRA, J., AND OSTROUCHOV, S. 1994. Lapack Working Note 41: Installation guide for LAPACK. Tech. Rep.

BAI, Z. AND DEMMEL, Z. 1989. On a block implementation of Hessenberg multishift QR iteration. Int. J. High Speed Comput. 1, 1 (May 1989), 97-112. Also Lapack Working Note 8.

BERRY, M. W., DONGARRA, J. J., AND KIM, Y. 1995. A parallel algorithm for the reduction of a nonsymmetric matrix to block upper-Hessenberg form. Parallel Comput. 21, 8 (Aug. 1995), 1189-1211.

BISCHOF, C. AND VAN LOAN, C. 1987. The WY representation for products of Householder matrices. SIAM J. Sci. Stat. Comput. 8, 1 (Jan. 2, 1987), 2-13.

BISCHOF, C. H., LANG, B., AND SUN, X. 1996. A framework for symmetric band reduction. Tech. Rep., Preprint BUGHW-SC-96/3. Fachbereich Mathematik, Bergische Universitat, Wuppertal, Germany.

BLACKFORD, L. S., CHOI, J., CLEARLY, A., D'AZEVEDO, E., DEMMEL, J., DHILLON, I., DONGARRA, J., HAMMARLING, S., HENRY, G., PETITET, A., STANLEY, K., WALKER, D., AND WHALEY, R. C. 1997. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA. http://www.netlib.org

DACKLAND, K. 1998. Parallel reduction of a regular matrix pair to block Hessenberg-triangular form--Algorithm design and performance modeling. Tech. Rep. UMINF-98.09. Department of Computing Science, Umea University, Umea, Sweden.

DACKLAND, K. AND KAGSTROM, B. 1995. Reduction of a regular matrix pair (A, B) to block Hessenberg-Triangular form. In Applied Parallel Computing: Computations in Physics, Chemistry and Engineering Science (PARA '95), J. Dongarra, K. Madsen, and J. Wasniewski, Eds. Lecture Notes in Computer Science, vol. 1041. Springer-Verlag, Vienna, Austria, 125-133.

DACKLAND, K. AND KAGSTROM, B. 1996. An hierarchical approach for performance analysis of ScaLAPACK-based routines using the distributed linear algebra machine. In Applied Parallel Computing: Industrial Computation and Optimization (PARA '96), J. Wasniewski, J. Dongarra, and K. Madsen, Eds. Lecture Notes in Computer Science, vol. 1184. Springer-Verlag, Vienna, Austria, 187-195.

DACKLAND, K. AND KAGSTROM, B. 1998. A ScaLAPACK-style algorithm for reducing a regular matrix pair to block Hessenberg-Triangular form. In Applied Parallel Computing: Large-Scale Scientific and Industrial Problems (PARA '98), B. Kagstrom, J. Dongarra, E. Elmroth, and J. Wasniewski, Eds. Lecture Notes in Computer Science, vol. 1541. Springer-Verlag, Vienna, Austria, 95-103.

DONGARRA, J. J., Du CROZ, J., HAMMARLING, S., AND DUFF, I. 1990. A set of level 3 basic linear algebra subprograms: Model implementation and test programs. ACM Trans. Math. Softw. 16, 1 (Mar.), 18-28.

DUBRULLE, A. A. 1991. The multishift QR algorithm: Is it worth the trouble?. Tech. Rep. G320-3558+. Palo Alto Scientific Center, IBM Corp., Palo Alto, CA.

DUBRULLE, A. A. AND GOLUB, G. H. 1994. A multishift QR algorithm without computation of the shifts. Num. Alg. 7, 173-181.

ENRIGHT, W. AND SERBIN, S. 1978. A note on the efficient solution of matrix pencil systems. BIT 18, 276-281.

GOLUB, G. AND VAN LOAN, C. F. 1989. Matrix Computations. 2nd ed. Johns Hopkins University Press, Baltimore, MD.

HENRY, G., WATKINS, D., AND DONGARRA, J. 1997. A parallel implementation of the nonsymmetric QR algorithm for distributed memory architecture. Tech. Rep. CS-97-352. Also Lapack Working Note 121. Department of Computer Science, University of Tennessee, Knoxville, TN.

KAGSTROM, B., LING, P., AND VAN LOAN, C. 1998a. GEMM-based level 3 BLAS: High-performance model implementations and performance evaluation benchmark. ACM Trans. Math. Softw. 24, 3, 268-302.

KAGSTROM, B., LING, P., AND VAN LOAN, C. 1998b. Algorithm 784: GEMM-based level 3 BLAS: Portability and optimization issues. ACM Trans. Math. Softw. 24, 3, 303-316.

LANG, B. 1993. A parallel algorithm for reducing symmetric banded matrices to tridiagonal form. SIAM J. Sci. Comput. 14, 6 (Nov. 1993), 1320-1338.

LANG, B. 1998a. Efficient eigenvalue and singular value computations on shared memory machines. Tech. Rep. BUGHW-SC 98/4. Fachbereich Mathematik, Bergische Universitat, Wuppertal, Germany.

LANG, B. 1998b. Using level 3 BLAS in rotation-based algorithms. SIAM J. Sci. Comput. 19, 2, 626-634.

MOLER, C. B. AND STEWART, G. W. 1973. An algorithm for generalized matrix eigenvalue problems. SIAM J. Numer. Anal. 10, 2, 241-256.

SCHREIBER, R. AND VAN LOAN, C. 1989. A storage-efficient WY representation for products of Householder transformations. SIAM J. Sci. Stat. Comput. 10, 1 (Jan. 1989), 53-57.

WATKINS, D. S. 1994. Shifting strategies for the parallel QR algorithm. SIAM J. Sci. Comput. 15, 4 (July 1994), 953-958.

WATKINS, D. S. 1998. Bulge exchanges in algorithms of QR type. SIAM J. Matrix Anal. Appl. 19, 4, 1074-1096.

WATKINS, D. AND ELSNER, L. 1994. Theory of decomposition and bulge-chasing algorithms for the generalized eigenvalue problem. SIAM J. Matrix Anal. Appl. 15, 3 (July 1994), 943-967.

Received: March 1999; revised: September 1999 and October 1999; accepted: October 1999

This work was partially supported by the Swedish Research Council for Engineering Sciences under grant TFR 98-604.

Authors' address: Department of Computing Science and HPC2N, Umea University, Umea, SE-901 87, Sweden; email dacke@cs.umu.se; bokg@cs.umu.se.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
COPYRIGHT 1999 Association for Computing Machinery, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.