Roundoff error analysis of the CholeskyQR2 algorithm.

1. Introduction. Let $X \in \mathbb{R}^{m \times n}$ be an $m \times n$ matrix with $m \ge n$ of full column rank. We consider the computation of its QR decomposition, $X = QR$, where $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns and $R \in \mathbb{R}^{n \times n}$ is upper triangular. This is one of the most fundamental matrix decompositions and is used in various scientific computations. Examples include linear least squares, preprocessing for the singular value decomposition of a rectangular matrix, and orthogonalization of vectors arising in block Krylov methods [1, 17] or electronic structure calculations [3, 22]. Frequently in applications the matrix is very large, so an algorithm suited for modern high-performance computers is desired.

One important feature of modern high performance architectures is that communication is much slower than arithmetic. Here, communication refers both to data transfer between processors or nodes and to data movement between memory hierarchies. Thus, it is essential for higher performance to minimize the frequency and duration of these communications. To minimize interprocessor communication, the algorithm must have large-grain parallelism. To minimize data movement between memory hierarchies, it is effective to reorganize the algorithm to use level 3 BLAS operations as much as possible. Of course, the benefit of using level 3 BLAS operations increases as the size of the matrices becomes larger.

Conventionally, three major algorithms have been used to compute the QR decomposition: the Householder QR algorithm, the classical Gram-Schmidt (CGS) algorithm, and the modified Gram-Schmidt (MGS) algorithm. The Householder QR algorithm is widely used due to its excellent numerical stability. MGS, which is less stable, is often preferred when the Q factor is needed explicitly, because it requires only half as much work as Householder QR in that case. When the matrix $X$ is well conditioned, CGS is also sometimes used since it provides more parallelism. Note that for matrices with 2-norm condition number $\kappa_2(X)$ at most $O(u^{-1})$, where $u$ is the unit roundoff, repeating CGS or MGS twice leads to algorithms that are as stable as Householder QR. They are known as CGS2 and MGS2, respectively.

For each of these algorithms, variants that can better exploit modern high performance architectures have been developed. There are block versions and recursive versions of Householder QR [6, 18], MGS, and CGS that can perform most of the computations in the form of level 3 BLAS. There is also a variant of Householder QR called the tall-and-skinny QR (TSQR), which has large-grain parallelism and requires only one reduce and broadcast in a distributed environment.

While these variants have been quite successful, they are not completely satisfactory from the viewpoint of high performance computing. In the block and recursive versions mentioned above, the sizes of matrices appearing in the level 3 BLAS are generally smaller than that of $X$ and become even smaller as the level goes down in the case of recursive algorithms. For the TSQR algorithm, though only one reduce is required throughout the algorithm, the reduction operation is a non-standard one, which corresponds to computing the QR decomposition of a $2n \times n$ matrix formed by concatenating two upper triangular matrices. Thus each reduction step requires $O(n^3)$ work, and this tends to become a bottleneck in parallel environments. In addition, the TSQR algorithm requires non-standard level 3 BLAS operations, such as multiplication of two triangular matrices, for which no optimized routines are available on most machines.

There is another algorithm for the QR decomposition, namely the Cholesky QR algorithm. In this algorithm, one first forms the Gram matrix $A = X^TX$, computes its Cholesky factorization $A = R^TR$, and then finds the Q factor by $Q = XR^{-1}$. This algorithm is ideal from the viewpoint of high performance computing because (1) its computational cost is $2mn^2$ flops (in the case where $m \gg n$), which is equivalent to the cost of CGS and MGS and half that of Householder QR, (2) it consists entirely of standard level 3 BLAS operations, (3) the first and third steps are highly parallel large-size level 3 BLAS operations in which two of the three matrices are of size $m \times n$, (4) the second step, which is the only sequential part, requires only $O(n^3)$ work, as opposed to the $O(mn^2)$ work in the first and third steps, and (5) it requires only one reduce and one broadcast if $X$ is partitioned horizontally. Unfortunately, it is well-known that Cholesky QR is not stable. In fact, the deviation from orthogonality of the Q factor computed by Cholesky QR is proportional to $\kappa_2(X)^2$. Accordingly, standard textbooks describe the method as "quite unstable and is to be avoided unless we know a priori that R is well conditioned".

In this paper, we show that the Cholesky QR algorithm can be applied to matrices with a large condition number to give a stable QR factorization if it is repeated twice. More specifically, we show that if $\kappa_2(X)$ is at most $O(u^{-1/2})$, then the Q and R factors obtained by applying Cholesky QR twice satisfy $\|Q^TQ - I\|_F = O(u)$ and $\|X - QR\|_F/\|X\|_2 = O(u)$. Furthermore, we give the coefficients of $u$ in these bounds explicitly as simple low-degree polynomials in $m$ and $n$. In the following, we call this method CholeskyQR2. Of course, the arithmetic cost of CholeskyQR2 is twice that of Cholesky QR, CGS and MGS, but it is equivalent to the cost of Householder QR, CGS2, and MGS2. Given the advantages stated above, the increase in computational work might be more than compensated in some cases. Hence, for matrices with $\kappa_2(X)$ at most $O(u^{-1/2})$, CholeskyQR2 can be the method of choice in terms of both numerical stability and efficiency on high performance architectures.

Examining the numerical stability of CholeskyQR2 is important because some experimental results suggest it can have very attractive parallel performance. Demmel et al. report the performance of various QR decomposition algorithms, including Cholesky QR (not iterated twice), TSQR, CGS, and conventional Householder QR, on a Pentium III cluster with up to 64 processors and an IBM BlueGene/L with up to 256 processors. They report that on the former machine, Cholesky QR was more than 6 times faster than TSQR for 1 to 64 processors, while on the latter, Cholesky QR was more than 3 times faster than TSQR for up to 256 processors. Other algorithms were consistently slower than these two. This suggests that CholeskyQR2 would be 1.5 to 3 times faster than TSQR on these machines. In addition, a recent report by the authors compared the performance of CholeskyQR2, TSQR, and conventional Householder QR on the K computer using up to 16384 nodes. In that experiment, the speedup achieved by CholeskyQR2 over TSQR grew with the number of nodes $p$. Specifically, CholeskyQR2 was about two times faster than TSQR when $p = 1024$ and three times faster when $p = 16384$. A detailed performance analysis of both algorithms based on performance models is also given there.

The idea of performing the QR decomposition twice to get better stability is not new. In his textbook, Parlett analyzes Gram-Schmidt orthogonalization of two vectors and introduces the principle of "twice is enough", which he attributes to Kahan. There is also a classical paper by Daniel, Gragg, Kaufman, and Stewart, which deals with the effect of reorthogonalization on the update of the Gram-Schmidt QR decomposition. More recently, Giraud et al. perform a detailed error analysis of CGS2 and MGS2 and show that they give a numerically orthogonal Q factor and a small residual for matrices with $\kappa_2(X)$ up to $O(u^{-1})$. Stathopoulos et al. show experimentally that the Cholesky QR algorithm can be applied to matrices with a large condition number if it is applied twice (or more). Rozloznik et al. analyze the CholeskyQR2 algorithm in the more general setting of orthogonalization under an indefinite inner product and derive bounds on both the residual and the deviation from orthogonality. However, their bounds are expressed in terms of the computed Q and R factors along with the matrix $B$ that defines the inner product, and do not constitute a priori error bounds, in contrast to the bounds derived in this paper. Also, the coefficients of $u$ are not given explicitly.

Even though the underlying idea of repeating an unstable algorithm twice to improve stability is the same, it is worth noting the inherent disadvantage of CholeskyQR2 when compared with CGS2 and MGS2: numerical breakdown. Specifically, if $\kappa_2(X) \gg O(u^{-1/2})$, then the Cholesky factorization of $X^TX$ can break down, and so does CholeskyQR2. By contrast, Gram-Schmidt type algorithms are free from such breakdowns (except for very obvious breakdowns due to division by zero in the normalization) and give stable QR factorizations for the much wider class of matrices with $\kappa_2(X)$ up to $O(u^{-1})$ when repeated twice.

The rest of this paper is organized as follows. In Section 2, after giving some definitions and assumptions, we introduce the CholeskyQR2 algorithm. A detailed error analysis of CholeskyQR2 is presented in Section 3. Numerical results that support our analysis are provided in Section 4. Section 5 gives some discussion of our results. Finally, some concluding remarks are given in Section 6.

2. The CholeskyQR2 algorithm.

2.1. Notation and assumptions. In the following, we consider computing the QR decomposition of an $m \times n$ real matrix $X$, where $m \ge n$. Throughout this paper, we assume that computations are performed in floating point arithmetic conforming to the IEEE 754 standard and denote the unit roundoff by $u$. Let $\sigma_i(X)$ be the $i$th largest singular value of $X$ and $\kappa_2(X) = \sigma_1(X)/\sigma_n(X)$ be its condition number. We further assume that

(2.1) $\delta \equiv 8\kappa_2(X)\sqrt{mnu + n(n+1)u} \le 1.$

This means that the condition number of $X$ is at most $O(u^{-1/2})$. From this assumption and $\kappa_2(X) \ge 1$, we also have

(2.2) $mnu \le \frac{1}{64}, \qquad n(n+1)u \le \frac{1}{64}.$
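Assumption (2.1) is easy to check numerically for a given problem size and a condition number estimate. The sketch below (Python/NumPy; the function name and the example sizes are ours, not the paper's) evaluates $\delta$ in binary64:

```python
import numpy as np

u = 2.0 ** -53  # unit roundoff of IEEE 754 binary64

def delta(m, n, kappa):
    """delta from assumption (2.1); the analysis requires delta <= 1."""
    return 8.0 * kappa * np.sqrt(m * n * u + n * (n + 1) * u)

# For m = 10000, n = 100, the assumption admits condition numbers of
# roughly 1e4; kappa = 1e6 clearly violates it.
print(delta(10000, 100, 1e4) <= 1.0)   # True
print(delta(10000, 100, 1e6) <= 1.0)   # False
```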

Following the notation of Higham's Accuracy and Stability of Numerical Algorithms, let us define a quantity $\gamma_k$ for a positive integer $k$ by

$\gamma_k = \frac{ku}{1 - ku}.$

Then, it is easy to show that under the assumption (2.1),

(2.3) $\gamma_m = \frac{mu}{1 - mu} \le 1.1\,mu, \qquad \gamma_{n+1} = \frac{(n+1)u}{1 - (n+1)u} \le 1.1\,(n+1)u.$

Throughout the paper, a vector norm is always the 2-norm.

2.2. The algorithm. In the Cholesky QR algorithm, we compute the QR decomposition of X by the following procedure:

$A = X^TX,$

$R = \mathrm{chol}(A),$

$Y = XR^{-1},$

where $\mathrm{chol}(A)$ is a function that computes the (upper triangular) Cholesky factor of $A$. Then, $X = YR$ can be regarded as the QR decomposition of $X$.
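In floating point arithmetic this procedure is only a few lines of code. The following is a minimal sketch in NumPy (the function name is ours; note that NumPy's `cholesky` returns the lower triangular factor, so we transpose to obtain the upper triangular $R$):

```python
import numpy as np

def cholesky_qr(X):
    """One pass of Cholesky QR: returns Y, R with X = Y R, R upper triangular."""
    A = X.T @ X                       # Gram matrix; one reduce in parallel
    R = np.linalg.cholesky(A).T       # A = R^T R with R upper triangular
    Y = np.linalg.solve(R.T, X.T).T   # Y = X R^{-1} via a triangular-type solve
    return Y, R
```

For a well-conditioned $X$ the computed $Y$ is close to orthonormal; for an ill-conditioned $X$ the deviation grows like $\kappa_2(X)^2 u$, which is what motivates the second pass introduced next.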

In the CholeskyQR2 algorithm, after obtaining Y and R by the above procedure, we further compute the following:

$B = Y^TY,$

$S = \mathrm{chol}(B),$

$Z = YS^{-1} \ (= X(SR)^{-1}),$

$U = SR.$

If the columns of Y are exactly orthonormal, B becomes the identity and $Z = Y$. However, in finite precision arithmetic, this does not hold in general and $Z \neq Y$. In the CholeskyQR2 algorithm, the QR decomposition of X is given by $X = ZU$.
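The whole method is thus the single pass applied twice, with the two triangular factors multiplied together. A self-contained sketch in NumPy (names are ours):

```python
import numpy as np

def cholesky_qr2(X):
    """CholeskyQR2: returns Z, U with X = Z U, U upper triangular."""
    # First pass: X = Y R.
    R = np.linalg.cholesky(X.T @ X).T
    Y = np.linalg.solve(R.T, X.T).T
    # Second pass on Y: Y = Z S.
    S = np.linalg.cholesky(Y.T @ Y).T
    Z = np.linalg.solve(S.T, Y.T).T
    U = S @ R   # X = Z (S R) = Z U; a product of upper triangular matrices
    return Z, U
```

Note that $U = SR$ is again upper triangular, being the product of two upper triangular matrices.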

3. Error analysis of the CholeskyQR2 algorithm. Our objective is to show that under assumption (2.1), the CholeskyQR2 algorithm delivers an orthogonal factor $Z$ and an upper triangular factor $U$ for which both the orthogonality $\|Z^TZ - I\|_F$ and the relative residual $\|X - ZU\|_F/\|X\|_2$ are of $O(u)$. Here, the constants in $O(u)$ are low-degree polynomials in $m$ and $n$, but do not depend on $\kappa_2(X)$.

This section is structured as follows. In Section 3.1, we formulate the CholeskyQR2 algorithm in floating point arithmetic and prepare several bounds that are necessary to evaluate the orthogonality of the computed orthogonal factor. Using these bounds, the bound on the orthogonality is derived in Section 3.2. In Section 3.3, several bounds that are needed to evaluate the residual are provided, and they are used in Section 3.4 to give a bound on the residual.

3.1. Preparation for evaluating the orthogonality. Let us denote the matrices A, R and Y computed using floating point arithmetic by $\hat{A} = \mathrm{fl}(X^TX)$, $\hat{R} = \mathrm{fl}(\mathrm{chol}(\hat{A}))$ and $\hat{Y} = \mathrm{fl}(X\hat{R}^{-1})$, respectively. Taking rounding errors into account, the computed quantities satisfy

(3.1) $\hat{A} = X^TX + E_1,$

(3.2) $\hat{R}^T\hat{R} = \hat{A} + E_2,$

(3.3) $\hat{y}_i^T(\hat{R} + \Delta\hat{R}_i) = x_i^T \quad (i = 1, \dots, m).$

Here, $x_i^T$ and $\hat{y}_i^T$ are the $i$th row vectors of $X$ and $\hat{Y}$, respectively. The forward error of the matrix-matrix multiplication $X^TX$ is denoted by $E_1$, while $E_2$ is the backward error of the Cholesky decomposition of $\hat{A}$. The matrix $\Delta\hat{R}_i$ denotes the backward error arising from solving the linear system $y_i^T\hat{R} = x_i^T$ by forward substitution. It would be easier if we could express the backward error of the forward substitution with a single matrix $\Delta\hat{R}$ as

$\hat{Y}(\hat{R} + \Delta\hat{R}) = X,$

but we have to use the row-wise expression (3.3) instead, because the backward error depends on the right-hand side vector $x_i^T$.

In the following, we evaluate each of $E_1$, $E_2$ and $\Delta\hat{R}_i$. We also give bounds on the 2-norms of $\hat{R}^{-1}$ and $X\hat{R}^{-1}$ for later use. Furthermore, we derive an alternative form of equation (3.3):

$\hat{y}_i^T\hat{R} = x_i^T + \Delta x_i^T,$

in which the backward error enters in the right-hand side vector instead of the coefficient matrix. Equivalently, $\Delta x_i^T$ is the residual of the linear system $\hat{y}_i^T\hat{R} = x_i^T$. Then, by letting $\Delta X = (\Delta x_1, \Delta x_2, \dots, \Delta x_m)^T$, we can rewrite (3.3) as

$\hat{Y} = (X + \Delta X)\hat{R}^{-1},$

which is more convenient to use. We also evaluate the norm of $\Delta X$.

3.1.1. Forward error in the matrix-matrix multiplication $X^TX$. Let $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{n\times p}$. Then, the componentwise forward error of the matrix-matrix multiplication $C = AB$ can be evaluated as

(3.4) $|C - \hat{C}| \le \gamma_n |A||B|,$

where $\hat{C} = \mathrm{fl}(AB)$, $|A|$ denotes the matrix whose $(i,j)$ element is $|a_{ij}|$, and the inequality sign means componentwise inequality. The 2-norm of the $i$th column of $X$, which we denote by $\tilde{x}_i$, is clearly less than or equal to $\|X\|_2$. Hence,

(3.5) $|(E_1)_{ij}| \le \gamma_m \|\tilde{x}_i\|_2 \|\tilde{x}_j\|_2 \le \gamma_m \|X\|_2^2.$

Thus we have

(3.6) $\|E_1\|_2 \le \|E_1\|_F \le \gamma_m n \|X\|_2^2.$

Simplifying this result using (2.3) leads to

(3.7) $\|E_1\|_2 \le 1.1\,mnu\,\|X\|_2^2.$
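The componentwise bound (3.4) can be observed empirically by forming a product in binary32 and comparing against a binary64 reference (a sketch with parameters of our choosing; a BLAS that accumulates more accurately than the standard model only makes the inequality hold with more slack):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, p = 60, 50, 40
# Inputs exactly representable in binary32, so only the product rounds.
A = rng.standard_normal((m, n)).astype(np.float32).astype(np.float64)
B = rng.standard_normal((n, p)).astype(np.float32).astype(np.float64)

exact = A @ B                                               # binary64 reference
computed = (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float64)

u32 = 2.0 ** -24                                            # binary32 unit roundoff
gamma_n = n * u32 / (1 - n * u32)
bound = gamma_n * (np.abs(A) @ np.abs(B))                   # right-hand side of (3.4)
print(np.all(np.abs(computed - exact) <= bound))            # should print True
```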

3.1.2. Backward error of the Cholesky decomposition of $\hat{A}$. Let $A \in \mathbb{R}^{n\times n}$ be symmetric positive definite and assume that the Cholesky decomposition of $A$ in floating point arithmetic runs to completion and the upper triangular Cholesky factor $\hat{R}$ is obtained. Then, there exists $\Delta A \in \mathbb{R}^{n\times n}$ satisfying

$\hat{R}^T\hat{R} = A + \Delta A, \qquad |\Delta A| \le \gamma_{n+1}|\hat{R}^T||\hat{R}|;$

see Theorem 10.3 of Higham's Accuracy and Stability of Numerical Algorithms for details. In our case, we take $A = \hat{A}$ in (3.2) to obtain

(3.8) $\hat{R}^T\hat{R} = \hat{A} + E_2, \qquad |E_2| \le \gamma_{n+1}|\hat{R}^T||\hat{R}|.$

Hence,

(3.9) $\|E_2\|_2 \le \|E_2\|_F \le \gamma_{n+1}\,\big\|\,|\hat{R}^T||\hat{R}|\,\big\|_F \le \gamma_{n+1}\, n\, \|\hat{R}\|_2^2.$

On the other hand, we have from equation (3.2),

(3.10) $\|\hat{R}\|_2^2 \le \|\hat{A}\|_2 + \|E_2\|_2.$

Substituting equation (3.10) into the right hand side of equation (3.9) leads to

$\|E_2\|_2 \le \gamma_{n+1} n \,(\|\hat{A}\|_2 + \|E_2\|_2),$

or,

(3.11) $\|E_2\|_2 \le \frac{\gamma_{n+1} n}{1 - \gamma_{n+1} n}\,\|\hat{A}\|_2.$

Noting that

(3.12) $\|\hat{A}\|_2 \le \|X^TX\|_2 + \|E_1\|_2 \le \|X\|_2^2 + \gamma_m n\|X\|_2^2 = (1 + \gamma_m n)\|X\|_2^2,$

from equations (3.1) and (3.6) we have

$\|E_2\|_2 \le \frac{\gamma_{n+1} n (1 + \gamma_m n)}{1 - \gamma_{n+1} n}\,\|X\|_2^2.$

This result can be simplified using (2.2) and (2.3) as

(3.13) $\|E_2\|_2 \le 1.1\,n(n+1)u \cdot \frac{1 + 1.1/64}{1 - 1.1/64}\,\|X\|_2^2 \le 1.2\,n(n+1)u\,\|X\|_2^2.$

3.1.3. Backward error of the forward substitution. Let $U \in \mathbb{R}^{n\times n}$ be a nonsingular triangular matrix. Then, the solution $\hat{x}$ obtained by solving the linear system $Ux = b$ by substitution in floating point arithmetic satisfies

(3.14) $(U + \Delta U)\hat{x} = b, \qquad |\Delta U| \le \gamma_n |U|;$

see Theorem 8.5 of Higham's Accuracy and Stability of Numerical Algorithms. Note that $\Delta U$ depends both on $U$ and $b$, although the bound in (3.14) does not. In our case, $U = \hat{R}$, so we have for $1 \le i \le m$,

(3.15) $\|\Delta\hat{R}_i\|_2 \le \|\Delta\hat{R}_i\|_F \le \gamma_n \big\||\hat{R}|\big\|_F \le \gamma_n \sqrt{n}\,\|\hat{R}\|_2.$

By inserting equation (3.11) into the right-hand side of (3.10), and using (3.12), we have

(3.16) $\|\hat{R}\|_2^2 \le \frac{1}{1 - \gamma_{n+1} n}\,\|\hat{A}\|_2 \le \frac{1 + \gamma_m n}{1 - \gamma_{n+1} n}\,\|X\|_2^2.$

Inserting this into equation (3.15) leads to

$\|\Delta\hat{R}_i\|_2 \le \gamma_n \sqrt{\frac{n(1 + \gamma_m n)}{1 - \gamma_{n+1} n}}\,\|X\|_2.$

Simplifying the right-hand side in the same way as in equation (3.13), we obtain

(3.17) $\|\Delta\hat{R}_i\|_2 \le 1.1\,nu\sqrt{\frac{n(1 + 1.1mnu)}{1 - 1.1n(n+1)u}}\,\|X\|_2 \le 1.1\,nu\sqrt{\frac{n(1 + 1.1/64)}{1 - 1.1/64}}\,\|X\|_2 \le 1.2\,n\sqrt{n}\,u\,\|X\|_2.$

3.1.4. Bounding the 2-norm of $\hat{R}^{-1}$. Next we evaluate the 2-norm of $\hat{R}^{-1}$. Noting that all the matrices appearing in equation (3.2) are symmetric, we can apply the Bauer-Fike theorem (or Weyl's theorem) to obtain

$(\sigma_n(X))^2 - (\|E_1\|_2 + \|E_2\|_2) \le (\sigma_n(\hat{R}))^2.$

Using assumption (2.1) and equations (3.7) and (3.13), we have $\|E_1\|_2 + \|E_2\|_2 \le \frac{1.2}{64}(\sigma_n(X))^2 \le \left(1 - \frac{1}{1.12}\right)(\sigma_n(X))^2$. Hence,

$\frac{1}{1.12}(\sigma_n(X))^2 \le (\sigma_n(\hat{R}))^2,$

leading to the bound on $\hat{R}^{-1}$ as

(3.18) $\|\hat{R}^{-1}\|_2 = (\sigma_n(\hat{R}))^{-1} \le 1.1\,(\sigma_n(X))^{-1}.$

3.1.5. Bounding the 2-norm of $X\hat{R}^{-1}$. From equation (3.2), we have

(3.19) $(X\hat{R}^{-1})^T(X\hat{R}^{-1}) = \hat{R}^{-T}(\hat{A} - E_1)\hat{R}^{-1} = \hat{R}^{-T}(\hat{R}^T\hat{R} - E_1 - E_2)\hat{R}^{-1} = I - \hat{R}^{-T}(E_1 + E_2)\hat{R}^{-1}.$

Thus,

$\|X\hat{R}^{-1}\|_2^2 \le 1 + \|\hat{R}^{-1}\|_2^2\,(\|E_1\|_2 + \|E_2\|_2).$

By using $\|E_1\|_2 + \|E_2\|_2 \le \frac{1.2}{64}(\sigma_n(X))^2$ again and inserting equation (3.18), we obtain

(3.20) $\|X\hat{R}^{-1}\|_2 \le 1.1.$

3.1.6. Evaluation of the backward error $\Delta X$. From equation (3.3), we have

$\hat{y}_i^T = x_i^T(\hat{R} + \Delta\hat{R}_i)^{-1} = x_i^T(I + \hat{R}^{-1}\Delta\hat{R}_i)^{-1}\hat{R}^{-1}.$

Now, let

$\hat{S}_i = (I + \hat{R}^{-1}\Delta\hat{R}_i)^{-1} - I.$

Then, since $\hat{S}_i = \sum_{k=1}^{\infty}(-\hat{R}^{-1}\Delta\hat{R}_i)^k$, we obtain the bound on $\|\hat{S}_i\|_2$ as

(3.21) $\|\hat{S}_i\|_2 \le \frac{\|\hat{R}^{-1}\|_2\|\Delta\hat{R}_i\|_2}{1 - \|\hat{R}^{-1}\|_2\|\Delta\hat{R}_i\|_2}.$

The denominator of equation (3.21) can be bounded below using equations (3.17) and (3.18) as

$1 - \|\hat{R}^{-1}\|_2\|\Delta\hat{R}_i\|_2 \ge 1 - 1.1(\sigma_n(X))^{-1} \cdot 1.2\,n\sqrt{n}\,u\,\|X\|_2 = 1 - 1.32\,\kappa_2(X)\,n\sqrt{n}\,u \ge 0.95.$

Inserting this into equation (3.21) and evaluating the numerator using equations (3.17) and (3.18) again, we have

$\|\hat{S}_i\|_2 \le \frac{1}{0.95} \times 1.1\,\kappa_2(X) \times 1.2\,n\sqrt{n}\,u \le 1.4\,\kappa_2(X)\,n\sqrt{n}\,u.$

Now, let

$\Delta x_i^T = x_i^T \hat{S}_i.$

Then,

(3.22) $\hat{y}_i^T = (x_i^T + \Delta x_i^T)\,\hat{R}^{-1}.$

By defining the matrix $\Delta X \in \mathbb{R}^{m\times n}$ as $\Delta X = (\Delta x_1, \Delta x_2, \dots, \Delta x_m)^T$, we can rewrite equation (3.22) as

(3.23) $\hat{Y} = (X + \Delta X)\,\hat{R}^{-1}.$

The bound on $\|\Delta X\|_F$ can be given as

(3.24) $\|\Delta X\|_F = \sqrt{\sum_{i=1}^m \|\Delta x_i\|_2^2} \le 1.4\,\kappa_2(X)\,n\sqrt{n}\,u\,\|X\|_F \le 1.4\,\kappa_2(X)\,n^2 u\,\|X\|_2,$

where the relationship $\sqrt{\sum_{i=1}^m \|x_i^T\|_2^2} = \|X\|_F \le \sqrt{n}\,\|X\|_2$ is used to derive the last inequality.

3.2. Orthogonality of $\hat{Y}$ and $\hat{Z}$. Based on the bounds given in the previous subsection, we evaluate the orthogonality of $\hat{Y}$ and $\hat{Z}$ as computed by the Cholesky QR and CholeskyQR2 algorithms. The following lemma holds.

LEMMA 3.1. Suppose that $X \in \mathbb{R}^{m\times n}$, with $m \ge n$, satisfies (2.1). Then, the matrix $\hat{Y}$ obtained by applying the Cholesky QR algorithm in floating point arithmetic to $X$ satisfies the following inequality, with $\delta$ defined as in (2.1),

$\|\hat{Y}^T\hat{Y} - I\|_2 \le 5\,\kappa_2(X)^2\,(mnu + n(n+1)u) = \frac{5}{64}\,\delta^2.$

Proof. By expanding $\hat{Y}^T\hat{Y}$ using equation (3.23), we have

$\hat{Y}^T\hat{Y} = \hat{R}^{-T}(X + \Delta X)^T(X + \Delta X)\hat{R}^{-1} = I - \hat{R}^{-T}(E_1 + E_2)\hat{R}^{-1} + (\Delta X\hat{R}^{-1})^T(X\hat{R}^{-1}) + (X\hat{R}^{-1})^T(\Delta X\hat{R}^{-1}) + (\Delta X\hat{R}^{-1})^T(\Delta X\hat{R}^{-1}).$

Here, we used equation (3.19) to derive the last equality. Thus,

(3.25) $\|\hat{Y}^T\hat{Y} - I\|_2 \le \|\hat{R}^{-1}\|_2^2(\|E_1\|_2 + \|E_2\|_2) + 2\|\Delta X\|_2\|\hat{R}^{-1}\|_2\|X\hat{R}^{-1}\|_2 + \|\Delta X\|_2^2\|\hat{R}^{-1}\|_2^2 \le 5\,\kappa_2(X)^2\,(mnu + n(n+1)u) = \frac{5}{64}\,\delta^2.$

In the second inequality, we used equations (3.7), (3.13), (3.18), (3.20), and (3.24), together with the assumption $\delta \le 1$ to absorb the higher order terms.

The next corollary follows immediately from Lemma 3.1.

COROLLARY 3.2. The condition number of $\hat{Y}$ satisfies $\kappa_2(\hat{Y}) \le 1.1$.

Proof. By Lemma 3.1 and $\delta \le 1$, every eigenvalue $\lambda_i$ of $\hat{Y}^T\hat{Y}$ satisfies

$1 - \frac{5}{64} \le \lambda_i \le 1 + \frac{5}{64}.$

Hence, every singular value $\sigma_i(\hat{Y})$ of $\hat{Y}$ satisfies

(3.26) $\frac{\sqrt{59}}{8} \le \sigma_i(\hat{Y}) \le \frac{\sqrt{69}}{8}.$

Thus it follows that

$\kappa_2(\hat{Y}) = \frac{\sigma_1(\hat{Y})}{\sigma_n(\hat{Y})} \le \sqrt{\frac{69}{59}} \le 1.1.$

In other words, the matrix $\hat{Y}$ obtained by applying the Cholesky QR algorithm once is extremely well-conditioned, though its deviation from orthogonality, $\|\hat{Y}^T\hat{Y} - I\|_2$, can still be of order 0.1.
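This effect is easy to observe numerically. In the sketch below (NumPy; the test matrix with prescribed $\kappa_2(X) = 10^5$ mirrors the construction used in Section 4), a single Cholesky QR pass leaves a deviation from orthogonality many orders of magnitude larger than $u$, of order $\kappa_2(X)^2 u$, yet the computed $\hat{Y}$ already has condition number very close to 1:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, kappa = 1000, 50, 1e5

# X = U diag(sigma) V^T with singular values decaying from 1 to 1/kappa.
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = kappa ** (-np.arange(n) / (n - 1))
X = (U * sigma) @ V.T

# One Cholesky QR pass.
R = np.linalg.cholesky(X.T @ X).T
Y = np.linalg.solve(R.T, X.T).T

dev = np.linalg.norm(Y.T @ Y - np.eye(n), 2)   # O(kappa^2 u), far from O(u)
cond = np.linalg.cond(Y)                       # but already close to 1
print(dev, cond)
```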

Combining Lemma 3.1 and Corollary 3.2, we obtain one of the main results of this paper.

THEOREM 3.3. The matrix $\hat{Z}$ obtained by applying CholeskyQR2 in floating point arithmetic to $X$ satisfies the following inequality:

$\|\hat{Z}^T\hat{Z} - I\|_2 \le 6\,(mnu + n(n+1)u).$

Proof. Noting that $\kappa_2(\hat{Y}) \le \sqrt{69/59}$ from Corollary 3.2 and applying Lemma 3.1 again to $\hat{Y}$, we have

(3.27) $\|\hat{Z}^T\hat{Z} - I\|_2 \le 5\,\kappa_2(\hat{Y})^2\,(mnu + n(n+1)u) \le 5 \cdot \frac{69}{59}\,(mnu + n(n+1)u) \le 6\,(mnu + n(n+1)u).$

3.2.1. Orthogonality error in the Frobenius norm. In the previous sections, we derived the bound on the orthogonality error in terms of the 2-norm, because we wanted to give a bound on the 2-norm condition number of $\hat{Y}$. However, by tracing the derivation of equation (3.25), we can also derive the following bound in the Frobenius norm,

$\|\hat{Y}^T\hat{Y} - I\|_F \le \|\hat{R}^{-1}\|_2^2(\|E_1\|_F + \|E_2\|_F) + 2\|\Delta X\|_F\|\hat{R}^{-1}\|_2\|X\hat{R}^{-1}\|_2 + \|\Delta X\|_F^2\|\hat{R}^{-1}\|_2^2.$

As is clear from equations (3.6) and (3.9), the upper bounds on $\|E_1\|_2$ and $\|E_2\|_2$ that were used in equation (3.25) are also bounds on $\|E_1\|_F$ and $\|E_2\|_F$. Thus, the same bound given in equation (3.27) holds for the Frobenius norm as well. We summarize this observation as a corollary.

COROLLARY 3.4. The matrix $\hat{Z}$ obtained by applying CholeskyQR2 in floating point arithmetic to $X$ satisfies the following inequality:

(3.28) $\|\hat{Z}^T\hat{Z} - I\|_F \le 6\,(mnu + n(n+1)u).$

3.3. Preparation for evaluating the residual. Let the matrices B, S, Z, and U computed in floating point arithmetic be denoted by $\hat{B} = \mathrm{fl}(\hat{Y}^T\hat{Y})$, $\hat{S} = \mathrm{fl}(\mathrm{chol}(\hat{B}))$, $\hat{Z} = \mathrm{fl}(\hat{Y}\hat{S}^{-1})$, and $\hat{U} = \mathrm{fl}(\hat{S}\hat{R})$, respectively. Then we have

(3.29) $\hat{B} = \hat{Y}^T\hat{Y} + E_3, \qquad \hat{S}^T\hat{S} = \hat{B} + E_4, \qquad \hat{z}_i^T(\hat{S} + \Delta\hat{S}_i) = \hat{y}_i^T \quad (i = 1, \dots, m),$

(3.30) $\hat{U} = \hat{S}\hat{R} + E_5.$

Here, $\hat{y}_i^T$ is the $i$th row vector of $\hat{Y}$, $E_3$ and $E_5$ are the forward errors of the matrix multiplications $\hat{Y}^T\hat{Y}$ and $\hat{S}\hat{R}$, respectively, while $E_4$ is the backward error of the Cholesky decomposition of $\hat{B}$. The matrix $\Delta\hat{S}_i$ is the backward error introduced in solving the linear system $z_i^T\hat{S} = \hat{y}_i^T$ by forward substitution.

As a preparation for evaluating the residual, we first give estimates for the norms of $\hat{R}$, $\hat{S}$, $\Delta\hat{S}_i$, $E_5$ and $\hat{Z}$.

3.3.1. Evaluation of $\hat{R}$. From equation (3.16), we have

(3.31) $\frac{\|\hat{R}\|_2}{\|X\|_2} \le \sqrt{\frac{1 + \gamma_m n}{1 - \gamma_{n+1} n}} \le \sqrt{\frac{1 + 1.1/64}{1 - 1.1/64}} = \sqrt{\frac{651}{629}} \le 1.1.$

3.3.2. Evaluation of $\hat{S}$. Noticing that $\|\hat{Y}\|_2 \le \sqrt{69}/8$ from equation (3.26), we can obtain an upper bound on the norm of $\hat{S}$ by multiplying the bound of equation (3.31), applied with $\hat{Y}$ in place of $X$, by $\sqrt{69}/8$. Thus,

$\|\hat{S}\|_2 \le \sqrt{\frac{651}{629}} \times \frac{\sqrt{69}}{8} \le 1.1.$

3.3.3. Evaluation of $\Delta\hat{S}_i$. Similarly, multiplying the bound of equation (3.17) by $\sqrt{69}/8$ leads to the following bound on $\Delta\hat{S}_i$:

$\|\Delta\hat{S}_i\|_2 \le \frac{\sqrt{69}}{8} \times 1.2\,n\sqrt{n}\,u \le 1.3\,n\sqrt{n}\,u.$

3.3.4. Evaluation of $E_5$. By using the error bound on matrix multiplication given in equation (3.4), we have

$|E_5| \le \gamma_n |\hat{S}||\hat{R}|.$

Hence,

$\|E_5\|_F \le \gamma_n \cdot \sqrt{n}\,\|\hat{S}\|_2 \cdot \sqrt{n}\,\|\hat{R}\|_2 \le 1.1\,nu \cdot n \cdot 1.1 \cdot 1.1\,\|X\|_2 \le 1.4\,n^2 u\,\|X\|_2.$

3.3.5. Evaluation of $\hat{Z}$. From equation (3.26), we have $\|\hat{Y}\|_F \le (\sqrt{69}/8)\sqrt{n}$. This is a bound that does not depend on $\|X\|_2$, so it holds also for $\|\hat{Z}\|_F$. Hence,

$\|\hat{Z}\|_F \le \frac{\sqrt{69}}{8}\sqrt{n} \le 1.1\sqrt{n}.$

3.4. Bounding the residual. Based on the above results, we evaluate the residual of the pair $(\hat{Z}, \hat{U})$. The following theorem holds, which is also one of our main results.

THEOREM 3.5. Assume that an $m \times n$ real matrix $X$ ($m \ge n$) satisfies equation (2.1). Then the matrices $\hat{Z}$ and $\hat{U}$ obtained by applying the CholeskyQR2 algorithm in floating point arithmetic to $X$ satisfy the following inequality:

(3.32) $\frac{\|\hat{Z}\hat{U} - X\|_F}{\|X\|_2} \le 5\,n^2\sqrt{n}\,u.$

Proof. Expanding $\hat{z}_i^T\hat{U} - x_i^T$ by equations (3.30), (3.29), and (3.3) leads to

(3.33) $\hat{z}_i^T\hat{U} - x_i^T = \hat{z}_i^T(\hat{S}\hat{R} + E_5) - x_i^T = (\hat{y}_i^T - \hat{z}_i^T\Delta\hat{S}_i)\hat{R} + \hat{z}_i^T E_5 - x_i^T = -\hat{y}_i^T\Delta\hat{R}_i - \hat{z}_i^T\Delta\hat{S}_i\hat{R} + \hat{z}_i^T E_5.$

Hence,

$\|\hat{Z}\hat{U} - X\|_F \le \|\hat{Y}\|_F \max_i \|\Delta\hat{R}_i\|_2 + \|\hat{Z}\|_F \max_i \|\Delta\hat{S}_i\|_2\,\|\hat{R}\|_2 + \|\hat{Z}\|_F\,\|E_5\|_2 \le (1.3 + 1.6 + 1.6)\,n^2\sqrt{n}\,u\,\|X\|_2 \le 5\,n^2\sqrt{n}\,u\,\|X\|_2.$

4. Numerical results. In this section we evaluate the numerical stability of CholeskyQR2 and compare it with the stability of other popular QR decomposition algorithms, namely, Householder QR, classical and modified Gram-Schmidt (CGS and MGS; we also run them twice, shown as CGS2 and MGS2), and Cholesky QR. To this end, we generate test matrices with a specified condition number by $X := U\Sigma V^T \in \mathbb{R}^{m\times n}$, where $U$ is an $m \times n$ random orthogonal matrix, $V$ is an $n \times n$ random orthogonal matrix, and

$\Sigma = \mathrm{diag}(1, \sigma^{1/(n-1)}, \dots, \sigma^{(n-2)/(n-1)}, \sigma).$

Here, $0 < \sigma < 1$ is some constant. Thus $\|X\|_2 = 1$ and the 2-norm condition number of $X$ is $\kappa_2(X) = 1/\sigma$. We vary $\kappa_2(X)$, $m$, and $n$, and investigate the dependence of the orthogonality and residual on them. All computations were done in Matlab R2012b, using IEEE 754 binary64 (double precision) so that $u = 2^{-53} \approx 1.11 \times 10^{-16}$, on a computer running Mac OS X version 10.8, equipped with a 2 GHz dual-core Intel Core i7 processor.

In Figures 4.1 through 4.6 we show the orthogonality and residual measured in the Frobenius norm under various conditions. Figures 4.1 and 4.2 show the orthogonality $\|\hat{Q}^T\hat{Q} - I\|_F$ and residual $\|\hat{Q}\hat{R} - X\|_F$, respectively, for the case $m = 10000$, $n = 100$, and varying $\kappa_2(X)$. In Figures 4.3 and 4.4, $\kappa_2(X) = 10^5$, $n = 100$, and $m$ varies from 1000 to 10000. In Figures 4.5 and 4.6, $\kappa_2(X) = 10^5$, $m = 1000$, and $n$ varies from 100 to 1000.

It is clear from Figures 4.1 and 4.2 that both the orthogonality and residual are independent of $\kappa_2(X)$ and are of order $O(u)$, as long as $\kappa_2(X)$ is at most $O(u^{-1/2})$. This is in good agreement with the theoretical prediction and is in marked contrast to the results of CGS, MGS, and Cholesky QR, for which the deviation from orthogonality increases in proportion to $\kappa_2(X)$ for MGS and to $(\kappa_2(X))^2$ for CGS and Cholesky QR. As can be seen from Figures 4.3 through 4.6, the orthogonality and residual increase only mildly with $m$ and $n$, which is also in agreement with the theoretical results, although the bounds are inevitably overestimates. Compared with Householder QR, CholeskyQR2 was generally observed to produce smaller orthogonality and residual errors. From these results, we can conclude that CholeskyQR2 is stable for matrices with condition number at most $O(u^{-1/2})$. As is well-known, Gram-Schmidt type algorithms also perform well when repeated twice.

5. Discussion. We now discuss four topics related to the stability of CholeskyQR2. First, we compare the orthogonality and residual bounds of CholeskyQR2, given in Theorems 3.4 and 3.5, respectively, with known bounds for Householder QR and CGS2. Second, we consider how to examine the applicability of CholeskyQR2 for a given matrix. Third, we show that CholeskyQR2 is not only norm-wise stable, but also column-wise stable. Finally, we discuss the row-wise stability of CholeskyQR2, which cannot be proved in general, but is nearly always observed in practice.

5.1. Comparison with the error bounds of Householder QR and CGS2.

5.1.1. Orthogonality. For Householder QR, the Q factor is computed by applying n Householder transformations to [I.sub.1:m,1:n], the m x n matrix consisting of the first n columns of the identity matrix of order m. Hence, from Lemma 19.3 of [11], the computed Q factor satisfies

[??] = [P.sup.T]([I.sub.1:m,1:n] + [DELTA]I),

where P is some exactly orthogonal m x m matrix and [DELTA]I is an m x n matrix whose column vectors have norms bounded by n[[gamma].sub.cm], where c is a small positive constant. From this, it is easy to derive the bound

[parallel][[??].sup.T][??] - I[[parallel].sub.F] [less than or equal to] n[square root of n][[gamma].sub.c'm] [equivalent] c'mn[square root of n]u.

For CGS2, Giraud et al. [9] show the following bound on the deviation from orthogonality under the assumption that [[kappa].sub.2](X)[m.sup.2][n.sup.3]u = O(1).

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Although this assumption is hard to satisfy for large matrices (notice that [[kappa].sub.2](X)[m.sup.2][n.sup.3] is [10.sup.19] for the largest matrix appearing in Figure 4.3, far above the [u.sup.-1] [approximately equal to] 9 x [10.sup.15] that the assumption requires), it has been observed that CGS2 produces nearly orthogonal matrices in many cases where this condition is violated.

Comparing these bounds with equation (3.28), we observe that the error bound of CholeskyQR2 is smaller by a factor [square root of n]. This is in qualitative agreement with the results of the numerical experiments given in the previous section.

Note, however, that this difference should not be overemphasized, because these are merely upper bounds. In fact, the Givens QR algorithm admits an error bound that is smaller than that of Householder QR by a factor n, but no difference in accuracy has been observed in practice [11, p. 368].

5.1.2. Residual. According to [11, Sec. 19.3], the upper bound on the residual of the Householder QR algorithm can be evaluated as O(mn[square root of n]u). As for CGS2, it is not difficult to derive a bound of the same order using the results given in . Thus, the CholeskyQR2 algorithm has a smaller bound also in terms of the residual. This is related to the fact that in CholeskyQR2 the computation of Y from X, and of Z from Y, is done by row-wise forward substitution. Hence the backward errors introduced there, whose sum of squares is one of the main sources of the residual, do not depend on m when [parallel]X[[parallel].sub.2] is fixed. In addition, the forward error in the computation of [??][??], which is another source of the residual, also involves only n. Thus, the residual depends only on n, in marked contrast to Householder QR.

A few more comments are in order regarding the bound (3.32). A close examination of equation (3.33) shows that the highest order term in the residual comes from the forward error of the matrix multiplication [??] [??], which we denoted by [E.sub.5]. This implies that if we compute this matrix multiplication using extended precision arithmetic, we can reduce the upper bound on the residual to O([n.sup.2]u) with virtually no increase in the computational cost (when m [much greater than] n). Moreover, in a situation where only the orthogonal factor [??] is needed, as in the orthogonalization of vectors, we can leave the product [??] [??] uncomputed and say that the triplet ([??], [??], [??]) has residual O([n.sup.2]u).

5.2. Applicability of CholeskyQR2 for a given matrix. There are cases in which the condition number of X is known in advance to be moderate. An example is the orthogonalization of vectors in first-principles molecular dynamics . In this application, we are interested in the time evolution of a matrix X(t) [member of] [R.sup.mxn] with orthonormal columns, which form an orthonormal basis for the space of occupied-state wave functions. To obtain X(t + [DELTA]t), we first compute [??] = X(t) - F(X)[DELTA]t, where F(X) [member of] [R.sup.mxn] is some nonlinear matrix function of X, and then compute X(t + [DELTA]t) by orthogonalizing the columns of [??]. Since X(t) has orthonormal columns, we can easily evaluate the deviation from orthogonality of [??] by computing the norm of F(X)[DELTA]t. Usually, the time step [DELTA]t is small enough to ensure that [[kappa].sub.2]([??]) [much less than] [u.sup.-1/2].

In some cases, however, the condition number of X cannot be estimated in advance, and one may want to examine the applicability of CholeskyQR2 from intermediate quantities computed in the algorithm. This is possible if [??] has been computed without breakdown in the Cholesky decomposition. Given [??], one can estimate its largest and smallest singular values using the power method and the inverse power method, respectively, on [[??].sup.T][??]. (Similarly, the MATLAB condition number estimator condest first computes an LU factorization of the input matrix and then applies a few iterations of a power-type method to obtain a reliable estimate of the 1-norm condition number.) This does not cost too much, because [??] is triangular and each step of both methods requires only O([n.sup.2]) work. After that, one can evaluate the condition number of X by using the relations (3.1) and (3.2), the bounds (3.7) and (3.13) on [parallel][E.sub.1][[parallel].sub.2] and [parallel][E.sub.2][[parallel].sub.2], respectively, and the Bauer-Fike theorem.
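As a sketch of this estimation procedure (illustrative only; for brevity the triangular solves below call a general solver, although in practice each one is an O([n.sup.2]) forward or back substitution), the extreme singular values of a triangular factor R can be obtained as follows:

```python
import numpy as np

def extreme_singular_values(R, iters=100, seed=0):
    """Estimate sigma_max and sigma_min of a triangular factor R via the power
    method and the inverse power method on R^T R. Each solve with R or R^T is
    really an O(n^2) substitution because R is triangular."""
    n = R.shape[0]
    rng = np.random.default_rng(seed)

    # Power method on R^T R: converges to the largest eigenvalue sigma_max^2.
    v = rng.standard_normal(n)
    for _ in range(iters):
        w = R.T @ (R @ v)
        v = w / np.linalg.norm(w)
    smax = np.linalg.norm(R @ v)

    # Inverse power method on R^T R: (R^T R)^{-1} v is applied by two
    # triangular solves, R^T y = v followed by R w = y.
    v = rng.standard_normal(n)
    for _ in range(iters):
        y = np.linalg.solve(R.T, v)
        w = np.linalg.solve(R, y)
        v = w / np.linalg.norm(w)
    smin = np.linalg.norm(R @ v)
    return smax, smin

# Sanity check on a triangular factor with known singular values 1, ..., 1e-3:
# the R factor of A = U diag(s) V^T shares the singular values of A.
rng = np.random.default_rng(1)
n = 30
U = np.linalg.qr(rng.standard_normal((n, n)))[0]
V = np.linalg.qr(rng.standard_normal((n, n)))[0]
A = U @ np.diag(np.logspace(0, -3, n)) @ V.T
R = np.linalg.qr(A)[1]
smax, smin = extreme_singular_values(R)
cond_est = smax / smin   # estimate of kappa_2(R), close to 1e3 here
```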

5.3. Column-wise stability of CholeskyQR2. Thus far, we have investigated the normwise residual of CholeskyQR2. Sometimes the columns of X have widely varying norms, and one may wish to obtain the more stringent column-wise backward stability, which requires

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Here, [[??].sub.j] and [[??].sub.j] denote the jth columns of X and [??] , respectively. In this subsection, we prove that CholeskyQR2 is indeed column-wise backward stable.

To see this, we first consider a single Cholesky QR and show that the norm of the jth column of [??] is of the same order as that of the jth column of X. Let us recall equations (3.1) through (3.3). From equation (3.5), we have

(5.1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

By considering the (j, j)th element of equation (3.2) and substituting equations (5.1) and (3.8), we obtain

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Hence,

(5.2) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Now we demonstrate the column-wise backward stability of a single Cholesky QR. Let the jth column of [??] be denoted by [[??].sub.j] . From equation (3.3), we have

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Here, [DELTA][[??].sup.(i).sub.j] is the jth column of [DELTA] [[??].sub.i]. Thus,

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Squaring both sides and summing over i leads to

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

By using [parallel][??] [[parallel].sub.F] = O(1) (see Lemma 3.1) and equation (5.2), we can establish the column-wise backward stability of Cholesky QR as follows

(5.3) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

To apply the above result to CholeskyQR2, we consider the backward errors in the second QR decomposition Y = ZS and the product of the two upper triangular factors U = SR. These backward errors, which we denote by [DELTA][??] and [DELTA][??], respectively, satisfy

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Here, [[??].sub.j] is the jth column of [??]. To evaluate [DELTA][??], we note the following inequality, which can be obtained in the same way as equation (5.3)

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Summing both sides over i and taking the square root gives

(5.4) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

As for [DELTA][??], the standard result on the error analysis of matrix-vector products, combined with [parallel][??][[parallel].sub.F] [equivalent] [parallel][??][[parallel].sub.F] = O(1), leads to

(5.5) [parallel][DELTA][??][[parallel].sub.F] [less than or equal to] [[gamma].sub.n][parallel][??][[parallel].sub.F] = [[gamma].sub.n] x O(1).

On the other hand,

(5.6) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

By substituting equations (5.3), (5.4), and (5.5) into equation (5.6), we finally obtain the column-wise backward stability of CholeskyQR2 as follows.

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
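This column-wise backward stability is easy to check numerically. The sketch below (our own illustrative code and scaling choices, not the authors') applies CholeskyQR2 to a matrix whose column norms span several orders of magnitude and measures each column's relative residual:

```python
import numpy as np

def cholesky_qr2(X):
    """CholeskyQR2 (sketch): two passes of Cholesky QR, R factors combined."""
    def chol_qr(A):
        R = np.linalg.cholesky(A.T @ A).T   # upper triangular factor
        return np.linalg.solve(R.T, A.T).T, R
    Q1, R1 = chol_qr(X)
    Q2, R2 = chol_qr(Q1)
    return Q2, R2 @ R1

# Columns with widely varying norms: column j is scaled by 2^{-j/2},
# keeping kappa_2(X) well below u^{-1/2} so Cholesky cannot break down.
rng = np.random.default_rng(0)
m, n = 1000, 30
X = rng.standard_normal((m, n)) * np.power(2.0, -0.5 * np.arange(n))
Q, R = cholesky_qr2(X)

# Column-wise relative residuals ||x_j - Q r_j|| / ||x_j||: each is O(u),
# even though the column norms differ by a factor of about 2^{14.5}.
colwise = np.linalg.norm(X - Q @ R, axis=0) / np.linalg.norm(X, axis=0)
```

Every entry of `colwise` stays at the roundoff level, not merely the columns of largest norm.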

5.4. Row-wise stability of CholeskyQR2. In this subsection, we investigate the row-wise stability of CholeskyQR2, which is defined as

(5.7) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Here [x.sup.T.sub.i] and [[??].sup.T.sub.i] denote the ith rows of the matrices.

The requirement (5.7) is strictly more stringent than normwise stability, and indeed the standard Householder QR factorization does not always achieve (5.7). It is known [11, Ch. 19] that when row sorting and column pivoting are employed, Householder QR does give row-wise stability. However, pivoting incurs an increased communication cost and is best avoided in high-performance computing.

Having established the normwise and column-wise stability of CholeskyQR2, we now examine its row-wise stability. To gain insight, we first run experiments with a semi-randomly generated matrix X whose row norms vary widely. Specifically, we generate a random m x n matrix via the MATLAB command X = randn(m,n) and then left-multiply it by a diagonal matrix, X := DX, with [D.sub.jj] = [2.sup.j/2] for j = 1, ..., m. Here we set m = 100 and n = 50; the matrix thus has rows of exponentially growing norms, and [[kappa].sub.2](X) [approximately equal to] [u.sup.-1/2]. Figure 5.1 shows the row-wise residuals of three algorithms: standard Householder QR, Householder QR with row sorting and column pivoting, and CholeskyQR2.

We make several observations from Figure 5.1. First, we confirm the known fact that the standard Householder QR factorization is not row-wise backward stable, but that this can be cured by employing row sorting and column pivoting. Second, CholeskyQR2 gives row-wise stability comparable to that of Householder QR with row sorting and column pivoting. This is perhaps surprising, considering that CholeskyQR2 employs no pivoting or sorting.
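An experiment of this kind can be sketched as follows. This is an illustrative reimplementation rather than the authors' code, and it uses the milder row scaling [D.sub.jj] = [2.sup.j/4] (rather than [2.sup.j/2]) so that [[kappa].sub.2](X) stays safely below [u.sup.-1/2]:

```python
import numpy as np

def cholesky_qr2(X):
    """CholeskyQR2 (sketch): two passes of Cholesky QR, R factors combined."""
    def chol_qr(A):
        R = np.linalg.cholesky(A.T @ A).T
        return np.linalg.solve(R.T, A.T).T, R
    Q1, R1 = chol_qr(X)
    Q2, R2 = chol_qr(Q1)
    return Q2, R2 @ R1

# Rows of exponentially growing norms: row j is scaled by 2^{j/4}.
rng = np.random.default_rng(0)
m, n = 100, 50
D = np.power(2.0, 0.25 * np.arange(1, m + 1))
X = D[:, None] * rng.standard_normal((m, n))
Q, R = cholesky_qr2(X)

# Row-wise relative residuals ||x_i^T - q_i^T R|| / ||x_i^T||: despite the
# absence of any pivoting or sorting, every row comes out with an O(u) residual.
rowwise = np.linalg.norm(X - Q @ R, axis=1) / np.linalg.norm(X, axis=1)
```

Even the rows of smallest norm, which standard Householder QR typically serves worst, retain tiny relative residuals here.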

To illustrate the situation, we examine the first step of CholeskyQR2. Recall that [[??].sup.T.sub.i] = fl([x.sup.T.sub.i][[??].sup.-1]). Hence [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] and, by the standard error analysis of triangular systems, there exist [DELTA][[??].sub.i] for i = 1, ..., m such that

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Hence, for row-wise stability, we need [parallel]fl([x.sup.T.sub.i][[??].sup.-1])[DELTA][[??].sub.i][parallel] = O(u)[parallel][x.sup.T.sub.i][parallel]. Since

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

a sufficient condition is

(5.8) [parallel]fl([x.sup.T.sub.i][[??].sup.-1]) [parallel] = O([parallel] [x.sup.T.sub.i] [parallel]/ [parallel] [??][parallel]).

Since the general normwise bound for [parallel][y.sup.T][R.sup.-1][parallel] is [parallel][y.sup.T][R.sup.-1][parallel] [less than or equal to] [[kappa].sub.2](R)[parallel]y[parallel]/[parallel]R[parallel], the condition (5.8) is more stringent than this bound by a factor [[kappa].sub.2](R), and hence significantly more stringent when R is ill-conditioned.

Even so, as illustrated in the example above, in all our experiments with random matrices the condition (5.8) was satisfied, with [parallel][x.sup.T.sub.i] - [[??].sup.T.sub.i][??][parallel]/[parallel][x.sup.T.sub.i][parallel] < nu for all i. We suspect that this is due to the observation, known to experts, that triangular linear systems are usually solved to much higher accuracy than the theoretical bounds suggest [11, 20]. However, as with this classical observation, counterexamples do exist in our case. For example, taking R to be the Kahan matrix , an ill-conditioned triangular matrix that is a classical counterexample in this context, the general bound [parallel]fl([y.sup.T][R.sup.-1])[parallel] = O([[kappa].sub.2](R)[parallel][y.sup.T][parallel]/[parallel]R[parallel]) is typically tight for a randomly generated [y.sup.T], which means that (5.8) is significantly violated. In view of this, we form X so that the Cholesky factor R of [X.sup.T]X is the Kahan matrix. This can be done by taking X = QR for an m x n orthogonal matrix Q. To introduce large variation in the row norms of X, we construct Q as the orthogonal factor of a matrix generated as in the example above. For every such X, with varying size n, (5.8) was still satisfied. We then appended a row of random elements, with much smaller norm than the rest, at the bottom of X and repeated the experiment. Now the row-wise residual for the last row was significantly larger than O(u[parallel][x.sup.T.sub.i][parallel]), indicating that row-wise stability does not always hold. Employing pivoting in the Cholesky factorization did not improve the residual.
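The Kahan behavior is easy to reproduce numerically. The sketch below builds the Kahan matrix K = diag(1, s, ..., [s.sup.n-1])U, where s = sin [theta], c = cos [theta], and U is unit upper triangular with -c above the diagonal (the usual construction, as in MATLAB's gallery('kahan',n)), and checks that [parallel][y.sup.T][K.sup.-1][parallel][parallel]K[parallel]/[parallel]y[parallel] is of order [[kappa].sub.2](K) rather than the O(1) that condition (5.8) would demand:

```python
import numpy as np

def kahan(n, theta=1.2):
    """Kahan matrix: diag(1, s, ..., s^{n-1}) times the unit upper triangular
    matrix with -c above the diagonal, where s = sin(theta), c = cos(theta)."""
    s, c = np.sin(theta), np.cos(theta)
    U = np.eye(n) - c * np.triu(np.ones((n, n)), k=1)
    return np.diag(s ** np.arange(n)) @ U

n = 60
K = kahan(n)
rng = np.random.default_rng(0)
y = rng.standard_normal(n)

# z^T = y^T K^{-1}, computed by one triangular solve with K^T.
z = np.linalg.solve(K.T, y)

# Condition (5.8) would require this ratio to be O(1); for the Kahan matrix it
# instead approaches the general bound of order kappa_2(K), which is huge here.
ratio = np.linalg.norm(z) * np.linalg.norm(K, 2) / np.linalg.norm(y)
```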

A referee has suggested further examples for which CholeskyQR2 fails to be row-wise backward stable. One example is the following: take X to be the off-diagonal part of the 6 x 6 Hilbert matrix and set its (3, 3) element to 5 x [10.sup.6].

Experiments suggest nonetheless that cases in which CholeskyQR2 is not row-wise stable are extremely rare.

6. Conclusion. We performed a roundoff error analysis of the CholeskyQR2 algorithm for computing the QR decomposition of an m x n real matrix X, where m [greater than or equal to] n. We showed that if X satisfies equation (2.1), the computed Q and R factors, which we denote by [??] and [??], respectively, satisfy the following error bounds.

(6.1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

The bounds shown here are of smaller order than the corresponding bounds for the Householder QR algorithm. Furthermore, it was shown that when only the Q factor is required, the right-hand side of equation (6.1) can be reduced to O([n.sup.2]u). Numerical experiments support our theoretical analysis. CholeskyQR2 is also column-wise backward stable, like Householder QR. We also observed that row-wise stability, a more stringent condition than the norm-wise stability expressed by equation (6.1), nearly always holds in practice, though it cannot be proved theoretically.

In this paper, we focused on the stability of CholeskyQR2. Performance results of CholeskyQR2 on large-scale parallel machines, along with comparisons with other QR decomposition algorithms and a detailed performance analysis, are given in our recent paper [8].

When the matrix is nearly square, it might be more efficient to partition the matrix into panels and apply the CholeskyQR2 algorithm to each panel successively. Development of such an algorithm remains as future work.

Acknowledgments. We thank Professor Nicholas J. Higham for suggesting the Kahan matrix as a possible counterexample to the row-wise backward stability of CholeskyQR2. We also thank Professor Masaaki Sugihara for valuable comments on the first version of our manuscript. We are grateful to the referees for their suggestions, especially for another counterexample to the row-wise backward stability and a comment on column-wise stability.

REFERENCES

 Z. BAI, J. DEMMEL, J. DONGARRA, A. RUHE, AND H. VAN DER VORST, eds., Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide, SIAM, Philadelphia, 2000.

 G. BALLARD, J. DEMMEL, O. HOLTZ, AND O. SCHWARTZ, Minimizing communication in numerical linear algebra, SIAM J. Matrix Anal. Appl., 32 (2011), pp. 866-901.

 R. CAR AND M. PARRINELLO, Unified approach for molecular dynamics and density-functional theory, Phys. Rev. Lett., 55 (1985), pp. 2471-2474.

 J. W. DANIEL, W. B. GRAGG, L. KAUFMAN, AND G. W. STEWART, Reorthogonalization and stable algorithms for updating the Gram-Schmidt QR factorization, Math. Comp., 30 (1976), pp. 772-795.

 J. DEMMEL, L. GRIGORI, M. HOEMMEN, AND J. LANGOU, Communication-optimal parallel and sequential QR and LU factorizations, SIAM J. Sci. Comput., 34 (2012), pp. 206-239.

 E. ELMROTH AND F. G. GUSTAVSON, Applying recursion to serial and parallel QR factorization leads to better performance, IBM J. Res. Develop., 44 (2000), pp. 605-624.

 T. FUKAYA, T. IMAMURA, AND Y. YAMAMOTO, Performance analysis of the Householder-type parallel tall-skinny QR factorizations toward automatic algorithm selection, in High Performance Computing for Computational Science-VECPAR 2014, M. Dayde, O. Marques, and K. Nakajima, eds., Lecture Notes in Computer Science, 8969, Springer, Cham, 2015, pp. 269-283.

 T. FUKAYA, Y. NAKATSUKASA, Y. YANAGISAWA, AND Y. YAMAMOTO, CholeskyQR2: a simple and communication-avoiding algorithm for computing a tall-skinny QR factorization on a large-scale parallel system, in Proceedings of the 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, IEEE Computer Society Press, Los Alamitos, 2014, pp. 31-38.

 L. GIRAUD, J. LANGOU, M. ROZLOZNIK, AND J. VAN DEN ESHOF, Rounding error analysis of the classical Gram-Schmidt orthogonalization process, Numer. Math., 101 (2005), pp. 87-100.

 G. H. GOLUB AND C. F. VAN LOAN, Matrix Computations, 4th ed., Johns Hopkins University Press, Baltimore, 2012.

 N. J. HIGHAM, Accuracy and Stability of Numerical Algorithms, 2nd ed., SIAM, Philadelphia, 2002.

 J. IWATA, D. TAKAHASHI, A. OSHIYAMA, T. BOKU, K. SHIRAISHI, S. OKADA, AND K. YABANA, A massively-parallel electronic-structure calculations based on real-space density functional theory, J. Comput. Phys., 229 (2010), pp. 2339-2363.

 W. JALBY AND B. PHILIPPE, Stability analysis and improvement of the block Gram-Schmidt algorithm, SIAM J. Sci. Statist. Comput., 12 (1991), pp. 1058-1073.

 S. J. LEON, A. BJORCK, AND W. GANDER, Gram-Schmidt orthogonalization: 100 years and more, Numer. Linear Algebra Appl., 20 (2013), pp. 492-532.

 B. N. PARLETT, The Symmetric Eigenvalue Problem, SIAM, Philadelphia, 1998.

 M. ROZLOZNIK, F. OKULICKA-DLUZEWSKA, AND A. SMOKTUNOWICZ, Cholesky-like factorization of symmetric indefinite matrices and orthogonalization with respect to bilinear forms, Preprint NCMM/2013/31, Necas Center for Mathematical Modeling, Prague, 2013.

 Y. SAAD, Numerical Methods for Large Eigenvalue Problems, Manchester University Press, Manchester, 1992.

 R. SCHREIBER AND C. F. VAN LOAN, A storage-efficient WY representation for products of Householder transformations, SIAM J. Sci. Statist. Comput., 10 (1989), pp. 53-57.

 A. STATHOPOULOS AND K. WU, A block orthogonalization procedure with constant synchronization requirements, SIAM J. Sci. Comput., 23 (2002), pp. 2165-2182.

 G. W. STEWART, Introduction to Matrix Computations, Academic Press, New York, 1973.

 G. W. STEWART, Matrix Algorithms. Vol. 1: Basic Decompositions, SIAM, Philadelphia, 1998.

 S. TOLEDO AND E. RABANI, Very large electronic structure calculations using an out-of-core filter-diagonalization method, J. Comput. Phys., 180 (2002), pp. 256-269.

YUSAKU YAMAMOTO ([dagger]), YUJI NAKATSUKASA ([double dagger]), YUKA YANAGISAWA ([section]), AND TAKESHI FUKAYA ([paragraph])

* Received July 18, 2014. Accepted April 10, 2015. Published online on June 14, 2015. Recommended by F. Dopico. The research of Y. Yamamoto was supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research (B)(No. 26286087), Core Research for Evolutional Science and Technology (CREST) Program "Highly Productive, High Performance Application Frameworks for Post Petascale Computing" of Japan Science and Technology Agency (JST). The research of Y. Nakatsukasa was supported by JSPS Scientific Research Grant No. 26870149.

([dagger]) The University of Electro-Communications, Tokyo, Japan / JST CREST, Tokyo, Japan (yusaku.yamamoto@uec.ac.jp).

([double dagger]) University of Tokyo, Tokyo 113-8656, Japan (nakatsukasa@mist.i.u-tokyo.ac.jp).

([section]) Waseda University, Tokyo, Japan (yuuka@ruri.waseda.jp).

([paragraph]) RIKEN Advanced Institute for Computational Science, Kobe, Japan / Hokkaido University, Hokkaido, Japan / JST CREST, Tokyo, Japan (fukaya@iic.hokudai.ac.jp).