# Time and memory requirements of the nonequispaced FFT.

AbstractWe consider the fast Fourier transform at nonequispaced nodes (NFFT), and give detailed information on the time and memory requirements of its building blocks. This manuscript reviews the state of the art approaches and focuses within the most successful scheme on the most computationally involved part. Beside a rigorous derivation of a lookup table technique, we compare a wide range of precomputation schemes which lead to substantially different computation times of the NFFT. In particular, we show how to balance accuracy, memory usage, and computation time.

Key words and phrases. Nonequispaced Fast Fourier Transform, FFT

1 Introduction

This paper summarises algorithms for the discrete and fast Fourier transform at nonequispaced nodes. Generalising the famous fast Fourier transform (FFT), we evaluate for N [member of] 2N, M [member of] N, a vector of coefficients f = [([[??].sub.k]).sub.k = -N/2, ..., N/2 - 1] [member of] [C.sup.N], and a set of nodes [x.sub.j] [member of] R the sums

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

In the multivariate setting the discrete Fourier transform at nonequispaced nodes (NDFT) requires O([N.sub.[pi]]M) operations for [N.sub.[pi]] equispaced frequencies and M nonequispaced sampling nodes [x.sub.j] [member of] [R.sup.d]. In contrast, the approximate NFFT takes only O([N.sub.[pi]] log [N.sub.[pi]]+M) floating point operations (flops), where the constant of proportionality depends in theory solely on the prescribed target accuracy and on the space dimension d.

There is a variety of important applications which utilise the NDFT, e.g., in computerised tomography [18, 32, 34, 29, 28], for fast summation algorithms [35, 36, 23], for moving least squares approximation [12], as fast Fourier transform on the sphere [25], or as part of the ridgelet and curvelet transforms [27, 6]. Furthermore, the reconstruction from nonuniform samples is stated in [20, 13, 1, 26] as inversion of the NDFT and used, for example, in magnetic resonance imaging [15, 39, 10] or as polar FFT [2, 14]. In each of these applications the actual computation of the NDFT is the computationally dominant task and one has to deal with different requirements on the NFFT with respect to the target accuracy, the usage of memory, and the actual computation time. An early review of several algorithms for the NFFT is given in [41]. Only later, a unified approach to fast algorithms for the present task was obtained in [38, 37] and recently a particular property of the Gaussian window function was utilised in [19] to speed up computations when no precomputation is possible.

Despite the fact that there exist a few tutorial papers for the NFFT (see also [21]), the aim of this manuscript is to give detailed information on the accuracy, memory, and time requirements of NFFT algorithms and to describe how to balance these factors. In particular, the O(M)-step of the NFFT hides a significant constant and we focus on a variety of precomputation schemes which lead to substantially different computation times of the whole NFFT.

The outline of the paper is as follows. In Section 2 minor improvements in the direct calculation of the NDFT are discussed; cf. [3]. Furthermore the unified approach to the NFFT is reviewed and alternative NFFTs are discussed. In Section 3 we compare different methods for the fast evaluation and precomputation of the window functions utilised in the most popular NFFT-approach. Finally, we compare the different NFFTs numerically in Section 4 and draw conclusions. All used algorithms are available in our widely used software [24].

2 Notation, the NDFT and the NFFT

This section summarises the mathematical theory and ideas behind the NFFT. For d, M [member of] N let the torus [T.sup.d] := [R.sup.d]/[Z.sup.d] ~ [[-1/2, 1/2).sup.d] and the sampling set x := {[x.sub.j] [member of] [T.sup.d] : j = 0, ..., M - 1} be given. Furthermore, let the multi degree N = [([N.sub.0], [N.sub.1], ..., [N.sub.d - 1]}).sup.T] [member of] 2[N.sup.d] and the index set for possible frequencies [I.sub.N] := {-[N.sub.0]/2, [N.sub.0]/2 - 1} x ... x {-[N.sub.d-1]/2, [N.sub.d-1]/2 - 1} be given. We define the space of d-variate trigonometric polynomials [T.sub.N] of multi degree N by

[T.sub.N] := span {[e.sup.-2[pi]ikx] : k [member of] [I.sub.N]}

The dimension of this space and hence the total number of Fourier coefficients is [N.sub.[pi]] = [N.sub.0] ... [N.sub.d-1]. Note, that we abbreviate the inner product between the frequency k and the time/spatial node x by kx = [k.sup.T] x = [k.sub.0][x.sub.0] + [k.sub.1][x.sub.1] + ... + [k.sub.d-1][x.sub.d-1]. For clarity of presentation the multi index k addresses elements of vectors and matrices as well.

2.1 NDFT

For a finite number of given Fourier coefficients [[??].sub.k] [member of] C, k [member of [I.sub.N], one wants to evaluate the trigonometric polynomial

f(x) := [summation over [k [member of] [I.sub.N]] [[??].sub.k][e.sup.-2[pi]ikx] (2.1)

at given nonequispaced nodes [x.sub.j] [member of] [T.sup.d], j = 0, ..., M - 1. Thus, our concern is the computation of the matrix vector product

f = A [??] (2.2)

where

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

The straight forward algorithm for this matrix vector product, which is called NDFT in Algorithm 1, takes O(M [N.sub.[pi]]) arithmetical operations and stores no matrix elements at all, but rather uses M [N.sub.[pi]] direct calls of the function cexp() to evaluate the complex exponentials [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Algorithm 1: NDFT Input: d, M [member of] N, N [member of] 2[N.sup.d], [x.sub.j] [member of] [[-1/2, 1/2].sup.d], j = 0, ..., M - 1, and [[??].sub.k] [member of] C, k [member of] [I.sub.N]. for j = 0 ..., M - 1 do [f.sub.j] = 0 for k [member of] [I.sub.N] do [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] end for end for Output: values [f.sub.j] = f([x.sub.j]), j = 0, ..., M - 1.

Related matrix vector products axe the adjoint NDFT

[??] = [A.sup.H] f, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],

where the update step in Algorithm 1 is simply changed to [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], the conjugated NDFT f = [bar.A][??], and the transposed NDFT [??] = [A.sup.T]f where [A.sup.H] = [[bar.A].sup.T]. Note, furthermore, that the inversion formula [F.sup.-1] = [F.sup.H] for the (equispaced and normalised) Fourier matrix F does not hold in the general situation of arbitrary sampling nodes for the matrix A.

NDFT acceleration

Algorithm 1 evaluates M [N.sub.[pi]] complex exponentials. Due to the fact that these direct calls are more expensive than multiplications, we may basically change the update step to [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], i.e., do a Hornet-like-scheme; see also [3]. Hence, in general 2dM direct calls are sufficient for the computations in Algorithm 1. Note, however, that this approach looses numerical stability to some extend; cf. [40]. Trading even more memory for the acceleration of the computation, one might precompute all entries of the matrix A, which is only feasible for small [N.sub.[pi]] and M; see Example 4.1 in Section 4.

2.2 NFFT

The most successful approach for the fast computation of (2.2) [8, 5, 38, 37, 16, 15, 19] is based on the usage of an oversampled FFT and a window function which is simultaneously localised in time/space and frequency. Basically, the scheme utilises the convolution theorem in the following three informal steps:

1. deconvolve the trigonometric polynomial f in (2.1) with the window function in frequency domain,

2. compute an oversampled FFT on the result of step 1., and

3. convolve the result of step 2. with the window function in time/spatial domain; evaluate this convolution at the nodes [x.sub.j].

Throughout the rest of the paper [sigma] > 1 and n = [sigma]N [member of] N will denote the oversampling factor and the FFT size, respectively. Furthermore, for d > 1 let [sigma] [member of] [R.sup.d], [[sigma].sub.0], ..., [[sigma].sub.d-1] > 1, n = [sigma] [dot encircle], and [n.sub.[pi]] = [n.sub.0] .... [n.sub.d-1] denote the oversampling factor, the FFT size, and the total FFT size, respectively. Here, we use for notational convenience the pointwise product [sigma] [dot encircle] [N := [([[sigma].sub.0][N.sub.0], [[sigma].sub.1][N.sub.1], ..., [[sigma].sub.d-1] [N.sub.d-1],).sup.t] with its inverse [N.sup.-1] := [(1/[N.sub.0], 1/[N.sub.1], ..., 1/[N.sub.d-1]).sup.T].

The window function

Starting with a window function [phi] [member of] [L.sub.2](R), which is well localised in the time/spatial domain R and in the frequency domain R, respectively, one assumes that its 1-periodic version [??], i.e.,

[??](x) := [summation over (r[member of]Z)][phi](x + r),

has an uniformly convergent Fourier series, and is well localised in the time/spatial domain T and in the frequency domain Z, respectively. Thus, the periodic window function [??} may be represented by its Fourier series

[??](x) [summation over (k[member of]Z)] [??](k) [e.sup.-2[pi]ikx]

with the Fourier coefficients

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

We truncate this series at the FFT length n which causes an aliasing error. If [phi] is furthermore well localised in time/spatial domain R, it can be truncated with truncation parameter m [member of] N, m [much less than] n, and approximated by the function [phi] x X[-m/n, m/n] which has compact support within the interval [-m/n, m/n]. Moreover, the periodic window function can be approximated by the periodic version of the truncated window function. For d > 1, univaxiate window functions [[phi].sub.0], ..., [[phi].sub.d-1], and a node x = [([x.sub.0], ..., [x.sub.d-1]).sup.T] the multivariate window function is simply given by

[phi](x) := [[phi].sub.0]([x.sub.0])[[phi].sub.1]([x.sub.1]) ... [[phi].sub.d-1] ([x.sub.d-1]), (2.3)

where [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] again denotes the 1-periodic version; an immediate observation is

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

For a single truncation parameter m [member of] N the window function is truncated to the cube [n.sup.-1] [dot encircle][[-m, m].sup.d].

[FIGURE 2.1 OMITTED]

We follow the general approach of [38, 37] and approximate the complex exponentials in the trigonometric polynomial (2.1) by

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (2.4)

where the set

[I.sub.n,m](x) := {l [member of] [I.sub.n] : n [dot encircle] x - m1 [less than or equal to] l [less than or equal to] n [dot encircle] x + m1}

collects these indices where the window function is mostly concentrated (the inequalities have to be fulfilled modulo n and for each component). After changing the order of summation in (2.1) we obtain for [x.sub.j] [member of] [T.sup.d], j = 0,..., M - 1, the approximation

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

This causes a truncation and an aliasing error; see [37, 33] for details. As can be readily seen, after an initial deconvolution step the expression in brackets can be computed via a d-variate FFT of total size [n.sub.[pi]]. The final step consists of the evaluation of sums having at most [(2m + 1).sup.d] summands where the window function is sampled only in the neighbourhood of the node [x.sub.j].

The algorithm and its matrix notation

The proposed scheme reads in matrix vector notation as

A [??] [approximately equal to] B F D [??], (2.5)

where B denotes the real M x [n.sub.[pi]] sparse matrix

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (2.6)

where F is the d-variate Fourier matrix of size [n.sub.[pi]] x [n.sub.[pi]], and where D is the real [n.sub.[pi]] x [N.sub.[[pi] diagonal matrix

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

with zero matrices [O.sub.t] of size [N.sub.t] x [n.sub.t] - [N.sub.t]/2. Obviously, the approximate matrix splitting can by applied to the adjoint matrix as [A.sup.H] [approximately equal to] [D.sup.T][F.sup,H][B.sup.T], where the multiplication with the sparse matrix [B.sup.T] is implemented in a transposed way, summation as outer loop and only using the index sets [I.sub.n,m] ([x.sub.j]).

Table 2.2 shows the computational requirements for the deeonvolution step of the NFFT. We give detailed information on the number of precomputed and stored real numbers (memory), the order of magnitude for the number of floating point operations (flops), and the number of evaluations for the univariate Fourier-transformed window function [??]. Precomputing the factors [[??].sub.t]([k.sub.t]) for t = 0, ..., d - 1 and [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is denoted by its associated Flag PRE_PHI_HUT within the software library.

This is followed by one FFT of total size [n.sub.[pi]]. Hence, the computational complexity of the NFFT increases for a larger oversampling factor [sigma], affecting both the deconvolution step and the FFT. The time and memory requirements of the convolution and evaluation step are discussed in Section 3 in detail. In summary, we propose Algorithm 2 and Algorithm 3 for the computation of the nonequispaced FFT (2.2) and its adjoint, respectively.

Alternative NFFTs

Taylor based NFFT: A simple but nevertheless fast scheme for the computation of (2.2) in the univariate case d = 1 is presented in [1]. This approach uses for each node [x.sub.j] [member of] [-1/2, 1/2) an m-th order Taylor expansion of the trigonometric polynomial in (2.1) about the nearest neighbouring point on the oversampled equispaced lattice [{[n.sup.-1]k - 1/2}.sub.k=0], ..., n-1] where again n = [sigma]N, a [much greater than] 1. Besides its simple structure and only O(N log N + M) arithmetic operations, this algorithm utilises m FFTs of size n compared to only one in the NFFT approach, uses a medium amount of extra memory, and is not suited for highly accurate computations; see Example 4.2. Furthermore, its extension to higher dimensions has not been considered so far.

Algorithm 2: NFFT Input: d, M [member of] N, N [member of] 2[N.sup.d], [x.sub.j] [member of] [[-1/2, 1/2].sup.d], j = 0, ..., M - 1, and [[??].sub.k] [member of] C, k [member of] [I.sub.N]. For k [member of] [I.sub.N] compute [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. For l [member of] [I.sub.n] compute by d-variate FFT [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. For j = 0, ..., M - 1 compute [s.sub.j] := [summation over (l[member of][I.sub.n,m]([x.sub.j])] [g.sub.l] [??] ([x.sub.j] - [n.sup.-1] [dot encircle] l). Output: approximate values [s.sub.j] [approximately equal to] [f.sub.j], j = 0, ..., M - 1. Algorithm 3: [NFFT.sup.H] Input: d, M [member of] N, N [member of] 2[N.sup.d], [x.sub.j] [member of] [[-1/2, 1/2].sup.d] and [f.sub.j] [member of] C, j = 0, ..., M - 1. Compute tile sparse matrix vector product g := [B.sup.T] f. Apply the d-variate IFFT as [??] := [F.sup.H]g. Multiply by the 'diagonal' matrix, i.e., [??] := [D.sup.T] [??]. Output: approximate values [[??].sub.k], k [member of] [I.sub.N].

Multipole based NFFT: A second approach for the univariate case d = 1 is considered in [9] and based on a Lagrange interpolation technique. After taking a N-point FFT of the vector [??] in (2.2), one uses an exact polynomial interpolation scheme to obtain the values of the trigonometric polynomial f at the nonequispaced nodes [x.sub.j]. Here, the time consuming part is the exact polynomial interpolation scheme which can be realised fast in an approximate way by means of the fast multipole method. This approach is appealing since it also allows for the inverse transform. Nevertheless, numerical experiments in [9] showed that this approach is far more time consuming than Algorithm 2 and the inversion can only be computed in a stable way for almost equispaced nodes [9].

Linear algebra based NFFT: Using a fully discrete approach, one might fix the entries of the diagonal matrix D in (2.5) first and precompute optimised entries for the sparse matrix B to achieve higher accuracy [30, 31]. A similar approach, based on min-max interpolation, has been taken within [15]. While these approaches gain some accuracy for the Gaussian or B-Spline windows, no reasonable improvement is obtained for the Kaiser-Bessel window function. Since it is more expensive to precompute these optimised entries of the matrix B, we do not further consider these schemes.

3 Evaluation techniques for window functions

To keep the aliasing error and the truncation error small, several univariate functions T with good localisation in time and frequency domain were proposed. For an oversampling factor [sigma] > 1, a degree N [member of] 2N, the FFT length n = [sigma]N, and a cut-off parameter m [member of] N, the following window functions are considered:

1. for a shape parameter b = 2[sigma]/2[sigma]-1 m/[pi] the dilated Gaussian window [8, 38, 7]

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (3.1)

2. for [M.sub.2m] denoting the centred cardinal B-Spline of order 2m the dilated B-Spline window [5, 38]

[phi](x) = [M.sub.2m](nx), (3.2)

3. the dilated Sinc window [33]

[phi](x) = [[sinc].sup.2m]((2[sigma] - 1) N/2m [pi]x) (3.3)

with sinc(x) := sin(x) / x for x [not equal to] 0 and sinc(0) := 1

4. and for a shape parameter b = [pi](2 - 1/[sigma]) the dilated Kaiser-Bessel window [35]

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (3.4)

Note, that the latter two have compact support in frequency domain while the second one has compact support in time domain. Further references on the usage of (generalised) Kaiser-Bessel window functions include [22, 16, 28], where some authors prefer to interchange the role of time and frequency domain. For these univariate window functions [phi], the error introduced by Algorithm 2 obeys

[absolute value of f([x.sub.j]) - [s.sub.j]] [less than or equal to] [C.sub.[sigma],m][[parallel][??][parallel].sub.1] (3.5)

where

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Thus, for fixed [sigma] > 1, the approximation error introduced by the NFFT decays exponentially with the number m of summands in (2.4). Using the tensor product approach, the above error estimates have been generatised for the multivariate setting in [11, 7]. Note furthermore, that it is convenient to replace the periodic window function [??] again by the original one [phi] within the actual computation. This causes an error for functions with large support in time/spatial domain. However, whenever the FFT-length n is reasonable "large, e.g., n [greater than or equal to] max{4m, 12} for the Gaussian, an easy calculation shows that for x [member of] [-m/n, m/n] the estimate

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

holds true, i.e., the resulting error is within machine precision. If the restriction on n is not fulfilled, the NDFT method is competitive, anyway.

In the following, we suggest different methods for the compressed storage and application of the matrix B which are all available within our NFFT library by choosing particular flags in a simple way during the initialisation phase. These methods do not yield a different asymptotic performance but rather yield a lower constant in the amount of computation.

3.1 Fully precomputed window function

One possibility is to precompute all values [phi]([x.sub.j]-[n.sup.-1][dot encircle]l) for j = 0, ..., M-1 and l [member of] [I.sub.n,m]([x.sub.j]) explicitly. Thus, one has to store the large amount of [(2m + 1).sup.d]M real numbers but uses no extra floating point operations during the matrix vector multiplication, beside the necessary [(2m-b 1).sup.d]M flops. Furthermore, we store for this method explicitly the row and column for each nonzero entry of the matrix B. This method, included by the flag PRE_FULL_PSI, is the fastest procedure but can only be used if enough main memory is available.

3.2 Tensor product based precomputation

Using the fact that the window functions are built as tensor products, one can store [[phi].sub.t](([x.sub.j]).sub.t] - [l.sub.t]/[n.sub.t]) for j = 0, ..., M - 1, t = 0, ..., d - 1, and [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] where [([x.sub.j]).sub.t] denotes the t-th component of the j-th node. This method uses a medium amount of memory to store d(2m+1)M real numbers in total. However, one must carry out for each node at most 2[(2m + 1).sup.d] extra multiplications to compute from the factors the multivariate window function [phi]([x.sub.j] - [n.sup.-1] [dot encircle] l) for l [member of] [I.sub.n,m]([x.sub.j]). Note that this technique is available for every window function discussed here and can be used by means of the flag PRE_PSI, which is also the default method within our software library.

3.3 Linear interpolation from a lookup table

For a large number of nodes, M, the amount of memory can be further reduced by the use of lookup table techniques. For a recent example within the framework of gridding, see [4]. We suggest that one precomputes from the even window function the equidistant samples [phi].sub.t]([rm/K[n.sub.t]) for t = 0, ..., d - 1 and r = 0,..., K, K [member of] N, and then compute for the actual node [x.sub.j] during the NFFT the values [[phi].sub.t][(([x.sub.j]).sub.t] - [l.sub.t]/[n.sub.t]) for t = 0, ..., d - 1 and [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] by means of the linear interpolation from its two neighbouring precomputed samples.

Lemma 3.1 For the univariate window functions (3.1) - (3.4) and K [member of] N the linear interpolated window function, denoted by [[phi].sub.K], fulfils

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Proof: From standard error estimates we know that the linear interpolated window function [[phi].sub.K] fulfils

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (3.6)

The maximum of this second derivative is met for the window functions (3.1)-(3.4) at [xi] = 0. Thus, the assertion follows by

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],

and the estimates [M.sub.2m-2] (0) - [M.sub.2m-2] (1) [less than or equal to] 1 and bm cosh(bm) - sinh(bm) [less than or equal to] 2[pi]m [e.sup.2[pi]m].

This method needs only a storage of dK real numbers in total where K depends solely on the target accuracy, but neither on the number of nodes M nor on the multi degree N. Choosing K to be a multiple of m, we further reduce the computational costs during the interpolation since the distance from ([x.sub.j]).sub.t] - [l.sub.t]/[n.sub.t] to the two neighbouring interpolation nodes and, hence, the interpolation weights remain the same for all [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. This method requires 2[(2m+ 1).sup.d] extra multiplications per node and is used within the NFFT by the flag PRE_LIN_PSI.

3.4 Fast Gaussian gridding

Two useful properties of the Gaussian window function (3.1) within the present framework were recently reviewed in [19]. Beside its tensor product structure for d > 1, which also holds for all other window functions, it is remarkable that the number of evaluations of the form exp() can be greatly decreased. More precisely, for d = 1 and a fixed node [x.sub.j] the evaluations of [phi]([x.sub.j]- l'/n), l' [member of] [I.sub.n,m]([x.sub.j]), can be reduced by the splitting

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

where u = min [I.sub.n,m]([x.sub.j]) and l = 0, ..., 2m. Note that the first factor and the exponential within the brackets are constant for each fixed node [x.sub.j]. Once we evaluate the second exponential, its l-th power can be computed consecutively by multiplications only. Furthermore, the last exponential is independent of [x.sub.j] and these 2m + 1 values are computed only once within the NFFT and their amount is negligible. Thus, it is sufficient to store or evaluate 2M exponentials for d = 1. The case d > 1 uses 2dM storages or evaluations by using the general tensor product structure. This method is employed by the flags FG_PSI and PRE_FG_PSI for the evaluation or storage of two exponentials per node, respectively.

3.5 No precomputation of the window function

The last considered method uses no precomputation at all, but rather evaluates the univariate window function [(2m + 1).sup.d]M times. Thus, the computational time depends on how fast we can evaluate the particular window function. However, no additional storage is necessary, which suits this approach whenever the problem size reaches the memory limits of the used computer.

3.6 Summary on the computational costs

The multiplication with the sparse matrix B clearly takes O([m.sup.d]M) operations. Beside this, Table 3.1 summarises the memory requirements for different strategies to store the elements of this matrix and the extra costs it takes to multiply with this compressed matrix.

4 Numerical experiments

We present numerical experiments in order to demonstrate the performance of our algorithms. All algorithms were implemented in C and tested on an AMD Athlon[TM] XP 2700+ with 2GB main memory, SuSe-Linux (kernel 2.4.20-4GBathlon, gcc 3.3) using double precision arithmetic. Moreover, we have used the libraries FFTW 3.0.1 [17] and an extended version of NFFT 2.0.3 [24]. In order to reproduce all of the numerical results, a pre-release of the upcoming NFFT library can be downloaded from our web page [24].

Example 4.1 We start by examining accelerated NDFTs for d = 1. Using the three proposed possibilities to compute the matrix vector product by either M N direct calls of cexp(), a Homer-like scheme, or a fully precomputed matrix A, we obtain the following timings for increasing N in Figure 4.1 (top-left). Clearly, the complete precomputation of the matrix A does not pay off. The Horner-like NDFT uses no extra memory and is considerably faster than the NDFT. Furthermore, it is faster than the default NFFT until an break even of N = 128. []

Example 4.2 Our second example concerns the Taylor expansion based NFFT, again only for d = 1. We note that this scheme actually provides a competitive NFFT with respect to the computation time relative to the problem size, at least within a factor; cf. Figure 4.1 (top-right). The main drawbacks of this approach are its instability, i.e., it is not possible to obtain high accurate results by increasing the order m of the Taylor expansion and, hence, the number of used FFTs. This fact remains even true for a very large oversampling factor [sigma] = 16; see Figure 4.1 (bottom-left). Furthermore, even when the target accuracy [E.sub.[infinity]] = [[parallel]f - s[parallel].sub.[infinity]] / [[parallel]f[parallel].sub.[infinity]], cf. [1], is somewhat larger, the NFFT needs considerably fewer arithmetic operations to reach it, cf. 4.1 (bottom-right). []

Example 4.3 We now compare the computation time for the three tasks within the NFFT, i.e., the deconvolution step, the oversampled FFT, and the convolution/evaluation step for space dimension d = 1. Figure 4.2 shows the timings for increasing degree N, M = N nodes, and a fixed cut-off m = 4. The linear dependence of the computation time with respect to the problem size can be seen for the matrix-vector multiplication with the diagonal matrix D and the sparse matrix B whereas the FFT takes O(N log N) operations.

For the deconvolution step we obtain a speed up of more than 3 by avoiding direct calls of the Fourier-transformed window function [??]; this method is default and turned on by the precomputation flag PRE_PHI_HUT; cf. Figure 4.2 (top-left).

[FIGURE 4.1 OMITTED]

Ways to speed up the FFT by a more exhaustive search of an optimal FFT-plan are discussed in [17]. Figure 4.2 (top-right) shows for larger degree N a speed up of around 2 when we use the planner FFTW_MEASURE, which is also default within the NFFT.

The time to compute the last step of the NFFT differs from no precomputation of the matrix entries of B to explicitly precomputed entries with PRE_FULL_PSI by a factor of 20 to 100 for small degrees IV [less than or equal to] 2048 and by a factor of 5 to 20 for larger degrees. Note, however, that the use of this flag with maximal precomputation is limited by the available memory, e.g., for m = 4, and M = [2.sup.20] we already need 144 MByte just for storing the matrix entries and its indices. []

Example 4.4 Furthermore, we show the timings of the convolution/evaluation step for increasing N, the multi degree N = [(N, ..., N).sup.T], M = [N.sup.d] nodes, a fixed cut-off m = 4, and space dimension d = 2, 3 in Figure 4.3. Note, that for d = 2 and m = 4 the matrix B has already 81 nonzero entries per row. []

Example 4.5 More detailed, we focus on the convolution/evaluation step for space dimension d = 1. Figure 4.4 shows the computation time with respect to achieved target accuracy [E.sub.2] = [[parallel]f - s[parallel].sub.2] / [[parallel]f[parallel].sub.2] cf. [37], by increasing the cut-off m for fixed degree and number of nodes.

We conclude that if no additional memory is used for precomputing the entries of the matrix B, the Gaussian window function in conjunction with the flag FG_PSI performs best; cf. Figure 4.4 (top-left). If no precomputation is used, the particular bad behaviour of the B-Spline window function is due to the fact that evaluating this window function once already takes O(m) operations.

When only a small amount of memory is used for precomputations, the decision between the linear interpolated Kaiser-Bessel window function and the fast Ganssian gridding with precomputation PKE_FG_PSI depends on the accuracy one would like to achieve; here, the linear interpolated Kalser-Bessel window performs better up to single precision (top-right).

Whenever at least 2mM values can be precomputed, the Kaiser-Bessel window performs always best, i.e., it needs the least amount of time to achieve a given target accuracy; cf. Figure 4.4 (bottom). []

[FIGURE 4.2 OMITTED]

[FIGURE 4.3 OMITTED]

[FIGURE 4.4 OMITTED]

Example 4.6 Finally, Figure 4.5 shows the quadratic decay of the error introduced by the linear interpolation of the window function if the method PRE_LIN_PSI is used. The decay of the error [E.sub.2] coincides for all window functions up to the accuracy; they actually can provide for a fixed cut-off m = 10. []

[FIGURE 4.5 OMITTED]

5 Conclusions

Fast algorithms for the nonequispaced discrete Fourier transform have already been known a couple of years. Besides their asymptotic computational complexity of O([N.sub.[pi] log[N.sub.[pi]] + M) for [N.sub.[pi]] equispaced frequencies and M nonequispaced sampling nodes, NFFTs differ substantially in their computation time for interesting problem sizes. For its actual usage, we summarise:

1. If the problem size is really small, e.g., N = M < 32 for d = 1, just use the NDFT or its Horner-like derivative.

2. The simplest fast method is the Taylor expansion based NFFT; it achieves not even single precision, needs a somewhat larger oversampling factor, and is slower than window function based methods.

3. If the problem barely fits into your computer, you should use the fast Gaussian gridding NFFT, i.e., the Gaussian window function in conjunction with the flag FG_PSI which uses no extra memory.

4. Using only a small amount of memory for precomputation and asking for high accuracy, the fast Gaussian gridding NFFT with precomputation per forms best while storing 2d real numbers per node. However, the Kaiser-Bessel window in conjunction with the lookup table method PR]E_LIN_PSI with [2.sup.12] precomputed values suffices for single precision [10.sup.-8], regardless of the problem size, and outperforms the fast Gaussian gridding. Furthermore, the lookup table is the only precomputation method which is independent of the actual sampling set {[x.sub.j]}.

5. If a medium amount of memory can be used for precomputation, the Kaiser-Bessel window function performs best. The tensor product based precomputation scheme PRE_PSI yields a faster NFFT than the lookup table method or the fast Gaussian gridding with precomputation, but stores for each node dm real numbers. For small to medium size problems, one can gain another factor 2 to 5 by means of a fully precomputed window function PRE_FULL_PSI. However, this causes a storage cost of [m.sup.d] real numbers per sampling node.

6. Default precomputation methods, selected by the simple initialisation routine of the NFFT, are: PRE_PHI_HUT for the deconvolution step, the flag FFTW_MEASURE for planning the FFT, and the tensor product based precomputation scheme PRE_PSI for the convolution/evaluation step. Furthermore, the Kaiser-Bessel window function is selected as default at compilation.

ACKNOWLEDGEMENT

The first author is grateful for support of this work by the German Academic Exchange Service (DAAD) and the warm hospitality during his stay at the Numerical Harmonic Analysis Group, University of Vienna. Furthermore, we would like to thank Matt Fulkerson for contributing an early version of the fast Gaussian gridding to the NFFT software.

References

[1] C. Anderson and M. Dahleh, Rapid computation of the discrete Fourier transform, SIAM J. Sci. Comput., 17, 913-919, 1996.

[2] A. Averbuch, R. Coifman, D. L. Donoho, M. Elad, and M. Israeli, Fast and accurate polar Fourier transform, Appl. Comput. Harmon. Anal., 21, 145 -167, 2006.

[3] S. Bagchi and S. K. Mitra, The nonuniform discrete Fourier transform. In F. Marvasti, editor, Nonuniform Sampling: Theory and Practice, Kluwer/Plenum, 2001.

[4] P. J. Beatty, D. G. Nishimura, and J. M. Pauly, Rapid gridding reconstruction with a minimal oversampling ratio, IEEE Trans. Med. Imag., 24, 799 -808, 2005.

[5] G. Beylkin, On the fast Fourier transform of functions with singularities, Appl. Comput. Harmon. Anal., 2, 363-381, 1995.

[6] E. J. Candes, L. Demanet, D. L. Donoho, and L. Ying, Fast discrete curvelet transforms, SIAM Multiscale Model. Simul., 3, 861-899, 2006.

[7] A. J. W. Duijndam and M. A. Schonewille, Nonuniform fast Fourier transform, Geophysics, 64, 539-551, 1999.

[8] A. Dutt and V. Rokhlin, Fast Fourier transforms for nonequispaced data, SIAM J. Sci. Star. Comput., 14, 1368-1393, 1993.

[9] A. Dutt and V. Rokhlin, Fast Fourier transforms for nonequispaced data II, Appl. Comput. Harmon. Anal., 2, 85-100, 1995.

[10] H. Eggers, T. Knopp, and D. Potts, Field inhomogeneity correction based on gridding reconstruction, IEEE Trans. Med. Imag., 26, 374-384, 2007.

[11] B. Elbel and G. Steidl, Fast Fourier transform for nonequispaced data. In C. K. Chui and L. L. Schumaker, editors, Approximation Theory IX, 39-46, Vanderbilt University Press, Nashville, 1998.

[12] G. E. Fasshauer and J. Zhang, Recent results for moving least squares approximation. In L. Lucian and M. Neamtu, editors, Geometric Modeling and Computing, 163-176, Nashboro Press, Brentwood, 2003.

[13] H. G. Feichtinger, K. Gr6chenig, and T. Strohmer, Efficient numerical methods in non-uniform sampling theory, Numer. Math., 69, 423-440, 1995.

[14] M. Fenn, S. Kunis, and D. Potts, On the computation of the polar FFT, Appl. Comput. Harmon. Anal., 22, 257-263, 2007.

[15] J. A. Fessler and B. P. Sutton, Nonuniform fast Fourier transforms using min-max interpolation, IEEE Trans. Signal Process., 51, 560-574, 2003.

[16] K. Fourmont, Non equispaced fast Fourier transforms with applications to tomography, J. Fourier Anal. Appl., 9, 431-450, 2003.

[17] M. Frigo and S. G. Johnson, FFTW, C subroutine library, http://www.fftw.org.

[18] D. Gottlieb, B. Gustafsson, and P. Forssen, On the direct Fourier method for computer tomography, IEEE Trans. Med. Imag., 9, 223-232, 2000.

[19] L. Greengard and J.-Y. Lee, Accelerating the nonuniform fast Fourier transform, SIAM Rev., 46, 443-454, 2004.

[20] K. Grochenig, A discrete theory of irregular sampling, Lin. Alg. Appl., 193, 129-150, 1993.

[21] J. A. Hogan and J. D. Lakey, Time-Frequency and Time-Scale Methods: Wavelets, Sampling, Uncertainty Principles, Applied and Numerical Harmonic Analysis series, Birkhauser, Boston, 2005.

[22] J. I. Jackson, C. H. Meyer, D. G. Nishimura, and A. Macovski, Selection of a convolution function for Fourier inversion using gridding, IEEE Trans. Med. Imag., 10, 473-478, 1991.

[23] J. Keiner, S. Kunis, and D. Potts, Fast summation of radial functions on the sphere, Computing, 78, 1-15, 2006.

[24] S. Kunis and D. Potts, NFFT, Softwarepackage, C subroutine library, http://www.tu-chemnitz.de/~potts/nfft, 2002-2006.

[25] S. Kunis and D. Potts, Fast spherical Fourier algorithms, J. Comput. Appl. Math., 161, 75-98, 2003.

[26] S. Kunis and D. Potts, Stability results for scattered data interpolation by trigonometric polynomials, SIAM J. Sci. Comput., 29, 1403-1419, 2007.

[27] J. Ma and M. Fenn, Combined complex ridgelet shrinkage and total variation minimization, SIAM J. Sci. Comput., 28, 984-1000, 2006.

[28] S. Matej, J. A. Fessler, and I. G. Kazantsev, Iterative tomographic image reconstruction using Fourier-based forward and back-projectors, IEEE Trans. Med. Imag., 23, 401-412, 2004.

[29] F. Natterer and F. Wubbeling, Mathematical Methods in Image Reconstruction, SIAM, Philadelphia, 2000.

[30] N. Nguyen and Q. H. Liu, The regular Fourier matrices and nonuniform fast Fourier transforms, SIAM J. Sci. Comput., 21, 283-293, 1999.

[31] A. Nieslony and G. Steidl, Approximate factorizations of Fourier matrices with nonequispaced knots, Linear Algebra Appl., 266, 337-351, 2003.

[32] J. D. O'Sullivan, A fast sinc function gridding algorithm for Fourier inversion in computer tomography, IEEE Trans. Med. Imag., 4, 200-207, 1985.

[33] D. Potts, Schnelle Fourier-Transformationen fur Nichtaquidistante Daten und Anwendungen, Habilitation, Universitat zu Lubeck, 2003. http://www.tu-chemnitz.de/~potts

[34] D. Potts and G. Steidl, A new linogram algorithm for computerized tomography, IMA J. Numer. Anal., 21, 769-782, 2001.

[35] D. Potts and G. Steidl, Fast summation at nonequispaced knots by NFFTs, SIAM J. Sci. Comput., 24, 2013-2037, 2003.

[36] D. Potts, G. Steidl, and A. Nieslony, Fast convolution with radial kernels at nonequispaced knots, Numer. Math., 98, 329-351, 2004.

[37] D. Potts, G. Steidl, and M. Tasche, Fast Fourier transforms for nonequispaced data: A tutorial. In J. J. Benedetto and P. J. S. G. Ferreira, editors, Modern Sampling Theory: Mathematics and Applications, 247-270, Birkhauser, Boston, 2001.

[38] G. Steidl, A note on fast Fourier transforms for nonequispaced grids, Adv. Comput. Math., 9, 337-353, 1998.

[39] B. P. Sutton, D. C. Noll, and J. A. Fessler, Fast, iterative, field-corrected image reconstruction for MRI in the presence of field inhomogeneities, IEEE Trans. Med. Imag., 22, 178-188, 2003.

[40] M. Tasche and H. Zeuner, Roundoff error analysis for fast trigonometric transforms. In Handbook of Analytic-Computational Methods in Applied Mathematics, 357-406, CRC Press, Boca Raton, FL, 2000.

[41] A. F. Ware, Fast approximate Fourier transforms for irregularly spaced data. SIAM Rev., 40, 838-856, 1998.

Stefan Kunis

Department of Mathematics, Chemnitz University of Technology

Reichenhainer Str. 39, 09107 Chemnitz, Germany

kunis@mathematik.tu-chemnitz.de

Daniel Potts

Department of Mathematics, Chemnitz University of Technology

Reichenhainer Str. 39, 09107 Chemnitz, Germany

potts@mathematik.tu-chemnitz.de

Table 2.1: Number of precomputed and stored complex exponentials (memory), the order of magnitude for the number of floating point operations (flops), and the number of evaluations for the function cexp() (evaluations). NDFT method memory flops evaluations standard -- [MN.sub.[pi]] [MN.sub.[pi]] Horner-like -- [MN.sub.[pi]] 2dM fully precomputed [MN.sub.[pi]] [MN.sub.[pi]] -- Table 2.2: Computational requirements for the deconvolution step in Algorithm 2. Method memory flops evaluations -- -- [N.sub.[pi]] [N.sub.[pi]] PRE_PHI_HUT [N.sub.0] + ... [N.sub.[pi]] -- + [N.sub.d-1] Table 3.1: Theoretical order of magnitude for memory requirements, extra floating point operations, and the evaluations of the window function [phi]. Furthermore, at most 2[(2m + 1).sup.d] multiplications are used within each scheme besides PRE_FULL_PSI to compute the multivariate window function out of its univariate factors, denoted by *. Method memory extra flops evaluations -- -- * [m.sup.d]M FG_PSI -- dm, * dM PRE_LIN_PSI dK dm, * -- PRE_FG_PSI dM dm, * -- PRE_PSI dmM * -- PRE_FULL_PSI [m.sup.d]M -- --

Printer friendly Cite/link Email Feedback | |

Title Annotation: | fast Fourier transform |
---|---|

Author: | Kunis, Stefan; Potts, Daniel |

Publication: | Sampling Theory in Signal and Image Processing |

Article Type: | Technical report |

Geographic Code: | 4EUGE |

Date: | Jan 1, 2008 |

Words: | 6837 |

Previous Article: | The tensor product of frames. |

Next Article: | Preface. |

Topics: |

## Reader Opinion