# A computational realization of a semi-Lagrangian method for solving the advection equation.

1. Introduction

Many physical phenomena in transport processes are modeled by time-dependent hyperbolic conservation laws [1-4]. The finite volume method (FVM) is a standard conservative approach to constructing numerical approximations for hyperbolic conservation problems. Modern modifications of FVM [5-8] provide well-established conservative methods for solving the governing advection equations, and some of them were developed to treat high gradients and discontinuities of a solution [7, 8]. Despite these advances, such methods suffer from a time-step restriction imposed mainly by stability requirements. On the other hand, during the last three decades the idea of applying the method of characteristics to advect quantities forward in time has developed rapidly and gained popularity in many areas [9-13]. In contrast to traditional Eulerian schemes, semi-Lagrangian algorithms provide unconditional stability and allow large time steps. Despite their unconditional stability, these methods are explicit, and therefore they are well suited for parallelization. Semi-Lagrangian methods are now intensively studied, and their efficiency for convection-dominated problems has been demonstrated. For a more detailed comparison of traditional Eulerian and semi-Lagrangian schemes for hyperbolic conservation laws, see [6, 14, 15].

Semi-Lagrangian algorithms were initially developed, as methods of characteristics, for applications in climate prediction [16-21]. The simplest schemes approximate a trajectory (a curvilinear characteristic) by a straight line and employ low-order interpolation to compute a numerical solution. The simplicity and efficiency of these schemes make them quite popular in different fields of numerical modeling, such as fluid dynamics applications [9, 12, 22], shallow water equations [10], fiber dynamics described by the Fokker-Planck equation [11], the heat-conduction equation [23], and so forth. Modern semi-Lagrangian algorithms involve a higher-order approximation of a curvilinear characteristic and employ higher-order interpolation; see, for example, [22]. Recently, considerable efforts have been made to construct conservative semi-Lagrangian methods [9, 20, 24-28]. For instance, Scroggs and Semazzi [9] presented a semi-Lagrangian finite volume method that used a rectangular grid for a system of conservation laws and satisfied the discrete conservation relation, although their numerical results demonstrate some violation of full conservation. Early modifications of the semi-Lagrangian approach use a rectangular grid that is fixed throughout the simulation [9, 17, 24, 25]. Semi-Lagrangian schemes allow the spatial grids at different time levels to be chosen independently of one another; as a result, adaptive grids are widely used in modern versions of this approach [20, 28]. In spite of the progress in semi-Lagrangian methods, convergence has not been theoretically proved for most of them [9-12, 20, 24, 28].

In this paper, we present a sketch of the theoretical proof and a numerical justification for a difference scheme of the semi-Lagrangian family. We start with a theorem about an exact equality that involves two spatial integrals over domains at neighboring time levels and a third integral over an inflow boundary. To prove convergence theoretically, we use a square grid, bilinear interpolation, and the fourth-order Runge-Kutta method for the approximation of characteristics. This allows us to prove first-order convergence in a discrete analogue of the $L_1$-norm. The theoretical convergence estimates are confirmed by numerical results.

In the remaining part of the paper, several parallel implementations of this method are studied. We discuss the design subtleties of parallel implementations of the algorithm using OpenMP technology for shared-memory computing systems and CUDA technology for general-purpose GPU programming. In addition, the influence of Hyper-Threading technology on the performance of our OpenMP code is studied. Moreover, for our CUDA codes we discuss the implementation difficulties and the performance on hybrid-architecture computing systems.

2. Formulation of the Problem

Let $D = [0,1] \times [0,1]$. In the closed domain $[0,T] \times D$ consider the two-dimensional advection equation

$$\frac{\partial \rho}{\partial t} + \frac{\partial (u\rho)}{\partial x} + \frac{\partial (v\rho)}{\partial y} = 0, \quad (1)$$

where $u(t,x,y)$ and $v(t,x,y)$ are known and sufficiently smooth in $[0,T] \times D$. For simplicity we suppose that for all $t \in [0,T]$ the coefficients satisfy the no-slip conditions at the upper and lower sides of $D$

$$u(t,x,y)\big|_{y=0} = u(t,x,y)\big|_{y=1} = 0, \qquad v(t,x,y)\big|_{y=0} = v(t,x,y)\big|_{y=1} = 0 \quad (2)$$

and the flow conditions at the left and right sides of D

$$u(t,x,y)\big|_{x=0} \ge 0, \qquad u(t,x,y)\big|_{x=1} \ge 0. \quad (3)$$

For the unknown function $\rho(t,x,y)$ the following initial and boundary conditions are specified:

$$\forall (x,y) \in D: \quad \rho(0,x,y) = \rho_{\mathrm{init}}(x,y), \quad (4)$$

$$\forall (t,y) \in [0,T] \times [0,1]: \quad \rho(t,0,y) = \rho_{\mathrm{in}}(t,y). \quad (5)$$

3. Numerical Scheme

Subdivide the time segment $[0,T]$ into $K$ time levels $t_k = k\tau$, $k = 0, \dots, K$, with the time step $\tau = T/K$. Let $\Omega$ be a closed quadrangle at the time level $t_k$. For each of its points, on the segment $t \in [t_{k-1}, t_k]$ we construct the characteristics defined by the system of ordinary differential equations

$$\bar{x}'(t) = u\left(t, \bar{x}(t), \bar{y}(t)\right), \qquad \bar{y}'(t) = v\left(t, \bar{x}(t), \bar{y}(t)\right), \quad (6)$$

with the initial value at level $t = t_k$ as a parameter:

$$\bar{x}(t_k) = x_0, \qquad \bar{y}(t_k) = y_0, \qquad (x_0, y_0) \in \Omega. \quad (7)$$

With the help of these characteristics the edges of $\Omega$ generate four surfaces $S_n$, $n = 1, \dots, 4$, with the edges $C_n$ at $t = t_{k-1}$ (Figure 1).

If $\Omega$ is located near the inflow boundary $x = 0$, the surfaces $S_n$ can cross the plane $x = 0$. In this case we get an additional curvilinear polygon $I$ on the plane $x = 0$ (Figure 2).

Generally speaking, $I$ and $Q$ can be triangular, pentagonal, or empty domains. If one of them is empty, the integral over it is taken to be zero. Since there is no fundamental difference, we consider only the most common case of quadrangular domains. For $\Omega$, $Q$, and $I$ the following statement is valid.

Theorem 1. For a smooth solution of the problem (1)-(5) we have the equality

$$\int_{\Omega} \rho(t_k, x, y)\, d\Omega = \int_{Q} \rho(t_{k-1}, x, y)\, dQ + \int_{I} (\rho u)(t, 0, y)\, dI. \quad (8)$$

Proof. Denote by $V$ the volume bounded by $\Omega$, $Q$, $I$, and the surfaces $S_n$, and denote its boundary by $\Gamma$. Apply the Gauss-Ostrogradsky theorem to the left-hand side of the equality

$$\int_{V} \left( \frac{\partial \rho}{\partial t} + \frac{\partial (u\rho)}{\partial x} + \frac{\partial (v\rho)}{\partial y} \right) dV = 0. \quad (9)$$

Then

$$\int_{\Gamma} (\rho,\, u\rho,\, v\rho) \cdot (n_t, n_x, n_y)^T\, d\Gamma = 0. \quad (10)$$

Here $(n_t, n_x, n_y)$ is the outer normal to $\Gamma$ and "$\cdot$" denotes the scalar product. The normal $(n_t, n_x, n_y)$ equals $(1,0,0)$ on $\Omega$, $(-1,0,0)$ on $Q$, and $(0,-1,0)$ on $I$. For any $S_n$ the normal $(n_t, n_x, n_y)$ is orthogonal to all tangent directions of $S_n$, including the tangent of the characteristics $(1 + u^2 + v^2)^{-1/2}\,(1, u, v)$. Therefore $(1, u, v) \cdot (n_t, n_x, n_y)^T = 0$. Taking this into account in (10), we get the equality

$$\int_{\Omega} \rho(t_k, x, y)\, d\Omega - \int_{Q} \rho(t_{k-1}, x, y)\, dQ - \int_{I} (\rho u)(t, 0, y)\, dI = 0, \quad (11)$$

which is equivalent to the statement of the theorem.

Now construct the uniform mesh $D_h$ with mesh size $h = 1/N$, $N \ge 2$:

$$D_h = \left\{ (x_i, y_j) : x_i = ih,\ y_j = jh;\ i, j = 0, \dots, N \right\}. \quad (12)$$

We seek an approximate solution $\rho^h(t,x,y)$ at each time level $t = t_r$, $r = 0, \dots, K$, as a grid function with the values

$$\hat{\rho}^{\,r}_{i,j} = \rho^h(t_r, x_i, y_j) \quad \forall i, j = 0, \dots, N, \quad (13)$$

unlike the values of the exact solution

$$\rho^{\,r}_{i,j} = \rho(t_r, x_i, y_j). \quad (14)$$

To construct the difference scheme with a variable stencil, we suppose that the function $\rho^h$ is already known at time level $t_{k-1}$ and must be found at level $t_k$. To compute $\hat{\rho}^{\,k}_{i,j}$ for some $i, j = 1, 2, \dots, N-1$, we take the square $\Omega_{i,j}$ with the four vertices $(x_i \pm h/2,\ y_j \pm h/2)$ and apply Theorem 1. To determine $\hat{\rho}^{\,k}_{i,j}$ on the boundary of $D$ we use the rectangles $\Omega_{i,j}$ adjoining this boundary from inside $D$ (Figure 3).

Note that $\hat{\rho}^{\,k}_{0,j} = \rho^{\,k}_{0,j}$ are known from the boundary condition (5).

Without loss of generality we describe the construction of the difference equations only for the inner nodes with $i, j = 1, 2, \dots, N-1$. Thus, due to Theorem 1 we get

$$\int_{\Omega_{i,j}} \rho(t_k, x, y)\, d\Omega = \int_{Q^{k-1}_{i,j}} \rho(t_{k-1}, x, y)\, dQ + \int_{I^{k-1}_{i,j}} (\rho u)(t, 0, y)\, dI. \quad (15)$$

Here the curvilinear polygons $Q^{k-1}_{i,j}$ and $I^{k-1}_{i,j}$ are formed by the characteristics (6) issuing from the edges of the square $\Omega_{i,j}$. To compute the integrals on the right-hand side of (15) numerically, we first replace the exact function $\rho(t_{k-1}, x, y)$ by its bilinear interpolant

$$\rho^{I}_{h}(t_{k-1}, x, y) = \sum_{q=0}^{N} \sum_{p=0}^{N} \rho^{\,k-1}_{p,q}\, \varphi_{p,q}(x, y) \quad (16)$$

with the help of the basis functions

$$\varphi_{p,q}(x, y) = \varphi_p(x)\, \varphi_q(y), \qquad \varphi_p(x) = \max\left(0,\ 1 - \frac{|x - x_p|}{h}\right), \quad (17)$$

with $\varphi_q(y)$ defined analogously.

To compute the integral over the domain $I^{k-1}_{i,j}$, we also use the bilinear interpolant

$$(\rho u)^{I}_{\tau}(t, y) = \sum_{q=0}^{N} \sum_{r=0}^{K} \rho_{\mathrm{in}}(t_r, y_q)\, u(t_r, 0, y_q)\, \psi_{r,q}(t, y) \quad (18)$$

with the basis functions

$$\psi_{r,q}(t, y) = \psi_r(t)\, \varphi_q(y), \qquad \psi_r(t) = \max\left(0,\ 1 - \frac{|t - t_r|}{\tau}\right). \quad (19)$$

The left-hand side of (15) is approximated by the midpoint quadrature rule with second-order accuracy:

$$\int_{\Omega_{i,j}} \rho(t_k, x, y)\, d\Omega \approx \rho^{\,k}_{i,j}\, \mathrm{mes}\left(\Omega_{i,j}\right). \quad (20)$$

So, instead of the exact equality (15) we get the approximate one:

$$\rho^{\,k}_{i,j}\, \mathrm{mes}\left(\Omega_{i,j}\right) \approx \int_{Q^{k-1}_{i,j}} \rho^{I}_{h}(t_{k-1}, x, y)\, dQ + \int_{I^{k-1}_{i,j}} (\rho u)^{I}_{\tau}(t, y)\, dI. \quad (21)$$

To simplify the numerical computation of the right-hand side of (21), we approximate the domains $Q^{k-1}_{i,j}$ and $I^{k-1}_{i,j}$ by simpler ones. Since in the more general case both domains are curvilinear quadrangles, we demonstrate the approximation only for a quadrangular $Q^{k-1}_{i,j}$. Introduce four additional points $(x_i \pm h/2,\ y_j)$ and $(x_i,\ y_j \pm h/2)$ on the square $\Omega_{i,j}$ at time level $t_k$ and denote the eight nodes by $A_n$, $n = 1, \dots, 8$. From each $A_n$ we pass the corresponding characteristic to the time level $t_{k-1}$, which gives the point $B_n = (\bar{x}_n, \bar{y}_n)$ (Figure 4).

To compute the coordinates of the point $B_n$ numerically, we solve the system of ordinary differential equations (6) with the initial condition

$$\left(\bar{x}(t_k),\ \bar{y}(t_k)\right) = A_n, \quad (22)$$

by the fourth-order Runge-Kutta method [29]. Thus we find the approximation $B^h_n = (\hat{x}_n, \hat{y}_n)$ of the point $B_n$. The nodes $B^h_n$, $n = 1, \dots, 8$, define the polygon $P^{k-1}_{i,j}$, which is considered as a quadrangle with four parabolic edges (Figures 4-5). The constructed domain $P^{k-1}_{i,j}$ approximates $Q^{k-1}_{i,j}$. In the same way we construct the polygon $L^{k-1}_{i,j}$, which approximates $I^{k-1}_{i,j}$. For this approximation the following statement is valid [30].

Lemma 2. Let the coordinates of the nodes $B_n$, $n = 1, \dots, 8$, be computed within fourth-order accuracy, $|\bar{x}_n - \hat{x}_n| = O(\tau^4)$, $|\bar{y}_n - \hat{y}_n| = O(\tau^4)$. Assume that the ratio between $\tau$ and $h$ is fixed: $\tau = \tilde{c}\, h$. Then for all $i, j = 0, \dots, N$

$$\mathrm{mes}\left(Q^{k-1}_{i,j} \setminus P^{k-1}_{i,j}\right) + \mathrm{mes}\left(P^{k-1}_{i,j} \setminus Q^{k-1}_{i,j}\right) = O(h^4),$$

$$\mathrm{mes}\left(I^{k-1}_{i,j} \setminus L^{k-1}_{i,j}\right) + \mathrm{mes}\left(L^{k-1}_{i,j} \setminus I^{k-1}_{i,j}\right) = O(h^4), \quad (23)$$

where $\mathrm{mes}(\Omega)$ denotes the measure of the domain $\Omega$.

Thus, replacing $Q^{k-1}_{i,j}$ by $P^{k-1}_{i,j}$ and $I^{k-1}_{i,j}$ by $L^{k-1}_{i,j}$, $i = 1, \dots, N$, $j = 0, \dots, N$, reduces the approximate equality (21) to another one:

$$\rho^{\,k}_{i,j}\, \mathrm{mes}\left(\Omega_{i,j}\right) \approx \int_{P^{k-1}_{i,j}} \rho^{I}_{h}(t_{k-1}, x, y)\, dP + \int_{L^{k-1}_{i,j}} (\rho u)^{I}_{\tau}(t, y)\, dL. \quad (24)$$

Divide it by $\mathrm{mes}(\Omega_{i,j})$ and replace $\rho^{I}_{h}$ by the interpolant $\hat{\rho}^{I}_{h}$ of the known grid function $\rho^h$ at the level $t_{k-1}$:

$$\hat{\rho}^{I}_{h}(t_{k-1}, x, y) = \sum_{q=0}^{N} \sum_{p=0}^{N} \hat{\rho}^{\,k-1}_{p,q}\, \varphi_{p,q}(x, y). \quad (25)$$

As a result, we get the equation for finding $\hat{\rho}^{\,k}_{i,j}$ as an approximation of $\rho^{\,k}_{i,j}$:

$$\hat{\rho}^{\,k}_{i,j} = \frac{1}{\mathrm{mes}\left(\Omega_{i,j}\right)} \left[ \int_{P^{k-1}_{i,j}} \hat{\rho}^{I}_{h}(t_{k-1}, x, y)\, dP + \int_{L^{k-1}_{i,j}} (\rho u)^{I}_{\tau}(t, y)\, dL \right], \quad (26)$$

$$\hat{\rho}^{\,k}_{0,j} = \rho_{\mathrm{in}}(t_k, y_j) \quad \forall j = 0, \dots, N. \quad (27)$$

To compute the integrals numerically, we decompose the domain $P^{k-1}_{i,j}$ (or $L^{k-1}_{i,j}$) into several triangles, each lying in only one cell $[x_p, x_{p+1}] \times [y_q, y_{q+1}]$, $p, q = 0, 1, \dots, N-1$, and having one parabolic edge and two straight edges parallel to the coordinate axes. Then we replace the integral over the domain $P^{k-1}_{i,j}$ (or $L^{k-1}_{i,j}$) by the sum of the integrals over these triangles. Since the interpolant is bilinear on each cell, these integrals are computed exactly, without any quadrature rule.

To evaluate the order of convergence, we use the discrete analogue of the $L_1(D)$-norm:

$$\left\| \rho^h(t_k, \cdot, \cdot) \right\|_{L^h_1} = \sum_{i=0}^{N} \sum_{j=0}^{N} \left| \rho^h(t_k, x_i, y_j) \right| \mathrm{mes}\left(\Omega_{i,j}\right). \quad (28)$$

For the numerical solution computed by (26) the following theorem is valid.

Theorem 3. Let the solution $\rho(t,x,y)$ of the problem (1)-(5) be sufficiently smooth and let the discrete solution $\rho^h$ be computed by (26)-(27). Assume that $\tau = \tilde{c}\, h$. Then for all $k = 0, 1, \dots, K$ we have the estimate

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (29)

with the constants [c.sub.1] and [c.sub.2] independent of k, h, [tau], and [??].

Proof. We use induction on $k$. For $k = 0$ inequality (29) holds because of the exact initial condition (4). Suppose the estimate (29) holds for some $k - 1 \ge 0$ and prove it for $k$.

From Theorem 1, for all $i = 1, \dots, N$, $j = 0, \dots, N$, we get

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (30)

For the first integral we have the equality

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (31)

where

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (32)

In the second and third integrals we replace the polygons $Q^{k-1}_{i,j}$ and $I^{k-1}_{i,j}$ by $P^{k-1}_{i,j}$ and $L^{k-1}_{i,j}$, respectively, according to the foregoing approximation. Then we replace $\rho(t_{k-1}, x, y)$ and $\rho(t, 0, y)\, u(t, 0, y)$ by their piecewise bilinear interpolants. Due to Lemma 2 and the boundedness of the functions $\rho(t,x,y)$ and $u(t,x,y)$, formulae (30)-(31) can be represented in the following way:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (33)

where $|\gamma^{k-1}_{i,j}| \le \tilde{c}_3 h^4$, $|\eta^{k-1}_{i,j}| \le \tilde{c}_4 h^2$, $|\theta^{k-1}_{i,j}| \le \tilde{c}_5 \tau h$. Now multiply (26) by $\mathrm{mes}(\Omega_{i,j})$ and subtract it from (33). Then we have

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (34)

Now sum up the moduli of the last equation over all $i = 1, \dots, N$, $j = 0, \dots, N$ and use the decomposition

$$\hat{\rho}^{\,k-1}_{i,j} = \rho^{\,k-1}_{i,j} + \xi^{\,k-1}_{i,j} \quad (35)$$

at level $t_{k-1}$ with a grid function $\xi^{\,k-1}(x, y)$ satisfying the estimate

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (36)

due to the induction hypothesis. Then we get

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (37)

Since

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (38)

we have

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (39)

Finally consider the transformations

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (40)

This leads to the inequality

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (41)

Denote $c_1 = \tilde{c}_1 + 2\tilde{c}_2 + \tilde{c}_3 + \tilde{c}_4$ and $c_2 = \tilde{c}_5$. Thus, due to the relation $\tau = \tilde{c}\, h$, we get inequality (29).

Corollary 4. Let the conditions of Theorem 3 be valid. Then for $t_k = T$ one has the estimate

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (42)

Remark 5. Let the functions $\rho_{\mathrm{init}}(x, y)$ and $\rho_{\mathrm{in}}(t, y)$ in the initial and boundary conditions (4)-(5) be positive. Then the interpolants $\hat{\rho}^{I}_{h}(t_0, x, y)$ and $(\rho u)^{I}_{\tau}(t, y)$ are nonnegative due to (3). Their integration results in nonnegative values in (26). Thus, by induction we can prove the inequality

$$\rho^h(t_r, x_i, y_j) \ge 0 \quad \forall r = 1, \dots, K,\ i, j = 0, \dots, N. \quad (43)$$

Remark 6. The strategy of approximating the domain with 8 nodes is, of course, not optimal. In fact, 4 nodes are enough for rectangles, but the theoretical justification then becomes much more complicated. We do not demonstrate it here, since the main purpose of the paper is to study the parallel properties of the proposed algorithm. Such an optimization reduces the number of arithmetic operations but has no influence on the parallel properties of the algorithm. The same applies to difference schemes of higher order.

4. The Numerical Algorithm and Its Parallel Implementations

The constructed algorithm is implemented in the following way.

Algorithm 7 (sequential).

(1) Set $\rho^h(0, x_i, y_j) = \rho_{\mathrm{init}}(x_i, y_j)$, $i, j = 0, \dots, N$, as the initial data (4).

(2) Time loop: for each time step k = 1, ..., K do:

Space loop: for each cell [[OMEGA].sub.i,j], i = 1, ..., N, j = 0, ..., N do:

(2.1) For each node $A_n$, $n = 1, \dots, 8$, solve the system (6) with the initial condition (22) and determine the corresponding vertex coordinates $B^h_n = (\hat{x}_n(t_{k-1}), \hat{y}_n(t_{k-1}))$ of the polygon $P^{k-1}_{i,j}$.

(2.2) If a characteristic $A_n B^h_n$ intersects the plane $x = 0$, determine the coordinates of the intersection point.

(2.3) Compute

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (44)

according to (26), where the integrals are calculated separately over each nonempty intersection $P^{k-1}_{i,j} \cap \left([x_p, x_{p+1}] \times [y_q, y_{q+1}]\right)$ and $L^{k-1}_{i,j} \cap \left([t_{k-1}, t_k] \times [y_q, y_{q+1}]\right)$.

(2.4) Compute $\rho^h(t_k, x_i, y_j) = J^{k-1}_{i,j} / \mathrm{mes}(\Omega_{i,j})$.

The end of the space loop.

(2.5) Put $\rho^h(t_k, 0, y_j) = \rho_{\mathrm{in}}(t_k, y_j)$ for all $j = 0, \dots, N$.

(2.6) If necessary, calculate the norms of the solution, the error, and other statistics at the current time step.

The end of the time loop.

Note that items (2.1)-(2.3) are compute-intensive, especially the procedure of determining the mutual arrangement of $P^{k-1}_{i,j}$ and the cells $[x_p, x_{p+1}] \times [y_q, y_{q+1}]$, $p, q = 0, \dots, N-1$, of the previous time level in item (2.3).

The algorithm is explicit in time, since computing $\rho^h(t_k, x, y)$ at each time level $t_k$ uses data only from the previous time level $t_{k-1}$.

Another advantage of Algorithm 7 is data independence in the general space loop; that is, items (2.1)-(2.4) are carried out independently for any pair $(i, j)$, $i = 1, \dots, N$, $j = 0, \dots, N$. This enables data parallelism.

In the shared-memory case, with OpenMP it is sufficient to parallelize the general space loop at each time level using a directive like the following:

#pragma omp parallel for collapse(2) ...

For the parallelization to be correct, all variables holding intermediate results of items (2.1)-(2.4) must have the private data-sharing attribute for each thread.

Another natural approach to parallelizing the algorithm is to use NVIDIA CUDA technology for GPUs. The main aspects of the parallel implementation related to various features of general-purpose GPU programming are briefly discussed below.

All functions used in the numerical calculations on a CPU must be recompiled for a GPU. With NVIDIA CUDA these functions must be declared with the special qualifiers __host__ __device__, which indicate that the NVCC compiler creates two versions of the executable code, one for the CPU (host) and one for the GPU (device). The GPU calls the device version of a function, while the CPU calls its host version.

The principles of efficient CUDA programming are as follows: (1) the maximal use of inherent parallelism of the problem and (2) the optimization of memory access.

The first version of our parallel CUDA algorithm is based on the inherent parallelism of our explicit numerical approach. Every thread treats only one cell $\Omega_{i,j} \in D_h$; hence the space loop body (items (2.1)-(2.4) of Algorithm 7) becomes the general computation kernel.

When programming for a GPU, correctly defining the kernel launch configuration is important. The configuration includes two parameters: the number of blocks in a grid (blockCount) and the number of threads per block (blockSize). Our NVIDIA GPU hardware has a limit of 1024 threads per block. In the first CUDA algorithm no optimization of the number of threads is used.

Consequently, the simplest parallel version of Algorithm 7 for the CPU/GPU hybrid architecture is the following.

Algorithm 8 (CUDA parallel, version 1).

(1) Calculate the kernel configuration (blockSize, blockCount) using data about a computational domain.

(2) Allocate host (CPU) and device (GPU) memory; copy initial data from host to device.

(3) Time loop: for each time step k = 1, ..., K do:

(3.1) Call the first CUDA kernel (basic):

(3.1.1) For each cell [[OMEGA].sub.i,j], i = 1, ..., N, j = 0, ..., N, execute items (2.1)-(2.4) of Algorithm 7 in parallel.

(3.2) Synch point: wait for calculations to be completed.

(3.3) Call the second CUDA kernel (assistive):

(3.3.1) Copy data from an actual time level array to the previous one in parallel.

(3.4) Synch point: wait for copying to be completed.

(3.5) If necessary, copy results from device to host.

(3.6) If necessary, calculate the norms of a solution, an error, and other statistic data with respect to an actual time step.

The end of the time loop.

(4) Copy the results from device to host.

In order to decrease the execution time of Algorithm 8, items (3.5)-(3.6) must be performed as rarely as possible, for instance, only at time levels where accuracy control, data for plotting, and so forth are needed.

Algorithm 8 has two general disadvantages: (1) small speedup in comparison with the sequential version (Figure 8) and (2) impossibility of execution on a fine computational mesh (Table 2). What causes these problems?

First, the general loop contains many selection statements.

The main branching occurs in item (3.1.1) of Algorithm 8 (more precisely, in item (2.3) of the sequential Algorithm 7): the cells whose trajectories intersect the boundary and the internal cells are processed in different ways. We can use two different kernels that execute in parallel.

In Algorithm 8 we use data parallelism only. However, there is another class of parallelism to be exploited on NVIDIA GPUs, similar to the task parallelism found in multithreaded CPU applications. NVIDIA CUDA task parallelism is based on CUDA streams. A CUDA stream represents a queue of GPU operations such as kernel launches, memory copies, and event starts and stops. The order in which operations are added to a stream specifies the order in which they will be executed. Each stream may be considered a certain task on the GPU, and such tasks have opportunities to execute in parallel [31]. Thus we apply CUDA streams to our two kernels to improve parallelism and total GPU utilization.

Item (2.3) also contains many selections for determining the mutual arrangement of $P^{k-1}_{i,j}$ and the cells of the previous-time-level mesh. Unfortunately, we cannot avoid these selections.

Secondly, the CUDA kernel in (3.3) of Algorithm 8 performs no useful computation, only copying. We unroll the time loop in order to eliminate this kernel.

Thirdly, the basic CUDA kernel of Algorithm 8 is not optimal with respect to memory access. To optimize concurrent read access to global memory by simultaneously running threads, constant memory is preferable. Applying this approach, we place all invariable values in the GPU constant memory.

Moreover, to optimize the kernel launch parameters we use the CUDA Occupancy Calculator. It computes the optimal streaming multiprocessor (SM) utilization, taking into account the GPU compute capability, CUDA device properties, the number of blocks in a grid, the number of threads per block, the size of the shared-memory space, and the number of registers per thread.

Consequently, the second CUDA-version of Algorithm 7 for the CPU/GPU hybrid architecture is the following.

Algorithm 9 (CUDA parallel, version 2).

(1) Calculate kernel configuration (blockSize, blockCount) using data about a computational domain.

(2) Allocate host (CPU) and device (GPU) memory, copy initial data from host to device, and copy constant data from host to constant memory of device.

(3) Time loop: for each time step k = 1, 3, 5, ..., K - 1 do:

(3.1) Launch the first (boundary-cell) CUDA kernel in the first CUDA stream:

(3.1.1) Execute items (2.1)-(2.4) of Algorithm 7 in parallel for each cell [[OMEGA].sub.i,j] whose characteristics intersect the boundary.

(3.2) Launch the second (inner-cell) CUDA kernel in the second CUDA stream:

(3.2.1) Execute items (2.1), (2.3)-(2.4) of Algorithm 7 in parallel for each internal cell [[OMEGA].sub.i,j].

(3.3) Synch point: wait for calculations of both kernels to be completed.

(3.4) Launch the first (boundary-cell) CUDA kernel in the first CUDA stream:

(3.4.1) Execute items (2.1)-(2.4) of Algorithm 7 in parallel for each cell [[OMEGA].sub.i,j] whose characteristics intersect the boundary.

(3.5) Launch the second (inner-cell) CUDA kernel in the second CUDA stream:

(3.5.1) Execute items (2.1), (2.3)-(2.4) of Algorithm 7 in parallel for each internal cell [[OMEGA].sub.i,j].

(3.6) Synch point: wait for calculations of both kernels to be completed.

(3.7) If necessary, copy results from device to host.

(3.8) If necessary, calculate the norms of a solution, an error, and other statistic data with respect to an actual time step.

The end of the time loop.

(4) If K is odd then repeat items (3.1)-(3.2).

(5) Copy results from device to host.

5. Numerical Experiments

Specify the velocities

$$u(t, x, y) = 100\, y (1 - y) \left[\frac{\pi}{2} - \arctan x\right],$$

$$v(t, x, y) = \arctan\left(\frac{x (1 - x)\, y (1 - y) (1 + t)}{10}\right) \quad (45)$$

and take the initial and boundary conditions in the following form:

$$\forall (x, y) \in D: \quad \rho(0, x, y) = \rho_{\mathrm{init}}(x, y) = 1,$$

$$\forall (t, y) \in [0, T] \times [0, 1]: \quad \rho(t, 0, y) = \rho_{\mathrm{in}}(t, y) = 1. \quad (46)$$

Numerical experiments were performed on the ICM SB RAS FLAGMAN computing system with the following configuration.

Hardware. CPU: Intel Xeon, 2 x 6 cores, Hyper-Threading; 40 GB DDR3; GPU: NVIDIA Tesla C2050 (compute capability 2.0).

Software. OS: Ubuntu 11.04; C/C++: GCC 4.5.2, Intel C++ Compiler 13.1.0; CUDA C/C++: NVCC 5.0; NVIDIA CUDA 5.0; Boost 1.53; NVIDIA CUDA-GDB 5.0.

One of the purposes of the numerical experiments was to check the convergence order in $\tau$ and $h$. Therefore the computations were performed on a sequence of $N \times N$ regular square grids, $N = 10 \cdot 2^n$, $n = 0, \dots, 6$. The number of time steps is defined by $\tau = h/5$.

Let $\{\rho^n_h\}_{n=0}^{6}$ be the set of solutions found on the sequence of square grids. The quantity $\log_2\left(\|\rho^n_h - \rho\| / \|\rho^{n+1}_h - \rho\|\right)$, as a function of $n$, can be considered the observed order of convergence (Figure 6). The corresponding exact values $\rho^{K}_{i,j}$ were computed directly by the method of characteristics. In Figure 6 we can see first-order convergence in $h$.

In our sequential and OpenMP computational experiments we compare the GCC and Intel C++ compilers. The execution time of the best Intel-compiled code is on average 15% less than that of the best GCC-compiled code. All presented numerical results were obtained with the -O2 optimization level. We also tried other compiler options (-O3, -parallel, -AVX), but the performance increased only slightly or sometimes even decreased. For the CUDA version we used the NVCC compiler.

The computation speedup of the OpenMP version is presented in Table 1 and in Figure 8. The first line of the table shows the speedup (or rather slowdown) of one thread executing the OpenMP code as compared with the sequential code. These data give an estimate of the compiler optimization under the OpenMP library and of the overhead related to the synchronization of OpenMP threads. As we can see, in our case the overhead is small.

One of the purposes of the study is to assess the influence of Hyper-Threading (HT) technology on the parallelism. Since only 12 physical cores are available on the CPU, we have access to 24 logical cores when HT is enabled and to 12 logical cores when HT is disabled. Experiments show that with HT enabled the execution time with 24 threads is about 14% less than that with 12 threads with HT disabled (Figure 7). As our code is compute-intensive, the advantage of HT is probably related to the optimization of memory access.

The execution times of the sequential, OpenMP, and two CUDA versions are given in Table 2. The "*" symbol signifies that the result was not reached in reasonable time; for the CUDA versions the "**" symbol means that the kernel launch failed due to a register bottleneck. For OpenMP, the results on 12 threads with HT disabled and on 24 threads with HT enabled are shown.

Comparative information on code optimization by the GCC compiler is also presented in Table 2. We measured the execution time of the code compiled without optimization and with the -O2 and -O3 optimization levels, respectively. It is clear from Table 2 that compiler optimization considerably (more than twofold) reduces the execution time of the sequential code. As the host compiler does not optimize the CUDA code, the execution time of the CUDA versions does not depend on the compiler options.

PSEUDOCODE 1
__global__ void unit_foo(float* inA, float* out, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size)
        out[i] = do_smth_with(inA[i]);
}

Table 3 and Figure 8 present the computation speedup of the best OpenMP-version and of the two CUDA-versions in comparison with the sequential program compiled with -O2. The speedup of the second CUDA-version relative to the OpenMP-version is given as well. Numerical experiments show that on fine grids, in comparison with the sequential program, the best OpenMP program with 24 threads gives more than 12 times speedup and the second CUDA-version gives about 16 times speedup.

6. Discussion

Nowadays there are many algorithms in the family of semi-Lagrangian methods. As we mentioned above, the presented method is restricted to a square grid and uses auxiliary algorithms that make the computation more resource-intensive. Nevertheless, this complication allows us to prove first-order convergence. Furthermore, we present a theorem that takes into account the volume of substance passing through a boundary, which enables us to prove the balance equation. Numerical experiments completely confirm the theoretical convergence results.

As for parallel implementation of our approach, we can note the following.

Though the algorithm is explicit, we are not satisfied with the performance of the CUDA-versions.

First of all, there is a problem with the feasibility of the CUDA-code on fine grids (more than 640 x 640 nodes). Profile-feedback analysis reveals at least two causes: (1) the adopted decomposition of the computational domain and (2) the hardware constraint on the number of registers per streaming multiprocessor.

Concerning the first cause, it should be noted that in our approach the 2D computational domain is mapped to a 1D array of cells, and every thread treats one cell (see Pseudocode 1).

To make the kernel launch parameters readily adaptable, we can apply "thread reuse": every thread treats several uncoupled cells (see Pseudocode 2).

Besides, we can use several video adapters. In this case we must solve the problem of distributing the computational cells between the devices under the following restriction: we do not know in advance how many cells of the previous time level are required to calculate a current value (i.e., the shadow region has varying width).

The register problem in our case is related to the deep nesting of function calls in item (2.3), which determines the mutual arrangement of P^{k-1}_{i,j} and the cells of the mesh at the previous time level. Indeed, this is the bottleneck of our sequential algorithm. Unfortunately, to resolve this issue we would have to modify the initial sequential algorithm.

In addition, item (2.3) contains many flow control instructions (mainly "if" statements), but we cannot say that this branching creates a significant performance penalty on the SIMT architecture. Indeed, in a CUDA kernel any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge [32]. If this happens, the different execution paths must be serialized, since all threads of a warp share a program counter; this increases the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.

Fortunately, this serialization occurs only where the control flow varies within a warp, for example, when it depends on the thread ID. In our case, however, most branches do not depend on the thread ID. Inside our computational kernel, the main part of the branches looks like Pseudocode 3.

As we can see, this "if" statement does not depend on the thread ID. Thus, this type of branch does not degrade the performance of the CUDA kernel.

PSEUDOCODE 2
__global__ void vec_foo(float* inA, float* out, int size)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < size;
         i += blockDim.x * gridDim.x)
        out[i] = do_smth_with(inA[i]);
}

PSEUDOCODE 3
if ((indCurSqOx[1] >= 0) && (indCurSqOy[1] >= 0)) {
    do_smth();
} else {
    do_smth_else();
}

7. Conclusion

We have presented an unconditionally stable semi-Lagrangian scheme of first-order accuracy. The numerical experiments confirm the theoretical studies. The performance of the sequential algorithm and of several parallel algorithms implemented in the C language with the OpenMP and CUDA technologies was studied. The optimization potential of the CUDA-version has not yet been exhausted.

http://dx.doi.org/10.1155/2014/610398

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was supported by Project 14-11-00147 of the Russian Scientific Foundation.

References

[1] S. K. Godunov, A. V. Zabrodin, M. Y. Ivanov, A. N. Kraiko, and G. P. Prokopov, Numerical Solving of Multidimensional Gas Dynamics Problems, Science Publisher, Moscow, Russia, 1976, (In Russian).

[2] S. Godunov, Mathematical Physics Equations, Science Publisher, Moscow, Russia, 1979, (In Russian).

[3] J. D. Anderson, Computational Fluid Dynamics: The Basics with Applications, McGraw-Hill, New York, NY, USA, 1995.

[4] M. Lentine, J. T. Gretarsson, and R. Fedkiw, "An unconditionally stable fully conservative semi-Lagrangian method," Journal of Computational Physics, vol. 230, no. 8, pp. 2857-2879, 2011.

[5] D. Levy, G. Puppo, and G. Russo, "Central WENO schemes for hyperbolic systems of conservation laws," Mathematical Modelling and Numerical Analysis, vol. 33, no. 3, pp. 547-571, 1999.

[6] R. J. LeVeque, Finite Volume Methods for Hyperbolic Problems, Cambridge Texts in Applied Mathematics, Cambridge University Press, 2002.

[7] S. Clain, S. Diot, and R. Loubere, "A high-order finite volume method for systems of conservation laws with Multi-dimensional Optimal Order Detection (MOOD)," Journal of Computational Physics, vol. 230, no. 10, pp. 4028-4050, 2011.

[8] M. Kaser and A. Iske, "ADER schemes on adaptive triangular meshes for scalar conservation laws," Journal of Computational Physics, vol. 205, no. 2, pp. 486-508, 2005.

[9] J. S. Scroggs and F. H. M. Semazzi, "A conservative semi-Lagrangian method for multidimensional fluid dynamics applications," Numerical Methods for Partial Differential Equations, vol. 11, no. 5, pp. 445-452, 1995.

[10] J. Behrens, "A parallel adaptive finite-element semi-lagrangian advection scheme for the shallow water equations," in Modeling and Computation in Environmental Sciences, vol. 59 of Notes on Numerical Fluid Mechanics (NNFM), pp. 49-60, 1997.

[11] A. Klar, P. Reutersward, and M. Seaid, "A semi-Lagrangian method for a Fokker-Planck equation describing fiber dynamics," Journal of Scientific Computing, vol. 38, no. 3, pp. 349-367, 2009.

[12] O. Pironneau, "On the transport-diffusion algorithm and its applications to the Navier-Stokes equations," Numerische Mathematik, vol. 38, no. 3, pp. 309-332, 1982.

[13] E. Carlini, M. Falcone, and R. Ferretti, "A time-adaptive semi-Lagrangian approximation to mean curvature motion," in Numerical Mathematics and Advanced Applications, pp. 732-739, Springer, Berlin, Germany, 2006.

[14] D. R. Durran, Numerical Methods for Wave Equations in Geophysical Fluid Dynamics, Springer, New York, NY, USA, 1999.

[15] K. W. Morton, Numerical Solution of Convection-Diffusion Problems, Chapman & Hall, London, UK, 1996.

[16] A. Staniforth and J. Cote, "Semi-Lagrangian integration schemes for atmospheric models--a review," Monthly Weather Review, vol. 119, no. 9, pp. 2206-2223, 1991.

[17] A. Priestley, "A quasi-conservative version of the semi-Lagrangian advection scheme," Monthly Weather Review, vol. 121, no. 2, pp. 621-629, 1993.

[18] H. Ritchie, "Semi-Lagrangian advection on a Gaussian grid," Monthly Weather Review, vol. 116, no. 2, pp. 608-619, 1987.

[19] A. Robert, T. L. Yee, and H. Ritchie, "A semi-Lagrangian and semi-implicit numerical integration scheme for multilevel atmospheric models," Monthly Weather Review, vol. 113, no. 3, pp. 388-394, 1985.

[20] J. Behrens and L. Mentrup, "A conservative scheme for 2D and 3D adaptive semi-Lagrangian advection," Tech. Rep. TUM M0411, Technische Universitat Munchen, Fakultat fur Mathematik, 2004.

[21] A. Wiin-Nielson, "On the application of trajectory methods in numerical forecasting," Tellus, vol. 11, pp. 180-186, 1959.

[22] V. V. Shaidurov, G. I. Shchepanovskaya, and V. Yakubovich, "Numerical simulation of supersonic flows in a channel," Russian Journal of Numerical Analysis and Mathematical Modelling, vol. 27, no. 6, pp. 585-601, 2012.

[23] H. Chen, Q. Lin, V. V. Shaidurov, and J. Zhou, "Error estimates for triangular and tetrahedral finite elements in combination with a trajectory approximation of the first derivatives for advection-diffusion equations," Numerical Analysis and Applications, vol. 4, no. 4, pp. 345-362, 2011.

[24] T. N. Phillips and A. J. Williams, "Conservative semi-Lagrangian finite volume schemes," Numerical Methods for Partial Differential Equations, vol. 17, no. 4, pp. 403-425, 2001.

[25] J. P. R. Laprise and A. Plante, "A class of semi-Lagrangian integrated-mass (SLIM) numerical transport algorithms," Monthly Weather Review, vol. 123, pp. 553-565, 1995.

[26] T. Arbogast and W. Wang, "Convergence of a fully conservative volume corrected characteristic method for transport problems," SIAM Journal on Numerical Analysis, vol. 48, no. 3, pp. 797-823, 2010.

[27] K. Takizawa, T. Yabe, and T. Nakamura, "Multi-dimensional semi-Lagrangian scheme that guarantees exact conservation," Computer Physics Communications, vol. 148, no. 2, pp. 137-159, 2002.

[28] A. Iske and M. Kaser, "Conservative semi-Lagrangian advection on adaptive unstructured meshes," Numerical Methods for Partial Differential Equations, vol. 20, no. 3, pp. 388-411, 2004.

[29] E. A. Novikov, Explicit Methods for Stiff Problems, Science Publisher, Novosibirsk, Russia, 1997, (In Russian).

[30] V. I. Krylov, Approximate Computation of Integrals, Science Publisher, Moscow, Russia, 1967, (In Russian).

[31] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley, 2010.

[32] CUDA C Best Practices Guide V.6.0, NVIDIA Corporation, 2014, http://docs.nvidia.com/cuda/pdf/CUDA_C_Best_Practices_Guide.pdf.

Alexander Efremov, (1) Eugeniya Karepova, (1, 2) Vladimir Shaydurov, (1, 3) and Alexander Vyatkin (1, 2)

(1) Institute of Computational Modeling SB RAS, Krasnoyarsk, Akademgorodok 660036, Russia

(2) Siberian Federal University, Svobodny Prospect, Krasnoyarsk 660041, Russia

(3) Beihang University, Haidian District, Beijing 100191, China

Correspondence should be addressed to Eugeniya Karepova; e.d.karepova@icm.krasn.ru

Received 3 April 2014; Revised 16 August 2014; Accepted 19 August 2014; Published 27 October 2014

Academic Editor: Xiaohui Yuan

TABLE 1: Speedup of the OpenMP-code (Hyper-Threading switched Off/On). Columns: number of mesh points in one space dimension (number of time steps).

Threads   80 x 80 (400)   160 x 160 (800)   320 x 320 (1600)   640 x 640 (3200)   1280 x 1280 (6400)
1         0.94/1.00       0.99/1.00         0.99/1.00          0.99/1.00          0.99/1.00
4         3.86/3.89       3.90/3.93         3.92/3.93          3.93/3.95          3.93/3.96
8         7.40/7.35       7.48/7.37         7.53/7.57          7.55/7.59          7.56/7.60
12        10.86/10.86     11.08/11.04       11.17/11.12        11.21/11.14        11.20/10.86
16        4.15/8.76       4.77/8.94         6.32/8.96          7.54/8.98          7.60/8.88
20        4.69/10.63      5.82/10.85        6.55/10.94         7.26/10.96         7.89/10.73
24        5.71/10.98      6.17/12.80        7.30/12.55         7.97/12.83         8.45/12.85

TABLE 2: Execution time of all versions of the program. Columns: number of mesh points in one space dimension (number of time steps).

Version                    80 x 80 (400)   160 x 160 (800)   320 x 320 (1600)   640 x 640 (3200)   1280 x 1280 (6400)
Sequential, -O0            20.16           159.40            1268.46            10107.80           *
Sequential, -O2            9.99            78.97             626.72             4980.61            39598.90
Sequential, -O3            9.87            78.08             619.80             4936.25            39202.91
OpenMP(12), -O0, HT Off    1.98            12.52             103.45             819.06             6519.87
OpenMP(12), -O2, HT Off    0.92            7.13              56.10              444.15             3535.53
OpenMP(24), -O2, HT On     0.91            6.17              49.93              388.20             3080.91
CUDA version 1             2.55            13.06             74.02              **                 **
CUDA version 2             3.20            8.27              42.60              308.82             **

TABLE 3: Speedup of the parallel versions. Columns: number of mesh points in one space dimension (number of time steps).

Version                                    80 x 80 (400)   160 x 160 (800)   320 x 320 (1600)   640 x 640 (3200)
OpenMP(24), -O2, HT On / sequential        10.98           12.80             12.55              12.83
CUDA version 1 / sequential                3.92            6.05              8.47               **
CUDA version 2 / sequential                3.12            9.55              14.71              16.13
CUDA version 2 / OpenMP(12), -O2, HT Off   0.29            0.86              1.32               1.44
CUDA version 2 / OpenMP(24), -O2, HT On    0.28            0.75              1.17               1.26


Publication: Journal of Applied Mathematics
