# Fully Homomorphic Encryption Based On the Parallel Computing.

1. IntroductionThe privacy leakage problem on the untrusted servers has become a serious hazard problem in information security field. For example, the security of the cloud computing platform depends on the situation of solving the privacy leakage problem [1], and meanwhile its solution is also impacts the popularity in the industry. To solve the privacy leakage problem, people mainly use traditional cryptographic methods to encrypt user's privacy data stored on the cloud computing platform, such as the proxy re-encryption algorithm, the property encryption algorithm and so on. Unfortunately, the traditional cryptographic methods don't support the ciphertext calculation, so they need to decrypt the ciphertext firstly when they want to handle user's privacy data stored on the cloud computing platform and encrypted with the traditional cryptographic methods. This not only increases the complexity of operating user's privacy data, but also easily causes the privacy leakage problem during decrypting the ciphertext of user's privacy data. Therefore, it is necessary to find a cryptographic scheme which supports ciphertext calculation.

As is known to all, fully homomorphic encryption(FHE) scheme supports ciphertext calculation, so it satisfies the security requirement of the untrusted servers such as the cloud computing platform. The security of the untrusted servers will be greatly improved if they get the security support of FHE scheme.

At present, since Gentry constructed the first FHE scheme in 2009, more and more FHE schemes have been proposed and improved [2][3][4][5]. However, the existing FHE schemes can't be put into the practical applications because of their low efficiency [6]. For example, it will spend more than 30 minutes when the homomorphic operations of BGV scheme is run, although BGV scheme is the best efficient scheme during the existing FHE schemes. So it is easy to see that the running time can't satisfy the demand of the practical applications.

1.1 Introduction to the mian FHE schemes

At present, the best performance of FHE scheme is BGV(Fully Homomorphic Encryption without Bootstrpping), whose time complexity is [[??].sub.([[gamma].sup.2])] [7]. Besides BGV, there are several FHE schemes as follows.

(1) Fully homomorphic encryption over the integers, namely DGHV, with the time complexity 0([[gamma].sup.14]) [8].

(2) Fully homomorphic encryption without modulus switching from classical gapsvp, namely Bra12, with the time complexity [[??].sub.([[gamma].sup.6])] [9].

(3)Homomorphic encryption of approximate eigenvector: conceptually-simpler, asymptotically-faster, attributed-based, namely GSW, with the time complexity of per gate O([N.sup.[omega]]) [10], where [omega]<2.3727.

(4) FHEW: Bootstrapping Homomorphic Encryption in less than a second [11], namely FHEW. It mainly optimizes bootstrapping technique with PPT algorithm, so its performance is mainly embodied in the running time of bootstrapping technique, and you can know by title.

Moreover, there are many FHE schemes [12-17] which all are optimized and improved on the above schemes. Meanwhile, they have been proved that they still can't be put into the practical applications due to their low efficiency.

1.2 Our motivation

The essence of low efficiency for the existing FHE schemes is due to the noise, which strengthens the security of FHE schemes, but also leds many complex and expensive operations into FHE schemes, such as relinearization technique, modulus switching technique, dimension reduction technique and so on. If we remove the noise, FHE schemes will be unsafety. It is obvious that insecure FHE schemes are useless, so the noise can't be removed. However, if FHE schemes can't be put into the practical applications, they are still useless. So it is an urgent problem to put FHE schemes into the practical applications. According to scientists' predictions, the study of FHE schemes will be a long process.

In order to solve the dilemma that FHE schemes can't be put into the practical applications, we optimize FHE schemes by the parallel computing. Our purpose is to reduce the time complexity of homomorphic operations in FHE scheme by the parallel computing. The main principle is to improve the performance of homomorphic operations by sacrificing hardware resources. The process is shown below: Firstly, we will decompose the homomorphic operations into a number of independent sub-operations in accordance with the principle of the parallel computing. Secondly, we assemble a number of processing units into a high-performance computing system which is a 2-dimensional grid. Thirdly, assign one or a group of processing units to these independent sub-operations. Namely, these sub-operations will be processed on the provided processing units respectively, so these sub-operations are running at the same time.

Through the above steps, the homomorphic operations will be optimized by the parallel computing, the running time of the homomorphic operations will be reduced, and the efficiency of the FHE scheme will be improved.

1.3 Roadmap

Next, we will introduce the relevant knowledge of the parallel computing in section 2, including the fundamental of Cannon algorithm. In section 3, we will introduce our proposed schemes, namely the parallelized GSW(PGSW) scheme. In section 4, we will analyze the performance of PGSW scheme. In section 5, we will introduce the corresponding experiments. Finally, we will analyze and summarize the whole paper in section 6.

2. Preliminaries

Below we usually denote the ordinary parameters by lowercase Greek letters, such as security parameter [gamma], dimension parameter m , n and [kappa], modulus parameter q and so on. In addition, we use [bar.v] to represent the vector, use uppercase letters M to represent the matrix, and use [Z.sub.n] to present the integer ring. Next, we introduce the basic principles of the parallel computing and Cannon algorithm, as shown below.

2.1 The parallel computing [18]

The parallel computing which is evolved from the serial computing is an important research direction in computer science. In simple terms, the parallel computing is a supercomputing implemented on the high-performance computing system. There are many high-performance computing systems available, such as the parallel computer cluster, the distributed computer system and so on. Their basic principles all are to use multiple processors to cooperatively solve the sameone problem. The workflow is as follow: Firstly, decompose the problem which is need to process into many independent parts. Secondly, assign an independent processing unit of the high-performance computing system for each individual part. Thirdly, run all the independent processing units of the high performance computing system at the same time. Then, each independent processing unit of the high performance computing system will produce a operation result respectively. According to the principle of the parallel computing, we merge all the operation results into a final result which exactly we need. Since all the independent processing units of the high performance computing system run simultaneously, so the processing time of the whole problem based on the parallel computing is shortened, and correspondingly improve the performance of solving the problem.

According to GSW, due to the presence of noise, it needs to introduce a number of operations to achieve the FHE scheme, such as BitDecomp(), [BitDecomp.sup.-1](), Powersof 2(), matrix addition, matrix multiplication and so on. Practice has proved that these operations are complex and time-consuming, especially matrix multiplication. At present, the optimal time complexity of matrix multiplication is O([N.sup.[omega]])([omega] = 2.3727) achieved by Williams [19] under the serial computing. After analyzing we can see that it is difficult to furtherly optimize the matrix multiplication under the serial computing, so we consider to optimize matrix multiplication under the parallel computing. Now, there are many parallel computing algorithms to optimize matrix multiplication, such as DNS algorithm [20], Cannon algorithm [21], Fox algorithm [22], Systolic algorithm [23] and so on. In this paper, we choose Cannon algorithm as the parallel computing algorithm for matrix multiplication, and we will introduce Cannon algorithm in the next section.

2.2 Cannon Algorithm

Cannon algorithm [24] was proposed by Alpatov etc. in 1997. Its main idea is to divide the matrix into many blocks, and then assign a processing unit of the high-performance computing system for each block respectively. Next, we will introduce the specific implementation process of matrix multiplication based on Cannon algorithm. For example, there are two matrices, namely matrix A and matrix B , and we use A * B to represent the product of matrix A and matrix B. Firstly, divide matrix A and matrix B into p x p blocks which form a 2-dimensional grid whose size is p x p, and the number of each block is decided by its row numbers and column numbers in the 2-dimensional grid. For example, there is a block [A.sub.ij] (0 [less than or equal to] i, j < p), where i and j represents the i -th row and the j -th column in the 2-dimensional grid. Meanwhile, there is a computer cluster which forms a 2-dimensional grid whose size is px p, and the number of each computer is decided by the number of its row numbers and column numbers in the computer cluster. For example, There is a computer, namely [D.sub.i,j] (0 [less than or equal to] i, j < p), where i and j represents the i -th row and the j -th column in the computer cluster.

According to Cannon algorithm, there are two kinds of partitioning methods for matrix. Respectively, striped partitioning method and checkerboard partitioning method. As shown in Fig. 1, they are the sample graph of two kinds of partitioning methods for matrix.

According to different demands, we take the different partition method. Here we mainly use the checkerboard partitioning method, and the algorithm implementation process is shown below.

(1) Assign blocks [D.sub.ij] and [B.sub.ij] to [D.sub.ij] (0 [less than or equal to]i, j<p) which is a processing unit of the computer cluster, and then compute the product [C.sub.i,j] of [A.sub.ij] and [B.sub.ij] on [D.sub.ij].

(2) Move the block [A.sub.ij] (0 [less than or equal to] i, j < p) loop to the left by i steps, and move the block [B.sub.ij] (0 [less than or equal to]i, j<p) cyclic up by j steps.

(3) Execute the multiplication-addition computation on [D.sub.ij], and finally add all the results to [A.sub.i,j]. Move the block [A.sub.ij] (0 [less than or equal to] i, j < p) loop to the left by 1 step, meanwhile move the block [B.sub.ij] (0 < i, j < p) cyclic up by 1 step.

(4) Repeat (3), execute the multiplication-addition on [D.sub.ij] for p times altogether, and execute single step cycle on the block [A.sub.ij] and the block [B.sub.ij] for p times respectively.

From the above we can see, the parallel computing time of Cannon algorithm is [mathematical expression not reproducible] in the 2-dimensional grid, where p is the dimension of 2-dimensional grid, [t.sub.s] is the startup time, and [t.sub.w] is the text-transmission time. Because [t.sub.s] and [t.sub.w] all are "small" numbers, so the parallel computing time of Cannon algorithm is mainly determined by n and p. It should be noted that the minimum dimension of the sub-matrix is 2, namely, p = 1/2 n , then the time complexity of Cannon algorithm is O(n). It can be seen that the time complexity of matrix multiplication can be linear complexity when selecting the appropriate parameters in the high-performance computing system.

3. The proposed scheme(PGSW)

In this section, we will introduce the proposed scheme, parallelized GSW(PGSW), which is based on the GSW. The reason why we propose a new FHE scheme based on GSW is mainly because GSW scheme has many advantages that other FHE schemes don't have. Below we introduce the unique advantages of GSW scheme, as shown below.

(1) Solved the ciphertext dimension expansion problem caused by the homomorphic operations.

(2) Removed many complex and expensive operations, such as relinearization, modulus switching, dimension reduction and so on.

(3) The current FHE schemes are achieved basically on the ring. So its homomorphic operations mainly include addtion and multiplication. At present, only the homomorphic addition and multiplication of GSW scheme is in accordance with the operational rule of common arithmetic operations, and the other homomorphic operations need to redefine the operation symbols, so GSW scheme is the most natural FHE scheme.

So here we choose GSW scheme as our basis, and optimize it by the parallel computing. The main idea of the optimization is to optimize the complex operations in GSW scheme. After optimizing, we get PGSW scheme finally. Next we introduce the optimization process of the basic operations, encryption module, decryption module etc. of GSW scheme.

3.1 Some Basic Operations and Their Optimization

There are many basic operations in GSW scheme, and some of these operations are not homomorphic operations, but they are very important. For example, BitDecomp(), [BitDecomp.sup.-1](), Powersof 2(), Flatten() and so on. Their main function is to ensure GSW scheme B -strongly-bounded, so that GSW scheme can evaluate a circuit of polynomial depth rather than merely polynomial degree. However, most of them are complex operations which have high time complexity. That is, these operations are extremely time-consuming operations under the serial computing so that GSW scheme is low efficient. In order to improve the efficiency of GSW scheme, we need to optimize these operations. As following, we introduce the optimization process of the above operations based on the parallel computing.

PBitDecomp(): BitDecomp() is an important part of Flatten() which can keep the ciphertexts B -strongly-bounded. The primary mission of BitDecomp() is to convert decimal matrices or vectors into binary matrices or vectors. When the operand of BitDecomp() is vector [bar.a] = ([a.sub.1], [a.sub.2],...[a.sub.k]), then according to GSW scheme, we get:

BitDecomp(a) = ([a.sub.1,0],..., [a.sub.1,l-1],...,[a.sub.k,0],..., [a.sub.k,l-1]) (1)

where l * k = N, and [a.sub.i,j] is the j -th bit in the binary representation of [a.sub.i]. According to the definition of BitDecomp() , we can divide the above operation into two steps.

Step 1, convert all the decimal elements of vectors into the corresponding binary bits. Since all matrices and vectors are B - strongly-bounded, so the decimal elements are "small", then we can build a conversion table between the decimal elements and the binary bits. When we need to convert the decimal elements into the binary bits, we can quickly query the conversion table and then get the corresponding results. Because the value range of decimal elements in matrices or vectors is "small", the size of the conversion table is much smaller than the dimension N of matrix or vector, so the time complexity of quering operation for each decimal element through conversion table is O(1). Since there are [KAPPA] decimal elments, so the time complexity of step 1 is O([KAPPA]).

Step 2, assemble all the l -length binary bits into a N -dimensional binary vector. It is obvious that the time complexity of step 2 is O([KAPPA]).

So the time complexity of BitDecomp() is O([KAPPA]) *O([KAPPA])) = O([[KAPPA].sup.2]) when the operand is vector under the serial computing. It is obvious that the time complexity of BitDecomp() is O(N[[KAPPA].sup.2]) when the operand is matrix under the serial computing. Although their time complexity are not high, we use BitDecomp() with other operations together, then the composite operations have high time complexity which hinders GSW from being put into the practical applications, so they need to be optimized by the parallel computing, and the following is the optimization process.

In step 1, the conversion of [KAPPA] decimal elements are mutually independent, so we can use the parallel computing to optimize them. The main idea is to use [KAPPA] processing units of the high performance computing system to do the conversion of [KAPPA] decimal elements respectively, then the conversion of [KAPPA] decimal elements will work at the same time, so the time complexity of the conversion of [KAPPA] decimal elements is O(1), and the time complexity of step 1 is also O(1).

In step 2, we use [KAPPA] processing units of the high performance computing system to do the assembling of [KAPPA] l -length binary bits, namely, each processing unit does one assembling respectively, and then the [KAPPA] processing units work at the same time, so the time complexity of step 2 is also O(1). Thus we can see that the time complexity of BitDecomp() is O(1) when the operand is vector under the parallel computing.

When the operand is matrix, we can compute each row of the matrix respectively, so we also can optimize it with the parallel computing. Namely, compute each row on one processing unit, and then all the processing units work at the same time. Then the time complexity of computing N rows of matrix is O(1) under the parallel computing, so the time complexity of BitDecomp() is also O(1) when its operand is matrix under the parallel computing.

Here, we call the optimization result of BitDecomp() as PBitDecomp(), whose time complexity is O(1) under the parallel computing no matter the operand is matrix or vector. The following are their computation results when the operand are vector and matrix respectively.

PBitDecomp(bar.a) = ([a.sub.1,0],..., [a.sub.1,l-1,]..., [a.sub.k,0],..., [a.sub.k,l-1]) (2)

[mathematical expression not reproducible] (3)

[PBitDecomp.sup.-1](): [PBitDecomp.sup.-1]() is the inverse process of BitDecomp(), and its main function is to convert the binary matrices or vectors into the decimal matrices or vectors. When the operand is vector, for example, (bar.b) = ([b.sub.1,0],..., [b.sub.1,l-1,]..., [b.sub.k,0],..., [b.sub.k,l-1]), then have:

[mathematical expression not reproducible] (4)

Similarly to BitDecomp(), [PBitDecomp.sup.-1]() can also be divided into two steps.

Step 1, convert l - length binary bits into the corresponding decimal element from left to right respectively. From the above analysis we can see, we can quickly query the conversion table for converting the l - length binary bits into a decimal elments. According to the analysis of the previous section, we can see that the time complexity of quering operation for the l - length binary bits is O(1), correspondingly the time complexity of step 1 is O([KAPPA]).

Step 2, assemble all [KAPPA] decimal elements into a [KAPPA] -dimensional dicimal vector. Obviously, the time complexity of the above operation is O([KAPPA]).

So the time complexity of [BitDecomp.sup.-1] () is O([KAPPA]) *O([KAPPA]) = O([K.sub.2]) when the operand is vector under the serial computing.

When the operand is a matrix, it needs to execute the [BitDecomp.sup.-1] () on each row of the matrix separately. According to the analysis of the previous section, we can see that the time complexity of [BitDecomp.sup.-1]() O(([NK.sub.2])) when the operand is a matrix under the serial computing.

The same reason as BitDecomp(), [BitDecomp.sup.-1]() also needs to be optimized by the parallel computing, and the optimization process of [BitDecomp.sup.-1] () is similar to BitDecomp() , here we will no longer introduce it. Finally, the optimization result of [BitDecomp.sup.-1]() is called as [PBitDecomp.sup.-1]() whose time complexity is O(1) under the parallel computing whether the operand is a matrix or vector.

PPowersof 2(): Powersof 2() can ensure that the secret key [bar.v] has at least one "big" coefficient, so it is an important operation in GSW scheme, and its operand is mainly vector. For example, there is a [KAPPA] -dimensional vector [bar.v] = ([b.sub.1],[b.sub.2],...[b.sub.[kappa]]), then we have:

Powersof 2(b) = ([b.sub.1],[2b.sub.1],...,[2.sup.l-1],[b.sub.1]...,[b.sub.k],[2b.sub.k],...,[2.sup.l-11] [b.sub.k]) (5)

So the main function of Powersof 2() is to extend the k -dimensional vector into the N -dimensional vector. According to GSW, this operation can also be divided into two steps.

(1) Convert each element of the operand vector into a l -dimensional vector whose coefficients is [2.sup.0] to [2.sup.l-1] separately, and the time complexity of this conversion for each element is O(l) , so the time complexity of this conversion for [KAPPA] elements is O([KAPPA] * l) = O(N) under the serial computing.

(2) Assemble all the l -dimensional vectors into a N -dimensional vector according to the order of the elments in the operand vector. Obviously the time complexity of this operation is O([KAPPA]).

From the above we can see, the time complexity of Powersof 2() is O(N * K) under the serial computing.

According to the principle of the parallel computing, we can optimize (1) by the parallel computing, and (2) don't need to be optimized. Finally we get a optimization result which we call it as PPowersof 2(). In (1), when convert all elements of the operand vector into l -dimensional vector whose coefficients is from [2.sup.0] to [2.sup.l-1]. Because we can make the convertion of [KAPPA] elements of the oprand vector happened at the same time on [kappa] processing units of the high-performance computing system, then its time complexity is O(l). Thus we can see that the time complexity of (1) is O(l)under the parallel computing.

When combine all the l -dimensional vectors into a N -dimensional vector according to the order of the elments in the operand vector, the time complexity of (2) is O(k). So the time complexity of PPowersof 2([bar.b]) is O([kappa]*l) = O(N) when the operand is vector under the parallel computing.

When the operand of PPowersof 2() is matrix M, it needs to execute the PPowersof 2([bar.b]) on each row of the matrix separately. It is obviously that PPowersof 2(M ) can be optimized by the parallel computing. Namely, let one processing unit to do PPowersof 2([[bar.b].sub.i]) (0 [less than or equal to] i < N), where [[bar.b].sub.i] is the i -th row of matrix M. So we let N processing unit to do PPowersof 2([[bar.b].sub.i]) at the same time. Then the time complexity of PPowersof2(M) is also 0(N) under the parallel computing.

PFlatten() : Flatten() can ensure all the vectors and matrices are B -strongly-bounded in GSW scheme, and then we can obtain a leveled FHE scheme that can evaluate a circuit of polynomial depth without bootstrapping, relinearization and modulus switching and so on. In fact, Flatten() consists of BitDecomp() and

[BitDecomp.sup.-1] (). By GSW, we have:

Flatten() = BitDecomp([BitDecomp.sup.-1] ()) (6)

So we don't need to specifically optimize it because its members have been optimized. Here we call the optimization result of Flatten() to be PFlatten(), and we directly replace the corresponding operation with their optimization value. So we have:

PFlatten() = PBitDecomp([PBitDecomp.sup.-1] ()) (7)

From the above formula we can see, the time complexity of PFlatten() is the product of the time complexity of PBitDecomp() and [PBitDecomp.sup.-1](), since the time complexity of PBitDecomp() and [BitDecomp.sup.-1]() all are O(1) under the parallel computing, so the time complexity of PFlatten() is also O(1) under the parallel computing.

3.2 Some Homomorphic Operations and Their Optimization

There are many homomorphic operations besides the above basic operations, such as Mult(), add(), MultConst() and so on. They all are important operations in GSW scheme, and also have high time complexity, so they are needed to be optimized. The optimization principle is similar to the above optimization process. Namely, we use the principle of the parallel computing to optimize these operations. Next, we introduce their optimization process.

PMatrixMult([C.sub.1], [C.sub.2]): In GSW scheme, Mult() is used to do the matrix multiplication.

According to GSW, we have:

Mult ([C.sub.1], [C.sub.2]) = Flatten([C.sub.1] * [C.sub.2]) (8)

Because Flatten(C) can be optimized into PFlatten(C) which has a constant time complexity under the parallel computing, thus the time complexity of Mult([C.sub.1],[C.sub.2]) is mainly dominated by the matrix multiplications. So far the optimal time complexity of matrix multiplications is 0([N.sup.13111]) which is acchived by williams [18] under the serial computing. Accroding to the principle of the parallel computing, we can optimize matrix multiplication by Cannon algorithm whose working principle has been presented in section 2.2. Thus it is not necessary to introduce its theoretical knowledge here, we only demonstrate the algorithm through the following example.

[mathematical expression not reproducible] (9)

There are two N x N -dimensional matrices A and B. According to Cannon algorithm, our high-performance computing system is a computer cluster of 2-dimensional grid which consists of p x p computers. Moreover, matrix A, matrix B and the computer cluster P are shown as in equation (9).

Next we will introduce the process to optimize the matrix multiplication with Cannon algorithm, and we can divide the above operation into four steps.

Step 1, divide matrix A and matrix B into [p.sup.2] blocks, each sub-block is represented by [A.sub.ij] and [B.sub.ij] (0 [less than or equal to] i < p,0 [less than or equal to] j < p). Easy to calculate, the size of each sub-block is (N / p) x (N / p). Next, these sub-blocks are assigned to [p.sup.2] processing units, namely, assign the block [A.sub.ij] and [B.sub.ij] to the processing unit [P.sub.i,j] (0 [less than or equal to] i, j < p), and then the corresponding sub-results [C.sub.i,j] can be calculated on the corresponding processing unit [P.sub.i,j], here [C.sub.i,j] is the product of the sub-blocks [A.sub.ij] and [B.sub.ij].

Step 2, move the block [A.sub.ij] (0 [less than or equal to] i, j < p) loop to the left by i steps, and move the block [B.sub.ij] (0 [less than or equal to]i, j<p) cyclic up by j steps.

Step 3, execute the multiplication-addition computation on [P.sub.i,j], then we obtain a sub-result [C.sub.i,j], and there are [C.sub.i,j] = [A.sub.i,j] x [B.sub.i,j]+ [C.sub.i,j]. Move [A.sub.i,j] (0 [less than or equal to] i, j < p) loop to the left by 1 step, meanwhile move [B.sub.ij] (0 [less than or equal to]i, j<p) cyclic up by 1 step.

Step 4, repeat step 3, and execute the multiplication-addition on [P.sub.i,j] for p times altogether, execute single step cyclic on the blocks [A.sub.ij] and [B.sub.ij] for p times respectively.

Finally, we get matrix C which is the product of matrix A and matrix B, the result is shown in the following formula.

[mathematical expression not reproducible] (10)

According to Cannon algorithm, the running time is [mathematical expression not reproducible] on the computer cluster, where p is the dimension of computer cluster of 2-dimensional grid, [t.sub.s] is the startup time, and [t.sub.w] is the text-transmission time. Since both [t.sub.s] and [t.sub.w] are "small", so the time complexity of Cannon algorithm is mainly dominated by N and p. If p = 1/2 N, then the time complexity of Cannon algorithm is 0(N). So the optimal time complexity of matrix multiplication under Cannon algorithm is 0(N) when taking the appropriate parameters.

PMatrixAdd([C.sub.1],[C.sub.2]) : Add() is used to do the matrix addition in GSW scheme. According to GSW, we have:

Add ([C.sub.1], [C.sub.2]) = Flatten([C.sub.1] + [C.sub.2]) (11)

where [C.sub.1], [C.sub.2] [member of][Z.sup.N^N.sub.q] are N x N dimension matrix. Because Flatten() can be optimized by the parallel computing, and its optimization result is called as PFlatten() whose time complexity is O(1). So the time complexity of Add([C.sub.1],[C.sub.2]) is mainly dominated by the matrix addition. The time complexity of matrix addition is O([N.sup.2]) under the serial computing, so we can optimize it by the parallel computing. Below we will introduce the process to optimize the matrix addtion with the paralle computing.

For example, there are two N x N -dimensional matrices A and B, and we will compute their sum. According to the parallel computing theory, the matrix addition can also be optimized by the computer cluster of 2-dimensional grid, which consists of p x p computers. Different from matrix multiplication, the optimization of matrix addition needs more computers. Matrix A , matrix B and the computer cluster are shown as below.

[mathematical expression not reproducible] (12)

Firstly, divide the matrices A and B into Nx N blocks, namely [A.sub.ij] and [B.sub.ij] (0 [less than or equal to] i, j < N) and each block only contains one element. Assign the blocks [A.sub.ij] and [B.sub.ij] to the processing unit [P.sub.i,j] (0 [less than or equal to] i, j < N), and then compute the result [C.sub.i,j] on [P.sub.i,j], where [C.sub.i,j]. is the sum of the blocks [A.sub.ij] and [B.sub.ij]. Then we get C which is a N x N matrix, and its element is [C.sub.i,j] (0 [less than or equal to] i, j < N). In fact, Matrix C is the sum of the matrix A and matrix B.

It is obvious that the time complexity of matrix addition is O(1) under the parallel computing because all the computations work at the same time. We call the optimization value of Add([C.sub.1],[C.sub.2]) to be PMatrixAdd().

PMultConst(C,[alpha]): MultConst(C,[alpha]) executes the matrix-constant-multiplication, where C [member of] [Z.sup.NxN.sub.q] is a ciphertext, [alpha] [member of] [Z.sub.q] is a constant number. According to GSW we have:

[M.sub.[alpha]] [left arrow] Flatten([alpha]*[I.sub.N]) (13)

MultConst(C,[alpha]) = Flatten([M.sub.[alpha]] * C) (14)

Since Flatten() has a optimization value under the parallel computing, namely PFlatten() whose time complexity is O(1) , and the time complexity of [M.sub.[alpha]] * C is Matrix-Multiplication whose time complexity is O(N) under the parallel computing, so the time complexity of MultConst(C,[alpha]) is O(N) under the parallel computing, and we call PMultConst(C,[alpha]) as the optimization value of MultConst(C,[alpha]).

From the above analysis we can see that the time complexity have been greatly improved when using the parallel computing to optimize the basic operations and homomorphic operations of GSW scheme. Below we compare the time complexity of them before and after optimization, as shown in Table 1.

It should be noted in Table 1 that V represents the vector and M represents the matrix. Moreover in Table 1, the original time complexity of some operations is significantly higher than the existing optimal time complexity, such as Mult() , namely matrix multiplication whose the optimal time complexity is O([N.sup.[omega]]), but here is O([N.sup.[omega]+2] * [[kappa].sup.4]). Why? In order to ensure the result ciphertexts is also B -strongly-bounded, it needs additional operations. For example, in GSW, there is:

Mult([C.sub.1], [C.sub.2]) = Flatten([C.sub.1] * [C.sub.2]) (15)

In the above formula, it introduces the additional operation, namely Flatten() whose time complexity is O([N.sup.2][k.sup.4]). Thus, the time complexity of the original operation will become larger, through introducing the additional operation.

3.3 The Encryption Model and Its Optimization

GSW scheme is called to be the approximate eigenvector method. Its main idea is to encrypt the plaintext [micro] into the ciphertext matrix C by the approximate eigenvector [bar.v] and fully homomorphic encryption algorithm, where [micro] is a "small" integer, ciphertext C is a N x N dimension matrix over [Z.sub.Q], the key [bar.v] is a N -dimensional vector over [Z.sub.q] and [bar.e] is a small error vector. From GSW we can see that the encryption equation is:

C*[bar.v] = [micro]*[bar.v] + [bar.e] (16)

As shown in the above equation, key v is an approximate eigenvector of the ciphertext matrix C, and the plaintext [micro] is the approximate eigenvalue of the ciphertext matrix C, [bar.e] is the error vector which is B -bounded and its main function is to ensure the security of the cryptographic scheme. The main security principle of this cryptographic scheme is the LWE problem introduced by Regev [25].

Note that the ciphertext of this cryptographic scheme is matrix, the result of ciphertext calculation is also the same dimension matrix, so this way can eliminate the ciphertext dimension expansion problem, and also furtherly eliminate many complex and expensive operations, such as modulus switching, relinearization and so on.

However, this scheme can only evaluate polynomials of polynomial degree in N , namely, this scheme is only somewhat homomorphic encryption scheme. In order to obtain a leveled fully homomorphic encryption scheme, it requires the plaintext [micro] and the ciphertext matrix C are B -strongly-bounded, and the error vector [bar.e] is B -bounded. Then the encryption scheme is shown as the following formula according to GSW scheme.

C = Flatten([micro] * [I.sub.N] + BitDecomp( R * A)) (17)

However, in order to improve the efficiency of the FHE scheme, we use the parallel computing to optimize the complex operation in GSW scheme. From the above formula we can see, there are some basic operations which have been optimized in the previous section, such as Flatten(), BitDecomp(), matrix addtion, matrix multiplication and so on. In the above formula, it mainly involves three operations, such as MatrixAdd () , Flatten() , BitDecomp() and so on. These operations have been optimized in the previous section, so we can directly use the optimization results to replace the corresponding operations in the above formula. Finally, we get the optimization results, as shown below:

C = PFlatten(PMatrixAdd (u * [I.sub.N], PBitDecomp(PMatrixMult (R, A)))) (18)

Where [micro] is the plaintext, C is the ciphertext corresponding to [micro], [I.sub.N] is a N -dimensional unit matrix, R is a random N x m matrix with 0/1 elements, A is m x (n +1) dimension matrix over [Z.sub.q] Note that both R and A are generated by KeyGen() introduced in GSW, so we no longer introduce it in details.

Next, we analyze the performance of this encryption model. From the above we know, both PFlatten() and PBitDecomp() have constant time complexity, so the main overhead of the encryption model is the product of R * A, the product of u * [I.sub.N] and the matrix addition. Because the matrix multiplication and the matrix addition have been optimized under the parallel computing, and their time complexity are O(N) and O(1) respectively. Although R and A are not N -dimensional matrices, in fact, their dimension is less than N, so the time complexity of R * A is less than the time complexity of the matrix multiplication whose dimension is N. For convenience, we take the time complexity of R * A for O(N). From the above, we can see that the time complexity of matrix addtion and the matrix-constant-multiplication are O(1) and O(N). So the time complexity of encryption model is O(N) under the parallel computing.

Because GSW scheme is called to be the FHE scheme of the approximate eigenvector method, so we verify that the encryption module is consistent with the approximate eigenvector attributes under the under parallel computing. That is, verify that whether the new FHE scheme is correct. Next, we verify that the optimized encryption formula still conforms to the properties of the approximate eigenvector method, then we have:

C*[bar.v] = PFlatten([micro]*[I.sub.N] + PBitDecomp(R.A)).[bar.v] (19)

Accroding to GSW, we know that:

< Flatten([bar.a]),Powersof 2([bar.b]) >=<[bar.a],Powersof 2([bar.b]) > (20)

Correspondingly, we have:

<PFlatten([bar.a]),PPowersof 2([bar.b]) >=<[bar.a],PPowersof 2([bar.b]) > (21)

and [bar.v] = PPowersof 2([bar.s]), so we have:

PFlatten([micro] * [I.sub.N] + PBitDecomp( R * A)) * [bar.v] = ([micro] * [I.sub.N] + PBitDecomp(R * A)) * [bar.v] = [micro] *[bar.v] + PBitDecomp(R * A) * [bar.v] = [micro]*[bar.v]+ R * A * [bar.s] = [micro]*[bar.v] + small

So the optimized encryption formula still satisfies the properties of the approximate eigenvector method, it can be seen that our optimization is correct.

3.4 Decryption Model and Its Optimization

The main function of decryption model is to decrypt the ciphertext obtained by the homomorphic encryption or homomorphic operations, and this decryption operation is shown as follows according to GSW:

x [left arrow] [C.sub.i],[bar.v]>= [micro] * [v.sub.i] + [e.sub.i] (23)

[micro] = [x /[v.sub.i]] (24)

where [C.sub.i] is the i-th row of the ciphertext matrix C, [v.sub.i] is the i-th element of vector [bar.v], and [e.sub.i] is the i -th element of the error vector e. In order to correctly decrypt it, there is a big coefficient in [bar.v] at least, and this can be guaranteed by PPowersof 2(). From the above formula, we know that the time complexity of decryption model is O(1), so it is not necessary to be optimized.

3.5 Circuit Model and Its Optimization

In order to meet the plaintext {0,1} space range, GSW scheme uses boolean circuit to realize the FHE scheme. According to De Morgan theorem, we can see that a cryptographic scheme can be called as fully homomorphic encryption scheme when it just supports one of the following homomorphic operations, such as NAND homomorphic operation, AND and XOR homomorphic operation, or NOR homomorphic operation. It uses NAND homomorphic operation in GSW scheme, so we only optimize NAND circuit here.

PNAND([C.sub.1], [C.sub.2]): Accroding to GSW scheme, we can use NAND gates to construct a leveled FHE scheme whose depth of the arithmetic circuit is L. In fact, it uses NAND([C.sub.1],[C.sub.2]) to express the ciphertext of NAND([[micro].sub.1],[[micro].sub.2]) in GSW, where [[micro].sub.1] and [[micro].sub.2] all are the binary bits, [C.sub.1] and [C.sub.2] are the ciphertext of [[micro].sub.1] and [[micro].sub.2]. In order to optimize the NAND gates, we can only optimize the NAND([C.sub.1],[C.sub.2]).

Accroding to GSW, we know that:

NAND([C.sub.1], [C.sub.2]) = Flatten([I.sub.N] - [C.sub.1].[C.sub.2]) (25)

From the above formula we can see that there are three basic operations, namely Flatten() , matrix subtraction and matrix multiplication. Here Flatten() and matrix multiplication have been optimized in the previous section, so we can use its optimization results directly, namely PFlatten() and PMatrixMult(). Because matrix subtraction is the opposite operation of the matrix addition, so we can use the optimization value of matrix addition to replace the optimization value of matrix subtraction according to the relationship between addition and subtraction. So we can directly use the optimization results to replace the corresponding operation in the above formula, and the result is shown as the following formula:

PNAND([C.sub.1], [C.sub.2]) = PFlatten( PMatrixAdd ([I.sub.N], - PMatrixMult ([C.sub.1], [C.sub.2]))) (26)

From the above formula we can see, the time complexity of PFlatten() and PMatrixAdd() all are O(1), and the time complexity of PMatrixMult() is O(N). Hence, the time complexity of PNAND([C.sub.1], [C.sub.2]) is O(N) under the parallel computing.

4.The Performance of PGSW

According to GSW scheme, we obtain a leveled fully homomorphic encryption scheme which is a circuit of depth- L with NAND gates. For the dimension parameter N and the depth parameter L , GSW scheme evaluates depth- Lcircuits of NAND gates with O([N.sup.4+[omega]] * [[kappa].sup.4]) field operations for per gate, where [omega] < 2.3727; In PGSW scheme, the field operations of per gate is O(N), so the time complexity for evaluating depth- L circuits of NAND gates is O(NL) while GSW scheme is O([N.sup.4+[omega]] * [[kappa].sup.4]*L).

Moreover, although the decryption operation in the original scheme does not need to be optimized, the encryption operation in the original scheme is still optimized by the parallel computing. It can be seen that the time complexity of the encryption operation before being optimized is O([N.sup.[omega]+5] * [[kappa].sup.6]), and the time complexity of the encryption operation after being optimized is O(N). The reason why the time complexity of encryption operation is greatly improved is that its encryption operation is composed of many complex operations, and the time complexity of the encryption operation is the product of the time complexitys of these complex operations. Because these complex operations can be optimized greatly by the parallel computing, so the time complexity of the encryption operation can also been optimized greatly.

It can be seen that the time complexity of PGSW has been greatly improved. There are the performance comparisons of the usual FHE schemes showed as in Table 2 [25].

5.Experiments

In this section, we take some experiments to verify the performance of our scheme. Becasuse the main overhead of this scheme is matrix multiplication, and the time complexity of other operations can not exceed the time complexity of the matrix multiplication, so we mainly implement Cannon algorithm based on MPI. Experimental environment is as follows: the operating system platform is win7, Cpu is i5-3337U@1.80GHz, the development kit is visual C++ 6.0 and the mpich2-1.4p1-win-ia32. The experimental results are shown as in Table 3.

As can be seen from Table 3, when the dimension of matrix is 200 and the number of the processing unit is 1, the overhead time is 3.853772s through running Cannon algorithm which is implemented by MPI; When the dimension of matrix is 200 and the number of the processing unit is 4, the overhead time is 1.530456s through running Cannon algorithm which is implemented by MPI; When the dimension of matrix is 200 and the number of the processing unit is 8, the overhead time is 0.745632s through running Cannon algorithm which is implemented by MPI; When the dimension of matrix is 400 and the number of the processing unit is 1, the overhead time is 5.772368s through running Cannon algorithm which is implemented by MPI; When the dimension of matrix is 400 and the number of the processing unit is 4, the overhead time is 3.163587s through running Cannon algorithm which is implemented by MPI; When the dimension of matrix is 400 and the number of the processing unit is 8, the overhead time is 1.465911s through running Cannon algorithm which is implemented by MPI;

From the Table 3, we know that the overhead time of Cannon algorithm will be reduced when the number of the processing unit increases. According to Cannon algorithm, the largest number of processing unit is N/2, then we can predict that its overhead time of Cannon algorithm will reach microseconds. It can be seen that the performance of the FHE scheme based on the parallel computing has been greatly improved, and it promotes the developement of FHE scheme.

6. Conclusions

With the development of cloud computing, more and more user's privacy datas are stored on the cloud servers which are untrusted platform, so the leakage problem of user's privacy datas will become an unavoidable problem, and then urgently need to be solved. Accroding to the attributes of FHE, it is the best way to solve the leakage problem of privacy data on untrusted servers. However, the current FHE schemes aren't suitable for the practical applications because of their inefficiency, so we urgently need to improve the efficiency of the existing FHE schemes.

In this paper, we optimize GSW scheme by the parallel computing, and then gain PGSW scheme. From the analysis we know that the performance of the new scheme has been greatly improved, as shown in Table 2. Through the experiment we known that the time overhead of matrix multiplication which is the highest time complexity of GSW scheme will reach the level of microsecond when we select appropriate amount of the processing units of high performance computing system. Although PGSW scheme takes up more hardware resources, and also increases the cost of FHE scheme, it play an important role in putting FHE scheme into practical applications.

References

[1] R. W. Huang, X. L. Gui, S. Yu, "Design of Privacy-Preserving Cloud Storage Framework," in Proc. of the Ninth International Conference on Grid and Cloud Computing, pp. 128-132, November 1-5, 2010. Article (CrossRef Link).

[2] J. Alperin-Sheriff, C. Peikert, "Faster bootstrapping with polynomial error,". in Proc. of the International Cryptology Conference, pp. 297-314, August 17-21, 2014. Article (CrossRef Link).

[3] K. Myungsun, H. T. Lee, S. Ling, H. X. Wang, "On the Efficiency of FHE-based Private Queries," IEEE Transactions on Dependable & Secure Computing, vol. 1, no.99, pp. 1176-1189, 2016. Article (CrossRef Link).

[4] H. S. Wang, Q. Tang, "Efficient Homomorphic Integer Polynomial Evaluation based on GSW FHE," Cryptology ePrintArchive, Report 2016/488, pp.488-505, 2016. Article (CrossRef Link).

[5] J. H. Cheon, K. Han, D. Kim, "Faster Bootstrapping of FHE over the Integers," Cryptology ePrint Archive, Report 2017/079, pp.79-91,2017. Article (CrossRef Link).

[6] S. Halevi, V. Shoup, "Bootstrapping for helib," in Proc. of the Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 641-670, April 23-30, 2015. Article (CrossRef Link).

[7] Z. Brakerski, C. Gentry, V. Vaikuntanathan, "(Leveled)fully homomorphic encryption without bootstrapping," in Proc. of the 3rd Innovations in Theoretical Computer Science Conference, pp. 309-325, January 8-10, 2012. Article (CrossRef Link).

[8] M. V. Dijk, C. Gentry, S. Halevi, "Fully homomorphic encryption over the integers," in Proc. of the Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 24-43, May 30-June 3, 2010. Article (CrossRef Link).

[9] Z. Brakerski, "Fully homomorphic encryption without modulus switching from classical GapSVP," in Proc. of the 32nd Annual Cryptology Conference, pp.868-886, August 19-23, 2012. Article (CrossRef Link).

[10] C. Genry, A. Sahai, B. Water, "Homomorphic encryption from learning with errors: conceptually-simpler, asymptotically-faster, attribute-based," in Proc. of the 33rd Annual Cryptology Conference Advances in Cryptology, pp.75-92, August 18-22, 2013. Article (CrossRef Link).

[11] L. Ducas and et al, "FHEW:Bootstrapping Homomorphic Encryption in less than a second," in Proc. of the Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp.617-640, April 23-30, 2015. Article (CrossRef Link).

[12] Z. Brakerski, V Vaikuntanathan, "Efficient fully homomor-phic encryption from (standard) LWE," in Proc. of IEEE 52nd Annual Symposium on Foundations of Computer Science, pp. 97-106, October 22-25, 2011. Article (CrossRef Link).

[13] C. Gentry, S. Halevi, N. Smart, "Homomorphic evaluation of the AES circuit," in Proc. of the 32nd Annual Cryptology Conference, pp. 850-867, August 19-23, 2012. Article (CrossRef Link).

[14] Z. Brakerski, C. Gentry, S. Halevi, "Packed ciphertexts in LWE-based homomorphic encryption," in Proc. of the 16th International Conference on Practice and Theory in Public-Key Cryptography, pp. 1-13, February 26 - March 1, 2013. Article (CrossRef Link).

[15] R. Hiromasa, M. Abe and et al, "Packing Messages and Optimizing Bootstrapping in GSW-FHE," in Proc. of IACR International Workshop on Public Key Cryptography, pp. 699-715, March 30-April 1, 2015. Article (CrossRef Link).

[16] J.Biasse, L.Ruiz, "FHEW with efficient multibit bootstrapping," in Proc. of the International Conference on Cryptology and Information Security in Latin America, pp. 119-135, August 23-26, 2015. Article (CrossRef Link).

[17] I. Chillotti, N. Gama and et al, " Faster Fully Homomorphic Encryption:Bootstrapping in less than 0.1 Seconds," in Proc. of the International Conference on the Theory and Application of Cryptology and Information Security, pp. 3-33, December 4-8, 2016. Article (CrossRef Link).

[18] S. Nicola, "Design and Analysis of Distributed Algorithms, 1st Edtion," Wiley, New York, 2006. Article (CrossRef Link).

[19] V. V. Williams, "Multiplying matrices faster than coppersmith-winograd," in Proc. of the forty-fourth annual ACM symposium on Theory of computing, pp. 887-898, May 19-22, 2012. Article (CrossRef Link).

[20] E. Dekel, D. Nassimi, S. Sahni, "Parallel matrix and graph algorithms," SIAM J. Comput, vol.10, no.4, pp.657-675, November, 1981. Article (CrossRef Link).

[21] D. G. R. A. Van, J. Watts, "SUMMA: scalable universal matrix multiplication algorithm," Concurrency & Computation Practice & Experience, vol.9, no.4, pp. 29-29, April, 1997. Article (CrossRef Link).

[22] R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar, "A three-dimensional approach to paralle matrix multiplication," IBM Journal of Research and Development, vol. 39, no. 5, pp. 575-582, September, 1995. Article (CrossRef Link).

[23] D. J. Evans, G. M. Megson, "A systolic simplex algorithm, 1st Edition," International Journal of Computer Mathematics, Berkshire, 1991. Article (CrossRef Link).

Delin Tan(1981- ) is a lecturer in SiChuan Normal University, China. He is currently pursuing a doctoral degree in the School of College of Geophysics,Chengdu University of Technology, China. His research interests include fully homomorphic encryption, cloud computing etc.

Huajun Wang(1964-) is a Professor and Ph.D of Chengdu University of Technology. His main research interests are sensor networks, network security, embedded and so on.

Delin Tan (1,2), Huajun Wang (1)

(1) College of Geophysics, Chengdu University of Technology, No.l, Dong san Road, Er xian qiao, Chenghua District, Chengdu, China

[e-mail:hjwang@sdut.edu.cn]

(2) Sichuan Normal University, No.5, Jing'an Road, Jinjiang District, Chengdu, China

[e-mail:tdltcl@ 126. com]

(*) Corresponding author: Huajun Wang

Received April 29, 2017; revised August 10, 2017; accepted September 9, 2017; published January 31, 2018

http://doi.org/10.3837/tiis.2018.01.024

Table 1. The time complexity comparisons before and after optimization BitDecomp() PBitDecomp() [BitDecomp.sup.-1]() V O([k.sup.2]) V O(1) V O([k.sup.2]) M O( N[k.sup.2]) M O(1) M O( N[k.sup.2]) [PBitDecomp.sup.-1]() Powersof 2() V O(1) V O(N * k) M O(1) M O([N.sup.2] * k) PPowersof 2() Flatten() PFlatten() V O( N) V O([k.sup.4]) V O(1) M O( N) M O( N (2) k (4)) M O(1) Mult () PMatrixMult () O([N.sup.[omega]+2] * [[kappa].sup.4]) O(N) Add () PMatrixAdd () O([N.sup.4] * [[kappa].sup.4]) O(1) MultConst () PMultConst () O([N.sup.4+[omega]] * [[kappa].sup.8]) O( N) Table 2. The performance comparisons for the usual FHE schemes Scheme DGHV BGV Performance O([[lambda].sup.14]) [??] ([[lambda].sup.2]) Scheme Bra12 Performance [??] ([[lambda].sup.6]) Scheme GSW PGSW Performance O([N.sup.4+[omega]]*[[kappa].sup.4]) O( N) Table 3. the experimental results basing on Cannon algorithm Experiment 1 Experiment 2 Experiment 3 200 1 200 4 200 8 3.853772s 1.530456s 0.745632s Experiment 4 Experiment 5 Experiment 5 400 1 400 4 400 8 5.772368s 3.163587s 1.465911s

Printer friendly Cite/link Email Feedback | |

Author: | Tan, Delin; Wang, Huajun |
---|---|

Publication: | KSII Transactions on Internet and Information Systems |

Article Type: | Report |

Date: | Jan 1, 2018 |

Words: | 8591 |

Previous Article: | Security Analysis of the PHOTON Lightweight Cryptosystem in the Wireless Body Area Network. |

Next Article: | Adaptive Deadline-aware Scheme (ADAS) for Data Migration between Cloud and Fog Layers. |

Topics: |