# Parallel algorithm for multi-dimensional matrix operations representation.

IntroductionA number of linear systems have matrix operations as the core concept. Numerical applications have efficient matrix multiplication as their critical component. Scientific codes have a large part consisting of matrix computations, and these result in less efficient supercomputers.

But, majority of research work done so far is around 2 D matrices. Little is done on 3 D or higher dimensional matrix. A rich set of array intrinsic functions is provided by FORTRAN 90. These operate on elements of multi-dimensional array objects in a concurrent manner, thus stressing on the importance of efficient matrix operations. A collection of 2 D matrices can be seen as a multi-dimensional matrix. As such, 2 D matrices are used as building blocks for multi-dimensional matrices.

Representation of multi-dimensional matrix

The EKMR was first proposed in (15). Brief overview of the algorithm K -Map technique is sought to be the most efficient tool today with Boolean functions. It provides instant recognition of basic patterns, and can be used to represent all possible combinations in minimal terms. Fig. 1 displays some examples. It is clear that n-input Karnaugh map uses n variables to reserve memory storage and represents all the [2.sup.n] possible combinations. Furthermore, any n input Karnaugh map can be drawn on a plane easily, where n <=4.

[FIGURE 1 OMITTED]

We use the concept of K map to propose our new matrix representation in EKMR. Since EKRM (1) is simply a 1 D array, no new definition is needed. Similarly, EKMR (2) is the traditional 2 D matrix. Therefore, EKMR (n) has the same representation as TMR(n), for n=1,2 ....... We now consider EKMR (3) and EKMR (4) for these are the basic building blocks of EKMR(n).

We use an example to illustrate the structure of EKMR(3). Let A[k][i][j] denote a 3 D matrix in TMR(3). Fig. 2 displays two ways to view a 3 D matrix with a size of 3x4x5. The corresponding EKMR denoted by A[i][j] is shown in fig. 3, where it is represented by a 2 D matrix with size of 4x(3x5). The basic difference between TMR(3) and EKMR(3) is the placement of elements along the direction indexed by k. In EKMR(3), we use index variable i' to indicate the row direction and j' to indicate the column direction. Notice that index i' is the same as I, whereas j' is a combination of j and k. The analogy between EKMR(3) and Karnaugh map with three- input is as follows[Fig. 3 and Fig. 1(c)]. Index variable i,j and k corresponds to variable X, Y and Z respectively. The way to obtain EKMR(4) from EKMR(3) is similar to obtain a four input Karnaugh Map form a three input(1). Fig. 4 illustrates that a (2x4)x(3x5) matrix in EKMR (4) can be used to represents a 2x3x4x5 matrix in TMR (4). The structure EKMR(n) can be found in (15).

[FIGURE 2 OMITTED]

[FIGURE 3 OMITTED]

[FIGURE 4 OMITTED]

Matrix Operations in EKMR(3)

Matrix operations based on EKMR have been represented in previous literature(15). In this section, we study parallel algorithms for 3 D matrix operations. A data parallel algorithm can be divided into three phases : data partition, local computation, and result collection. These three phases are briefly examined and then parallel algorithm is presented. This is given as below:

Data partition: Data is partitioned and distributed into processor for parallel operations. Row, column and 2D are three common schemes for the data partition in matrix algorithm. We have seen that matrices in TMR3 are stored in 2D fashion. Fig. 5 shows the three common ways to perform data partition in TMR(3) and EKMR(3). If non-continuous data are partitioned into a processor, we must collect each block, and store them into a buffer, before sending data to the processor. Usually, the cost of collecting the data and sending them to processors cannot be ignored. Hence, data partition is an important issue in designing parallel algorithm. Many researches have worked on reducing the cost of collecting the data.

Local computation: To perform the computation based on the algorithm and partial data. The work in this phase is the same as the original sequential algorithm. However, there might be some changes in the scope of operation data e.g. change of indices for matrix operations.

Result collection: For the final report, the collection of result computed and scattered among processors is necessary. This phase is similar to data partition. If the partial results computed are non-continuous, they must be fragmented into blocks and then placed in their appropriate locations.

[FIGURE 5 OMITTED]

Row Partition Algorithm Since both TMR(n) and EKMR(n) employ row major storage scheme, we choose to use row partition in designing our parallel algorithm. To get a better performance, we duplicate vector x or matrix B on all processors, and do partition on matrix A in all our designs. The data partition and result collection procedures are almost the same for all our algorithms. These two are examined first, and thereafter the local computation.

Data Partition and Result Collection : Let A[k][i][j] be an nxnxn matrix presented in TMR(3) and there are P processors in the parallel system. It is already observed that TMR(3) can be viewed as a collection (along the k direction) of 2D matrices(along the ixj direction).Therefore, row partition for TMR(3) can be obtained by partitioning each 2D matrix row wise and repeating the partition along the k direction. More precisely, let r = n% P, and row size denote the number of data rows assigned to each processor. With some arithmetic, it can be seen that row_size = [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] for the first r processors and row size = [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] for the remaining (P-r) processors.

Let A'[i'][j'] be an nx(nxn) matrix represented in EKMR(3). It can be seen that matrix A' consists of n rows and each row contains [n.sup.2] elements. Partition A' by row means that we should assign row_size rows to each processor , where row_size is the same as that in the case of TMR(3). Since EKMR(3) is basically a 2D matrix, elements in the same row are stored continually . There is no collection involved in the data partition phase.

Local computation : Generally, the local computation is the same as the original sequential algorithm presented in (15). Except that there might be some change in the scope of operation data for brevity, we do not repeat our work and list the matrix-matrix multiplication algorithm using p processors for TMR and EKMR in figs. 6 and 7 respectively.

Data Partition Procedure Algorithm

For (p_id=0; p_id<P ; p_id++) row_size=integer(discard fractional part if <0.5,and round off if >0.5)./*To change the scope for the index of sequential algorithm*/ for (k=0; k<n; k++) for( i=0; i< row size; for (j=0; j<n; for (m=0; m<n; m++) C[k][i][j]= C[k][i][j] + A[k][i][m] x B[k][m][j]; /*local result matrix size is row_size x n

Result collection procedure algorithm

Fig. 6 row partition matrix-matrix multiplication algorithm in TMR(3) The same Matrix multiplication program we can write in Fortran. There is double precisions are used. subroutine sdgemm(M, A, B, C) c c .. Parameters .. integer M double precision A(M,M) double precision B(M,M) double precision C(M,M) double precision Ctemp c c .. Local variables .. integer i integer j integer k c c .. Nested loops .. do i=1,M do j=1,M Ctemp = C(i,j) do k=1,M Ctemp = Ctemp + A(i,k)*B(k,j) end do C(i,j) = Ctemp end do end do c return c end

Data Partition Procedure Algorithm

For (p_id=0;p_id<P ; p_id++) row_size= [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]/*To change the scope for the index of sequential algorithm*/ for (j=0; j<n;j++) v= j x n ; for(i=0; i< row size; for (m=0; m<n; m++) r = mxn ; for (k=0; k<n;k++) C'[i][k+r]= C'[i][k+r] + A'[i][k+v] x B'j][k+r]; /*local result matrix size is row size x [n.sup.2*/]

Result collection procedure algorithm

Fig. 7 Row partition matrix-matrix multiplication algorithm in EKMR(3) For same algorithm We can supposed to write a program using multi-dimensional arrays for a 6x6 matrix multiplication in C++ programming language. The algorithm for matrix multiplication using C++ #include <iostream> #include <fstream> using namespace std; int main() { ifstream infile("MATRLX.dat"); ofstream outfile ("RESULT.dat"); int m1[6][6], m2[6][6], M[6][6]; int i=0, j=0; for (i=0; i<6; i++) { for (j=0; j<6; j++) {infile>>m1[i][j] >>m2[i][j]; M[i][j] = m1[0][j]*m2[i][0]; cout<<M[i][j]<<" "; } cout<<endl; } infile.close (); outfile.close (); return 0; }

Performance

Both EKMR and TMR matrices are examined in this section. All results indicate that EKMR algorithms are more efficient than the TMR algorithms.

In previous sections, we have seen that all our parallel algorithms consists of three phases. Compression schemes based on two representation schemes, the amount of work for local computation is almost the same. Two important roles in deciding final performance, are partition and result collection.

In this paper it is noteworthy mentioned that buffer and collection is needed if original data partitioned to processor is not stored In continuous memory.

Partition(Row ,Column and 2-Dimension): in row partition for TMR(3), the first n/P rows in the first plane is assigned to processor 1. Then the second is assigned to processor 2 and so on. There are n planes along the k direction. Hence there are n non-continuous data blocks for processor 1. Same is the case with other processors. Therefore, the total number of non -continuous data segments would be summed to Pn. Since data assigned to each processor is not continuous, a buffer is needed for temporary storage and all the [n.sup.3] elements will be copied to buffer. The structure of EKMR(3) is exactly a 2D matrix in row major storage. Therefore, all elements assigned to each processor are in continuous memory locations. Hence, no buffer is needed and both matrices are 0.

In column partition for TMR(3) , there are n non- continuous data blocks for elements in the first plane assigned to the processor 1. Again, there are n plains along the k direction. Hence, there are Q[n.sup.2] non-continuums blocks for data assigned to each processor. This gives a total of P [n.sup.2] non-continuous data blocks. For EKMR (3), there are non-continuous blocks (along the i-direction) for each processor. Hence, there are Pn continuous and non-continuous data blocks for all processors. Since in both cases data are stored in non-continues blocks, there are [n.sup.3] data elements to be collected.

Assuming that [n.sup.3] elements to be partitioned into PxQ processors. For TMR(3), there are n /P non-continuous blocks assigned to processor 1 in the first plane and n total planes. Hence, the number of non -continuous data blocks is P x Q x n x (n/P), which gives Q [n.sup.2]. Therefore, the number of non-continuous data blocks is Qn. All [n.sup.3] elements need to be collected in both cases, since they are stored in non-continuous memory.

The analytical result for the three-data partition schemes in TMR (3) and EKMR(3) are summarized in the table. It can be seen that our proposed parallel algorithms based on EKMR(3) should perform better than those based on TMR(3), regardless of in or out-of-core environment. Result for TMR (4) and EKMR(4) can be obtained similarly. The matrix table summarize their cost of a matrix with size nxnxnxn.

Results

Performance of our proposed algorithm, we have implemented them on an IBM SP2 machine presented in [19]. Parallel algorithms are implemented in C+MPI using SMPS model. Experiments for parallel algorithms consists of two parts. In the first part we study effect of matrix size, which were run on SP2 with Three nods . In the second part investigate effect of number of processors in the parallel machine, which were run with a fixed matrix size 100x100x100 for three Dimensional matrix size and 30x30x30 for four Dimensional matrix and processors number varying from two to sixteen.

Performance of parallel algorithms in EKMR have presented in [15]. Now study performance of parallel algorithms. In addition to execution time, We use comparison metric called relative performance which is calculated as :

Performance (%) = [Time(TMR_Alg) - Time(EKMR_Alg)/Time(TMR_Alg)] x 100

[FIGURE 8 OMITTED]

[FIGURE 9 OMITTED]

[FIGURE 10 OMITTED]

Performance of algorithm in EKMR(4): To see the how well the EKMR scheme can be extended to higher dimensions , we move our focus on to EKMR(4). We have seen in the table1 that the row partition scheme should perform better then column and 2D. we have implemented matrix operation algorithms based on row partition experiment in order to further gain to inside . as we expected experimental results indicate that algorithm based on EKMR outperforms that based on TMR . In fact the difference 52-60% in execution time and 56- 65% in relative performance for addition/ subtraction operation, about 42-52% in execution time and 45-65% in relative performance for matrix--vector performance, about 26-32% in execution time and 30-45% in relative performance for matrix-matrix multiplication.

Conclusion

EKMR is a new structure fore representing a multi dimensional matrix. This structure consists of a tree with 2 D matrices in its leaves. Development of parallel matrix operation algorithm has been done on the basis of EKMR. This has come out to be a better performer than TMR base algorithms. Further work based on this may consist of :

(1) Cache effect study of various algorithms.

(2) Development of compression schemes for sparse matrices in the form of EKMR and TMR.

(3) Application of the proposed algorithm in the research of multi-dimensional data cube.

(4) Application of other schemes like tensor product of Strassen's method in the proposed new matrix multiplication algorithm for further improvement in performance.

References

[1] Aart J.C. Bik and Harry A.G. Wiljshoff, "Compilation Techniques for Spare Matrix Computations", In Proceedings of Supercomputing, pages 430-439, 1993.

[2] Aart J.C. Bik, Peter M.W. Knijnenburg, and Harry A.G. Wijshoff, "Reshaping Access Patterns for Generating Sparse Codes," In Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing, 1994.

[3] Aart J.C. Bik and Harry A.G. Wiljshoff, "Automatic Data Structure Selection and Transformation for Spare Matrix Computations", Technical Report, 1992.

[4] J.K. Cullum and R.A. Willoughby, "Algorithm for Large Symmetric Eignenvalue Computations", (Birkhauser, Boston, 1985)

[5] B.B. Fraguela, R. Doallo, E.L. Zapata, "Modeling Set Associative Caches Behaviour for Irregular Computations," ACM International Conference on Measurement and Modeling of Computer systems (SIGMETRICS'98), pp. 192-201, June, 1998.

[6] B.B. Fraguela, R. Doallo, E.L. Zapata, "Cache Misses Prediction for high Performance Sparse Algorithms." 4th International Euro-Par Conference (Euro-Par'98) pp.224-233, September 1998.

[7] B.B. Fraguela, R. Doallo, E.L. Zapata, "Cache Probabilistic Modeling for Basic Sparse Algebra Kernels Involving Matrices with a Non-Uniform Distribution,"24th IEEE Euromicro Conference, pp 345-348, August 1998.

[8] B.B. Fraguela, R. Doallo, E.L. Zapata, "Automatic Analytical Modeling for the estimation of Cache Misses", International Conference on Parallel Architectures and Compilation Techniques (PACT'99), October 1999.

[9] G.H. Golub and C.F. Van Loan, Matrix Computations, 2nd ed. (John Hopkins Univ. Press, Baltimore, 1989)

[10] Mahmut Kandemir, J. Ramanujam, Alok Choudhary, "Improving Cache Locality by a Combination of Loop and Data Transformations, "IEEE Trans. On Computers Vol. 48, Ao. 2, February,1999.

[11] Vladimir Kotlyar, Keshav Pingali, and Paul Stodghill, "Compiling Parallel Sparse Code for User-Defined Data Structures," In Proceedings of 8th SIAM Conference on Parallel Processing for Scientific Computing, March,1997.

[12] Vladimir Kotlyar, Keshav Pingali, and Paul Stodghill, A Programs", In Euro Par, August 1997.

[13] Vladimir Kotlyar, Keshav Pingali, and Paul Stodghill, A, "Compiling Parallel Code for Sparse Matrix Applications," In Proceedings of the Supercomputing Conference,August,1997.

[14] B. Kumar, C.H. Huang, R.W. Johnson and P. Sadayappan, "A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm with Memory Reduction," In Proceedings of the 7th International Parallel Processing Symposium,1998.

[15] Jen-Shiuh Liu, Jiun-Yuan Lin and Yeh-Ching Chung, "Efficient Representation for Multi-Dimensional Matrix Operation", Proceedings of Workshop on Compiler Techniques for High Performance Computing,pp. 133 144, March, 2000, Taiwan.

[16] W.H. Press, S.A. Teukolsky, W.T. Vetterling and B.P. Flannery, Numerical Recipes, The Art of Scientific Computing, 2nd ed.(Cambridge University Press,1992).

[17] Peter D. Sulatycke and Kanad Ghosh, "Caching Efficient Multithreaded Fast Multiplication of Sparse Matrices," In Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, 1998.

[18] James B. White, III and P.Sadayappan," On Improving the Performance of Sparse Matrix-Matrix Vector Multiplication", "IEEE Conference, 1997.

Satya Prakash Singh1, Anil K Ahlawat2 and PC Saxena3

(1) Sr. Lecturer, Computer Science & Engg. Birla Institute of Technology, Mesra, Noida Campus, India

* Email: spsinghbit@yahoo.co.in, spsingh11in@yahoo.co.in

(2) Professor, Ajay Kumar Garg Engineering College, Ghaziabad, India

(3) Professor, JNU, New Delhi, India

Table 1: Cost for partition procedure in three and four-dimensional matrices. No. of elements to be collected TMR(3) EKMR(3) TMR(4) EKMR(4) Row Partition [n.sup.3] 0 [n.sup.4] 0 (P processor) Column Partition [n.sup.3] [n.sup.3] [n.sup.4] [n.sup.4] (P processor) 2D Partition [n.sup.3] [n.sup.3] [n.sup.4] [n.sup.4] (PxQ processor) No. of non-continuous blocks TMR(3) EKMR(3) TMR(4) EKMR(4) Row Partition Pn 0 [Pn.sup.2] 0 (P processor) Column Partition [Pn.sup.2] Pn [Pn.sup.3] Pn (P processor) 2D Partition [Qn.sup.2] Qn [Qn.sup.3] Qn (PxQ processor)

Printer friendly Cite/link Email Feedback | |

Author: | Singh, Satya Prakash; Ahlawat, Anil K.; Saxena, P.C. |
---|---|

Publication: | International Journal of Computational Intelligence Research |

Date: | Jul 1, 2010 |

Words: | 3060 |

Previous Article: | Recognition of audiovisual celebrity in unrestrained web videos. |

Next Article: | On the stability of an Ammensal- harvested enemy species pair with limited resources. |

## Reader Opinion