Parallel algorithm for multidimensional matrix operations representation.Introduction A number of linear systems have matrix operations as the core concept. Numerical applications have efficient matrix multiplication Noun 1. matrix multiplication  the multiplication of matrices matrix operation  a mathematical operation involving matrices as their critical component. Scientific codes have a large part consisting of matrix computations, and these result in less efficient supercomputers. But, majority of research work done so far is around 2 D matrices. Little is done on 3 D or higher dimensional matrix. A rich set of array intrinsic functions is provided by FORTRAN 90. These operate on elements of multidimensional array objects in a concurrent manner, thus stressing on the importance of efficient matrix operations. A collection of 2 D matrices can be seen as a multidimensional matrix. As such, 2 D matrices are used as building blocks for multidimensional matrices. Representation of multidimensional matrix The EKMR EKMR Enhanced Karnaugh Map Representation was first proposed in (15). Brief overview of the algorithm K Map technique is sought to be the most efficient tool today with Boolean functions. It provides instant recognition of basic patterns, and can be used to represent all possible combinations in minimal terms. Fig. 1 displays some examples. It is clear that ninput Karnaugh map The Karnaugh map, also known as a Veitch diagram (Kmap or KVmap for short), is a tool to facilitate management of Boolean algebraic expressions. A Karnaugh map is unique in that only one variable changes value between squares, in other words, the rows and columns are uses n variables to reserve memory storage and represents all the [2.sup.n] possible combinations. Furthermore, any n input Karnaugh map can be drawn on a plane easily, where n <=4. [FIGURE 1 OMITTED] We use the concept of K map to propose our new matrix representation in EKMR. Since EKRM (1) is simply a 1 D array, no new definition is needed. Similarly, EKMR (2) is the traditional 2 D matrix. Therefore, EKMR (n) has the same representation as TMR TMR total mixed ration. TMR 1 Trainable mentally retarded 2 Transmyocardial revascularization, see there (n), for n=1,2 ....... We now consider EKMR (3) and EKMR (4) for these are the basic building blocks of EKMR(n). We use an example to illustrate the structure of EKMR(3). Let A[k][i][j] denote a 3 D matrix in TMR(3). Fig. 2 displays two ways to view a 3 D matrix with a size of 3x4x5. The corresponding EKMR denoted by A[i][j] is shown in fig. 3, where it is represented by a 2 D matrix with size of 4x(3x5). The basic difference between TMR(3) and EKMR(3) is the placement of elements along the direction indexed by k. In EKMR(3), we use index variable i' to indicate the row direction and j' to indicate the column direction. Notice that index i' is the same as I, whereas j' is a combination of j and k. The analogy between EKMR(3) and Karnaugh map with three input is as follows[Fig. 3 and Fig. 1(c)]. Index variable i,j and k corresponds to variable X, Y and Z respectively. The way to obtain EKMR(4) from EKMR(3) is similar to obtain a four input Karnaugh Map form a three input(1). Fig. 4 illustrates that a (2x4)x(3x5) matrix in EKMR (4) can be used to represents a 2x3x4x5 matrix in TMR (4). The structure EKMR(n) can be found in (15). [FIGURE 2 OMITTED] [FIGURE 3 OMITTED] [FIGURE 4 OMITTED] Matrix Operations in EKMR(3) Matrix operations based on EKMR have been represented in previous literature(15). In this section, we study parallel algorithms for 3 D matrix operations. A data parallel algorithm In computer science, a parallel algorithm, as opposed to a traditional serial algorithm, is one which can be executed a piece at a time on many different processing devices, and then put back together again at the end to get the correct result. can be divided into three phases : data partition, local computation, and result collection. These three phases are briefly examined and then parallel algorithm is presented. This is given as below: Data partition: Data is partitioned and distributed into processor for parallel operations. Row, column and 2D are three common schemes for the data partition in matrix algorithm. We have seen that matrices in TMR3 are stored in 2D fashion. Fig. 5 shows the three common ways to perform data partition in TMR(3) and EKMR(3). If noncontinuous data are partitioned into a processor, we must collect each block, and store them into a buffer, before sending data to the processor. Usually, the cost of collecting the data and sending them to processors cannot be ignored. Hence, data partition is an important issue in designing parallel algorithm. Many researches have worked on reducing the cost of collecting the data. Local computation: To perform the computation based on the algorithm and partial data. The work in this phase is the same as the original sequential algorithm. However, there might be some changes in the scope of operation data e.g. change of indices for matrix operations. Result collection: For the final report, the collection of result computed and scattered among processors is necessary. This phase is similar to data partition. If the partial results computed are noncontinuous, they must be fragmented into blocks and then placed in their appropriate locations. [FIGURE 5 OMITTED] Row Partition Algorithm Since both TMR(n) and EKMR(n) employ row major storage scheme, we choose to use row partition in designing our parallel algorithm. To get a better performance, we duplicate vector x or matrix B on all processors, and do partition on matrix A in all our designs. The data partition and result collection procedures are almost the same for all our algorithms. These two are examined first, and thereafter the local computation. Data Partition and Result Collection : Let A[k][i][j] be an nxnxn matrix presented in TMR(3) and there are P processors in the parallel system. It is already observed that TMR(3) can be viewed as a collection (along the k direction) of 2D matrices(along the ixj direction).Therefore, row partition for TMR(3) can be obtained by partitioning each 2D matrix row wise and repeating the partition along the k direction. More precisely, let r = n% P, and row size denote the number of data rows assigned to each processor. With some arithmetic, it can be seen that row_size = [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII ASCII or American Standard Code for Information Interchange, a set of codes used to represent letters, numbers, a few symbols, and control characters. Originally designed for teletype operations, it has found wide application in computers. ] for the first r processors and row size = [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] for the remaining (Pr) processors. Let A'[i'][j'] be an nx(nxn) matrix represented in EKMR(3). It can be seen that matrix A' consists of n rows and each row contains [n.sup.2] elements. Partition A' by row means that we should assign row_size rows to each processor , where row_size is the same as that in the case of TMR(3). Since EKMR(3) is basically a 2D matrix, elements in the same row are stored continually . There is no collection involved in the data partition phase. Local computation : Generally, the local computation is the same as the original sequential algorithm presented in (15). Except that there might be some change in the scope of operation data for brevity Brevity Adonis’ garden of short life. [Br. Lit.: I Henry IV] bubbles symbolic of transitoriness of life. [Art: Hall, 54] cherry fair cherry orchards where fruit was briefly sold; symbolic of transience. , we do not repeat our work and list the matrixmatrix multiplication algorithm using p processors for TMR and EKMR in figs. 6 and 7 respectively. Data Partition Procedure Algorithm For (p_id=0; p_id<P ; p_id++) row_size=integer(discard fractional part if <0.5,and round off if >0.5)./*To change the scope for the index of sequential algorithm*/ for (k=0; k<n; k++) for( i=0; i< row size; for (j=0; j<n; for (m=0; m<n; m++) C[k][i][j]= C[k][i][j] + A[k][i][m] x B[k][m][j]; /*local result matrix size is row_size x n Result collection procedure algorithm Fig. 6 row partition matrixmatrix multiplication algorithm in TMR(3) The same Matrix multiplication program we can write in Fortran. There is double precisions are used. subroutine sdgemm(M, A, B, C) c c .. Parameters .. integer M double precision A(M,M) double precision B(M,M) double precision C(M,M) double precision Ctemp c c .. Local variables .. integer i integer j integer k c c .. Nested loops .. do i=1,M do j=1,M Ctemp = C(i,j) do k=1,M Ctemp = Ctemp + A(i,k)*B(k,j) end do C(i,j) = Ctemp end do end do c return c end Data Partition Procedure Algorithm For (p_id=0;p_id<P ; p_id++) row_size= [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]/*To change the scope for the index of sequential algorithm*/ for (j=0; j<n;j++) v= j x n ; for(i=0; i< row size; for (m=0; m<n; m++) r = mxn ; for (k=0; k<n;k++) C'[i][k+r]= C'[i][k+r] + A'[i][k+v] x B'j][k+r]; /*local result matrix size is row size x [n.sup.2*/] Result collection procedure algorithm Fig. 7 Row partition matrixmatrix multiplication algorithm in EKMR(3) For same algorithm We can supposed to write a program using multidimensional arrays for a 6x6 matrix multiplication in C++ programming language. The algorithm for matrix multiplication using C++ #include <iostream> #include <fstream> using namespace std; int main() { ifstream infile("MATRLX.dat"); ofstream outfile ("RESULT.dat"); int m1[6][6], m2[6][6], M[6][6]; int i=0, j=0; for (i=0; i<6; i++) { for (j=0; j<6; j++) {infile>>m1[i][j] >>m2[i][j]; M[i][j] = m1[0][j]*m2[i][0]; cout<<M[i][j]<<" "; } cout<<endl; } infile.close (); outfile.close (); return 0; } Performance Both EKMR and TMR matrices are examined in this section. All results indicate that EKMR algorithms are more efficient than the TMR algorithms. In previous sections, we have seen that all our parallel algorithms consists of three phases. Compression schemes based on two representation schemes, the amount of work for local computation is almost the same. Two important roles in deciding final performance, are partition and result collection. In this paper it is noteworthy mentioned that buffer and collection is needed if original data partitioned to processor is not stored In continuous memory. Partition(Row ,Column and 2Dimension): in row partition for TMR(3), the first n/P rows in the first plane is assigned to processor 1. Then the second is assigned to processor 2 and so on. There are n planes along the k direction. Hence there are n noncontinuous data blocks for processor 1. Same is the case with other processors. Therefore, the total number of non continuous data segments would be summed to Pn. Since data assigned to each processor is not continuous, a buffer is needed for temporary storage and all the [n.sup.3] elements will be copied to buffer. The structure of EKMR(3) is exactly a 2D matrix in row major storage. Therefore, all elements assigned to each processor are in continuous memory locations. Hence, no buffer is needed and both matrices are 0. In column partition for TMR(3) , there are n non continuous data blocks for elements in the first plane assigned to the processor 1. Again, there are n plains along the k direction. Hence, there are Q[n.sup.2] noncontinuums blocks for data assigned to each processor. This gives a total of P [n.sup.2] noncontinuous data blocks. For EKMR (3), there are noncontinuous blocks (along the idirection) for each processor. Hence, there are Pn continuous and noncontinuous data blocks for all processors. Since in both cases data are stored in noncontinues blocks, there are [n.sup.3] data elements to be collected. Assuming that [n.sup.3] elements to be partitioned into PxQ processors. For TMR(3), there are n /P noncontinuous blocks assigned to processor 1 in the first plane and n total planes. Hence, the number of non continuous data blocks is P x Q x n x (n/P), which gives Q [n.sup.2]. Therefore, the number of noncontinuous data blocks is Qn. All [n.sup.3] elements need to be collected in both cases, since they are stored in noncontinuous memory. The analytical result for the threedata partition schemes in TMR (3) and EKMR(3) are summarized in the table. It can be seen that our proposed parallel algorithms based on EKMR(3) should perform better than those based on TMR(3), regardless of in or outofcore environment. Result for TMR (4) and EKMR(4) can be obtained similarly. The matrix table summarize their cost of a matrix with size nxnxnxn. Results Performance of our proposed algorithm, we have implemented them on an IBM SP (IBM Scalable POWER) A family of massively parallel (MPP) computer systems from IBM based on its RS/6000 (pSeries) models that incorporate various POWER and PowerPC CPUs. First introduced in 1993, SP configurations support from two to 512 processors. 2 machine presented in [19]. Parallel algorithms are implemented in C+MPI MPI  Message Passing Interface using SMPS SMPS Society for Marketing Professional Services SMPS Switching Mode Power Supply SMPS Switched Mode Power Supply model. Experiments for parallel algorithms consists of two parts. In the first part we study effect of matrix size, which were run on SP2 with Three nods . In the second part investigate effect of number of processors in the parallel machine, which were run with a fixed matrix size 100x100x100 for three Dimensional matrix size and 30x30x30 for four Dimensional matrix and processors number varying from two to sixteen. Performance of parallel algorithms in EKMR have presented in [15]. Now study performance of parallel algorithms. In addition to execution time, We use comparison metric called relative performance which is calculated as : Performance (%) = [Time(TMR_Alg)  Time(EKMR_Alg)/Time(TMR_Alg)] x 100 [FIGURE 8 OMITTED] [FIGURE 9 OMITTED] [FIGURE 10 OMITTED] Performance of algorithm in EKMR(4): To see the how well the EKMR scheme can be extended to higher dimensions , we move our focus on to EKMR(4). We have seen in the table1 that the row partition scheme should perform better then column and 2D. we have implemented matrix operation algorithms based on row partition experiment in order to further gain to inside . as we expected experimental results indicate that algorithm based on EKMR outperforms that based on TMR . In fact the difference 5260% in execution time and 56 65% in relative performance for addition/ subtraction subtraction, fundamental operation of arithmetic; the inverse of addition. If a and b are real numbers (see number), then the number a−b is that number (called the difference) which when added to b (the subtractor) equals operation, about 4252% in execution time and 4565% in relative performance for matrixvector performance, about 2632% in execution time and 3045% in relative performance for matrixmatrix multiplication. Conclusion EKMR is a new structure fore representing a multi dimensional matrix. This structure consists of a tree with 2 D matrices in its leaves. Development of parallel matrix operation algorithm has been done on the basis of EKMR. This has come out to be a better performer than TMR base algorithms. Further work based on this may consist of : (1) Cache effect study of various algorithms. (2) Development of compression schemes for sparse matrices in the form of EKMR and TMR. (3) Application of the proposed algorithm in the research of multidimensional data cube For the data mining concept, see, see . In computer programming contexts, a data cube is a three (or higher) dimensional array of values, commonly used to describe a time series of image data. . (4) Application of other schemes like tensor product (mathematics) tensor product  A function of two vector spaces, U and V, which returns the space of linear maps from V's dual to U. Tensor product has natural symmetry in interchange of U and V and it produces an associative "multiplication" on vector spaces. of Strassen's method in the proposed new matrix multiplication algorithm for further improvement in performance. References [1] Aart J.C. Bik and Harry A.G. Wiljshoff, "Compilation Techniques for Spare Matrix Computations", In Proceedings of Supercomputing, pages 430439, 1993. [2] Aart J.C. Bik, Peter M.W. Knijnenburg, and Harry A.G. Wijshoff, "Reshaping Access Patterns for Generating Sparse Codes," In Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing Solving a problem with multiple computers or computers made up of multiple processors. It is an umbrella term for a variety of architectures, including symmetric multiprocessing (SMP), clusters of SMP systems, massively parallel processors (MPPs) and grid computing. , 1994. [3] Aart J.C. Bik and Harry A.G. Wiljshoff, "Automatic Data Structure Selection and Transformation for Spare Matrix Computations", Technical Report, 1992. [4] J.K. Cullum and R.A. Willoughby, "Algorithm for Large Symmetric Eignenvalue Computations", (Birkhauser, Boston, 1985) [5] B.B. Fraguela, R. Doallo, E.L. Zapata, "Modeling Set Associative as·so·ci·a·tive adj. 1. Of, characterized by, resulting from, or causing association. 2. Mathematics Independent of the grouping of elements. Caches Behaviour for Irregular Computations," ACM (Association for Computing Machinery, New York, www.acm.org) A membership organization founded in 1947 dedicated to advancing the arts and sciences of information processing. In addition to awards and publications, ACM also maintains special interest groups (SIGs) in the computer field. International Conference on Measurement and Modeling of Computer systems (SIGMETRICS'98), pp. 192201, June, 1998. [6] B.B. Fraguela, R. Doallo, E.L. Zapata, "Cache Misses Prediction for high Performance Sparse Algorithms." 4th International EuroPar Conference (EuroPar'98) pp.224233, September 1998. [7] B.B. Fraguela, R. Doallo, E.L. Zapata, "Cache Probabilistic Modeling for Basic Sparse Algebra algebra, branch of mathematics concerned with operations on sets of numbers or other elements that are often represented by symbols. Algebra is a generalization of arithmetic and gains much of its power from dealing symbolically with elements and operations (such as Kernels Involving Matrices with a NonUniform Distribution,"24th IEEE (Institute of Electrical and Electronics Engineers, New York, www.ieee.org) A membership organization that includes engineers, scientists and students in electronics and allied fields. Euromicro Conference, pp 345348, August 1998. [8] B.B. Fraguela, R. Doallo, E.L. Zapata, "Automatic Analytical Modeling for the estimation of Cache Misses", International Conference on Parallel Architectures and Compilation Techniques (PACT'99), October 1999. [9] G.H. Golub and C.F. Van Loan, Matrix Computations, 2nd ed. (John Hopkins Univ. Press, Baltimore, 1989) [10] Mahmut Kandemir, J. Ramanujam, Alok Choudhary, "Improving Cache Locality by a Combination of Loop and Data Transformations, "IEEE Trans. On Computers Vol. 48, Ao. 2, February,1999. [11] Vladimir Kotlyar, Keshav Pingali, and Paul Stodghill, "Compiling Parallel Sparse Code for UserDefined Data Structures," In Proceedings of 8th SIAM Conference on Parallel Processing parallel processing, the concurrent or simultaneous execution of two or more parts of a single computer program, at speeds far exceeding those of a conventional computer. for Scientific Computing, March,1997. [12] Vladimir Kotlyar, Keshav Pingali, and Paul Stodghill, A Programs", In Euro Par, August 1997. [13] Vladimir Kotlyar, Keshav Pingali, and Paul Stodghill, A, "Compiling Parallel Code for Sparse Matrix In the mathematical subfield of numerical analysis a sparse matrix is a matrix populated primarily with zeros. Conceptually, sparsity corresponds to systems which are loosely coupled. Applications," In Proceedings of the Supercomputing Conference,August,1997. [14] B. Kumar, C.H. Huang, R.W. Johnson and P. Sadayappan, "A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm with Memory Reduction," In Proceedings of the 7th International Parallel Processing Symposium,1998. [15] JenShiuh Liu, JiunYuan Lin and YehChing Chung, "Efficient Representation for MultiDimensional Matrix Operation", Proceedings of Workshop on Compiler Techniques for High Performance Computing,pp. 133 144, March, 2000, Taiwan. [16] W.H. Press, S.A. Teukolsky, W.T. Vetterling and B.P. Flannery, Numerical Recipes, The Art of Scientific Computing, 2nd ed.(Cambridge University Press Cambridge University Press (known colloquially as CUP) is a publisher given a Royal Charter by Henry VIII in 1534, and one of the two privileged presses (the other being Oxford University Press). ,1992). [17] Peter D. Sulatycke and Kanad Ghosh, "Caching Efficient Multithreaded Fast Multiplication of Sparse Matrices," In Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing The first term used to describe the distribution of multiple computers throughout an organization in contrast to a centralized system. It started with the first minicomputers. Today, distributed processing is called "distributed computing." See also client/server. , 1998. [18] James B. White James Bain White (June 26, 1835  October 9, 1897) was a U.S. Representative from Indiana. Born in Stirlingshire, Scotland, White attended the common schools.Immigrated to the United States in 1854. He settled in Fort Wayne, Indiana.Calico printer.Tailor. , III and P.Sadayappan," On Improving the Performance of Sparse MatrixMatrix Vector Multiplication", "IEEE Conference, 1997. Satya Prakash Singh Prakash Singh VC (1 April 1913, Jammu & Kashmir  17 February 1945) was an Indian recipient of the Victoria Cross, the highest and most prestigious award for gallantry in the face of the enemy that can be awarded to British and Commonwealth forces. 1, Anil K Ahlawat2 and PC Saxena3 (1) Sr. Lecturer, Computer Science & Engg. Birla Institute of Technology
* Email: spsinghbit@yahoo.co.in, spsingh11in@yahoo.co.in (2) Professor, Ajay Kumar Garg Engineering College Ajay Kumar Garg Engineering College, Ghaziabad is an Engineering and Management college located in Ghaziabad, a suburb of Delhi, India. The college is affiliated to Uttar Pradesh Technical University, Lucknow, Uttar Pradesh. , Ghaziabad, India (3) Professor, JNU JNU Jawaharlal Nehru University (India) JNU Juneau, AK, USA  Juneau (Airport Code) , New Delhi New Delhi (dĕl`ē), city (1991 pop. 294,149), capital of India and of Delhi state, N central India, on the right bank of the Yamuna River. , India Table 1: Cost for partition procedure in three and fourdimensional matrices. No. of elements to be collected TMR(3) EKMR(3) TMR(4) EKMR(4) Row Partition [n.sup.3] 0 [n.sup.4] 0 (P processor) Column Partition [n.sup.3] [n.sup.3] [n.sup.4] [n.sup.4] (P processor) 2D Partition [n.sup.3] [n.sup.3] [n.sup.4] [n.sup.4] (PxQ processor) No. of noncontinuous blocks TMR(3) EKMR(3) TMR(4) EKMR(4) Row Partition Pn 0 [Pn.sup.2] 0 (P processor) Column Partition [Pn.sup.2] Pn [Pn.sup.3] Pn (P processor) 2D Partition [Qn.sup.2] Qn [Qn.sup.3] Qn (PxQ processor) 

Reader Opinion