# Parallel algorithm for multi-dimensional matrix operations representation.

Introduction

A number of linear systems have matrix operations as the core concept. Numerical applications have efficient matrix multiplication Noun 1. matrix multiplication - the multiplication of matrices
matrix operation - a mathematical operation involving matrices
as their critical component. Scientific codes have a large part consisting of matrix computations, and these result in less efficient supercomputers.

But, majority of research work done so far is around 2 D matrices. Little is done on 3 D or higher dimensional matrix. A rich set of array intrinsic functions is provided by FORTRAN 90. These operate on elements of multi-dimensional array objects in a concurrent manner, thus stressing on the importance of efficient matrix operations. A collection of 2 D matrices can be seen as a multi-dimensional matrix. As such, 2 D matrices are used as building blocks for multi-dimensional matrices.

Representation of multi-dimensional matrix

The EKMR EKMR Enhanced Karnaugh Map Representation  was first proposed in (15). Brief overview of the algorithm K -Map technique is sought to be the most efficient tool today with Boolean functions. It provides instant recognition of basic patterns, and can be used to represent all possible combinations in minimal terms. Fig. 1 displays some examples. It is clear that n-input Karnaugh map The Karnaugh map, also known as a Veitch diagram (K-map or KV-map for short), is a tool to facilitate management of Boolean algebraic expressions. A Karnaugh map is unique in that only one variable changes value between squares, in other words, the rows and columns are  uses n variables to reserve memory storage and represents all the [2.sup.n] possible combinations. Furthermore, any n input Karnaugh map can be drawn on a plane easily, where n <=4.

[FIGURE 1 OMITTED]

We use the concept of K map to propose our new matrix representation in EKMR. Since EKRM (1) is simply a 1 D array, no new definition is needed. Similarly, EKMR (2) is the traditional 2 D matrix. Therefore, EKMR (n) has the same representation as TMR TMR

total mixed ration.

TMR 1 Trainable mentally retarded 2 Transmyocardial revascularization, see there
(n), for n=1,2 ....... We now consider EKMR (3) and EKMR (4) for these are the basic building blocks of EKMR(n).

We use an example to illustrate the structure of EKMR(3). Let A[k][i][j] denote a 3 D matrix in TMR(3). Fig. 2 displays two ways to view a 3 D matrix with a size of 3x4x5. The corresponding EKMR denoted by A[i][j] is shown in fig. 3, where it is represented by a 2 D matrix with size of 4x(3x5). The basic difference between TMR(3) and EKMR(3) is the placement of elements along the direction indexed by k. In EKMR(3), we use index variable i' to indicate the row direction and j' to indicate the column direction. Notice that index i' is the same as I, whereas j' is a combination of j and k. The analogy between EKMR(3) and Karnaugh map with three- input is as follows[Fig. 3 and Fig. 1(c)]. Index variable i,j and k corresponds to variable X, Y and Z respectively. The way to obtain EKMR(4) from EKMR(3) is similar to obtain a four input Karnaugh Map form a three input(1). Fig. 4 illustrates that a (2x4)x(3x5) matrix in EKMR (4) can be used to represents a 2x3x4x5 matrix in TMR (4). The structure EKMR(n) can be found in (15).

[FIGURE 2 OMITTED]

[FIGURE 3 OMITTED]

[FIGURE 4 OMITTED]

Matrix Operations in EKMR(3)

Matrix operations based on EKMR have been represented in previous literature(15). In this section, we study parallel algorithms for 3 D matrix operations. A data parallel algorithm In computer science, a parallel algorithm, as opposed to a traditional serial algorithm, is one which can be executed a piece at a time on many different processing devices, and then put back together again at the end to get the correct result.  can be divided into three phases : data partition, local computation, and result collection. These three phases are briefly examined and then parallel algorithm is presented. This is given as below:

Data partition: Data is partitioned and distributed into processor for parallel operations. Row, column and 2D are three common schemes for the data partition in matrix algorithm. We have seen that matrices in TMR3 are stored in 2D fashion. Fig. 5 shows the three common ways to perform data partition in TMR(3) and EKMR(3). If non-continuous data are partitioned into a processor, we must collect each block, and store them into a buffer, before sending data to the processor. Usually, the cost of collecting the data and sending them to processors cannot be ignored. Hence, data partition is an important issue in designing parallel algorithm. Many researches have worked on reducing the cost of collecting the data.

Local computation: To perform the computation based on the algorithm and partial data. The work in this phase is the same as the original sequential algorithm. However, there might be some changes in the scope of operation data e.g. change of indices for matrix operations.

Result collection: For the final report, the collection of result computed and scattered among processors is necessary. This phase is similar to data partition. If the partial results computed are non-continuous, they must be fragmented into blocks and then placed in their appropriate locations.

[FIGURE 5 OMITTED]

Row Partition Algorithm Since both TMR(n) and EKMR(n) employ row major storage scheme, we choose to use row partition in designing our parallel algorithm. To get a better performance, we duplicate vector x or matrix B on all processors, and do partition on matrix A in all our designs. The data partition and result collection procedures are almost the same for all our algorithms. These two are examined first, and thereafter the local computation.

Data Partition and Result Collection : Let A[k][i][j] be an nxnxn matrix presented in TMR(3) and there are P processors in the parallel system. It is already observed that TMR(3) can be viewed as a collection (along the k direction) of 2D matrices(along the ixj direction).Therefore, row partition for TMR(3) can be obtained by partitioning each 2D matrix row wise and repeating the partition along the k direction. More precisely, let r = n% P, and row size denote the number of data rows assigned to each processor. With some arithmetic, it can be seen that row_size = [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII ASCII or American Standard Code for Information Interchange, a set of codes used to represent letters, numbers, a few symbols, and control characters. Originally designed for teletype operations, it has found wide application in computers. ] for the first r processors and row size = [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] for the remaining (P-r) processors.

Let A'[i'][j'] be an nx(nxn) matrix represented in EKMR(3). It can be seen that matrix A' consists of n rows and each row contains [n.sup.2] elements. Partition A' by row means that we should assign row_size rows to each processor , where row_size is the same as that in the case of TMR(3). Since EKMR(3) is basically a 2D matrix, elements in the same row are stored continually . There is no collection involved in the data partition phase.

Local computation : Generally, the local computation is the same as the original sequential algorithm presented in (15). Except that there might be some change in the scope of operation data for brevity Brevity

of short life. [Br. Lit.: I Henry IV]

bubbles

symbolic of transitoriness of life. [Art: Hall, 54]

cherry fair

cherry orchards where fruit was briefly sold; symbolic of transience.
, we do not repeat our work and list the matrix-matrix multiplication algorithm using p processors for TMR and EKMR in figs. 6 and 7 respectively.

Data Partition Procedure Algorithm
```For (p_id=0; p_id<P ; p_id++)
row_size=integer(discard fractional part if <0.5,and round off if
>0.5)./*To
change the scope for the index of sequential algorithm*/
for (k=0; k<n; k++)
for( i=0; i< row size;
for (j=0; j<n;
for (m=0; m<n; m++)

C[k][i][j]= C[k][i][j] + A[k][i][m] x B[k][m][j];

/*local result matrix size is row_size x n
```

Result collection procedure algorithm
```Fig. 6 row partition matrix-matrix multiplication algorithm in TMR(3)

The same Matrix multiplication program we can write in Fortran. There
is double precisions are used.

subroutine sdgemm(M, A, B, C)
c
c   .. Parameters ..
integer M
double precision A(M,M)
double precision B(M,M)
double precision C(M,M)
double precision Ctemp
c
c   .. Local variables ..
integer i
integer j
integer k
c
c   .. Nested loops ..
do i=1,M
do j=1,M
Ctemp = C(i,j)
do k=1,M
Ctemp = Ctemp + A(i,k)*B(k,j)
end do
C(i,j) = Ctemp
end do
end do
c
return
c
end
```

Data Partition Procedure Algorithm
```For (p_id=0;p_id<P ; p_id++)
row_size= [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]/*To
change the scope for the index of sequential
algorithm*/
for (j=0; j<n;j++)
v= j x n ;
for(i=0; i< row size;
for (m=0; m<n; m++)
r = mxn ;
for (k=0; k<n;k++)
C'[i][k+r]= C'[i][k+r] + A'[i][k+v] x B'j][k+r];
/*local result matrix size is row size x [n.sup.2*/]
```

Result collection procedure algorithm
```Fig. 7 Row partition matrix-matrix multiplication algorithm in EKMR(3)

For same algorithm We can supposed to write a program using
multi-dimensional arrays for a 6x6 matrix multiplication in C++
programming language. The algorithm for matrix multiplication using
C++

#include <iostream>
#include <fstream>

using namespace std;
int main()
{
ifstream infile("MATRLX.dat");
ofstream outfile ("RESULT.dat");
int m1[6][6], m2[6][6], M[6][6];
int i=0, j=0;
for (i=0; i<6; i++)
{
for (j=0; j<6; j++)
{infile>>m1[i][j]
>>m2[i][j];
M[i][j] = m1[0][j]*m2[i][0];
cout<<M[i][j]<<" ";
} cout<<endl;
}
infile.close ();
outfile.close ();
return 0;
}
```

Performance

Both EKMR and TMR matrices are examined in this section. All results indicate that EKMR algorithms are more efficient than the TMR algorithms.

In previous sections, we have seen that all our parallel algorithms consists of three phases. Compression schemes based on two representation schemes, the amount of work for local computation is almost the same. Two important roles in deciding final performance, are partition and result collection.

In this paper it is noteworthy mentioned that buffer and collection is needed if original data partitioned to processor is not stored In continuous memory.

Partition(Row ,Column and 2-Dimension): in row partition for TMR(3), the first n/P rows in the first plane is assigned to processor 1. Then the second is assigned to processor 2 and so on. There are n planes along the k direction. Hence there are n non-continuous data blocks for processor 1. Same is the case with other processors. Therefore, the total number of non -continuous data segments would be summed to Pn. Since data assigned to each processor is not continuous, a buffer is needed for temporary storage and all the [n.sup.3] elements will be copied to buffer. The structure of EKMR(3) is exactly a 2D matrix in row major storage. Therefore, all elements assigned to each processor are in continuous memory locations. Hence, no buffer is needed and both matrices are 0.

In column partition for TMR(3) , there are n non- continuous data blocks for elements in the first plane assigned to the processor 1. Again, there are n plains along the k direction. Hence, there are Q[n.sup.2] non-continuums blocks for data assigned to each processor. This gives a total of P [n.sup.2] non-continuous data blocks. For EKMR (3), there are non-continuous blocks (along the i-direction) for each processor. Hence, there are Pn continuous and non-continuous data blocks for all processors. Since in both cases data are stored in non-continues blocks, there are [n.sup.3] data elements to be collected.

Assuming that [n.sup.3] elements to be partitioned into PxQ processors. For TMR(3), there are n /P non-continuous blocks assigned to processor 1 in the first plane and n total planes. Hence, the number of non -continuous data blocks is P x Q x n x (n/P), which gives Q [n.sup.2]. Therefore, the number of non-continuous data blocks is Qn. All [n.sup.3] elements need to be collected in both cases, since they are stored in non-continuous memory.

The analytical result for the three-data partition schemes in TMR (3) and EKMR(3) are summarized in the table. It can be seen that our proposed parallel algorithms based on EKMR(3) should perform better than those based on TMR(3), regardless of in or out-of-core environment. Result for TMR (4) and EKMR(4) can be obtained similarly. The matrix table summarize their cost of a matrix with size nxnxnxn.

Results

Performance of our proposed algorithm, we have implemented them on an IBM SP (IBM Scalable POWER) A family of massively parallel (MPP) computer systems from IBM based on its RS/6000 (pSeries) models that incorporate various POWER and PowerPC CPUs. First introduced in 1993, SP configurations support from two to 512 processors. 2 machine presented in [19]. Parallel algorithms are implemented in C+MPI MPI - Message Passing Interface  using SMPS SMPS Society for Marketing Professional Services
SMPS Switching Mode Power Supply
SMPS Switched Mode Power Supply
model. Experiments for parallel algorithms consists of two parts. In the first part we study effect of matrix size, which were run on SP2 with Three nods . In the second part investigate effect of number of processors in the parallel machine, which were run with a fixed matrix size 100x100x100 for three Dimensional matrix size and 30x30x30 for four Dimensional matrix and processors number varying from two to sixteen.

Performance of parallel algorithms in EKMR have presented in [15]. Now study performance of parallel algorithms. In addition to execution time, We use comparison metric called relative performance which is calculated as :

Performance (%) = [Time(TMR_Alg) - Time(EKMR_Alg)/Time(TMR_Alg)] x 100

[FIGURE 8 OMITTED]

[FIGURE 9 OMITTED]

[FIGURE 10 OMITTED]

Performance of algorithm in EKMR(4): To see the how well the EKMR scheme can be extended to higher dimensions , we move our focus on to EKMR(4). We have seen in the table1 that the row partition scheme should perform better then column and 2D. we have implemented matrix operation algorithms based on row partition experiment in order to further gain to inside . as we expected experimental results indicate that algorithm based on EKMR outperforms that based on TMR . In fact the difference 52-60% in execution time and 56- 65% in relative performance for addition/ subtraction subtraction, fundamental operation of arithmetic; the inverse of addition. If a and b are real numbers (see number), then the number ab is that number (called the difference) which when added to b (the subtractor) equals  operation, about 42-52% in execution time and 45-65% in relative performance for matrix--vector performance, about 26-32% in execution time and 30-45% in relative performance for matrix-matrix multiplication.

Conclusion

EKMR is a new structure fore representing a multi dimensional matrix. This structure consists of a tree with 2 D matrices in its leaves. Development of parallel matrix operation algorithm has been done on the basis of EKMR. This has come out to be a better performer than TMR base algorithms. Further work based on this may consist of :

(1) Cache effect study of various algorithms.

(2) Development of compression schemes for sparse matrices in the form of EKMR and TMR.

(3) Application of the proposed algorithm in the research of multi-dimensional data cube For the data mining concept, see, see .

In computer programming contexts, a data cube is a three- (or higher) dimensional array of values, commonly used to describe a time series of image data.
.

(4) Application of other schemes like tensor product (mathematics) tensor product - A function of two vector spaces, U and V, which returns the space of linear maps from V's dual to U.

Tensor product has natural symmetry in interchange of U and V and it produces an associative "multiplication" on vector spaces.
of Strassen's method in the proposed new matrix multiplication algorithm for further improvement in performance.

References

[1] Aart J.C. Bik and Harry A.G. Wiljshoff, "Compilation Techniques for Spare Matrix Computations", In Proceedings of Supercomputing, pages 430-439, 1993.

[2] Aart J.C. Bik, Peter M.W. Knijnenburg, and Harry A.G. Wijshoff, "Reshaping Access Patterns for Generating Sparse Codes," In Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing Solving a problem with multiple computers or computers made up of multiple processors. It is an umbrella term for a variety of architectures, including symmetric multiprocessing (SMP), clusters of SMP systems, massively parallel processors (MPPs) and grid computing. , 1994.

[3] Aart J.C. Bik and Harry A.G. Wiljshoff, "Automatic Data Structure Selection and Transformation for Spare Matrix Computations", Technical Report, 1992.

[4] J.K. Cullum and R.A. Willoughby, "Algorithm for Large Symmetric Eignenvalue Computations", (Birkhauser, Boston, 1985)

[5] B.B. Fraguela, R. Doallo, E.L. Zapata, "Modeling Set Associative as·so·ci·a·tive
1. Of, characterized by, resulting from, or causing association.

2. Mathematics Independent of the grouping of elements.
Caches Behaviour for Irregular Computations," ACM (Association for Computing Machinery, New York, www.acm.org) A membership organization founded in 1947 dedicated to advancing the arts and sciences of information processing. In addition to awards and publications, ACM also maintains special interest groups (SIGs) in the computer field.  International Conference on Measurement and Modeling of Computer systems (SIGMETRICS'98), pp. 192-201, June, 1998.

[6] B.B. Fraguela, R. Doallo, E.L. Zapata, "Cache Misses Prediction for high Performance Sparse Algorithms." 4th International Euro-Par Conference (Euro-Par'98) pp.224-233, September 1998.

[7] B.B. Fraguela, R. Doallo, E.L. Zapata, "Cache Probabilistic Modeling for Basic Sparse Algebra algebra, branch of mathematics concerned with operations on sets of numbers or other elements that are often represented by symbols. Algebra is a generalization of arithmetic and gains much of its power from dealing symbolically with elements and operations (such as  Kernels Involving Matrices with a Non-Uniform Distribution,"24th IEEE (Institute of Electrical and Electronics Engineers, New York, www.ieee.org) A membership organization that includes engineers, scientists and students in electronics and allied fields.  Euromicro Conference, pp 345-348, August 1998.

[8] B.B. Fraguela, R. Doallo, E.L. Zapata, "Automatic Analytical Modeling for the estimation of Cache Misses", International Conference on Parallel Architectures and Compilation Techniques (PACT'99), October 1999.

[9] G.H. Golub and C.F. Van Loan, Matrix Computations, 2nd ed. (John Hopkins Univ. Press, Baltimore, 1989)

[10] Mahmut Kandemir, J. Ramanujam, Alok Choudhary, "Improving Cache Locality by a Combination of Loop and Data Transformations, "IEEE Trans. On Computers Vol. 48, Ao. 2, February,1999.

[11] Vladimir Kotlyar, Keshav Pingali, and Paul Stodghill, "Compiling Parallel Sparse Code for User-Defined Data Structures," In Proceedings of 8th SIAM Conference on Parallel Processing parallel processing, the concurrent or simultaneous execution of two or more parts of a single computer program, at speeds far exceeding those of a conventional computer.  for Scientific Computing, March,1997.

[12] Vladimir Kotlyar, Keshav Pingali, and Paul Stodghill, A Programs", In Euro Par, August 1997.

[13] Vladimir Kotlyar, Keshav Pingali, and Paul Stodghill, A, "Compiling Parallel Code for Sparse Matrix In the mathematical subfield of numerical analysis a sparse matrix is a matrix populated primarily with zeros.

Conceptually, sparsity corresponds to systems which are loosely coupled.
Applications," In Proceedings of the Supercomputing Conference,August,1997.

[14] B. Kumar, C.H. Huang, R.W. Johnson and P. Sadayappan, "A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm with Memory Reduction," In Proceedings of the 7th International Parallel Processing Symposium,1998.

[15] Jen-Shiuh Liu, Jiun-Yuan Lin and Yeh-Ching Chung, "Efficient Representation for Multi-Dimensional Matrix Operation", Proceedings of Workshop on Compiler Techniques for High Performance Computing,pp. 133 144, March, 2000, Taiwan.

[16] W.H. Press, S.A. Teukolsky, W.T. Vetterling and B.P. Flannery, Numerical Recipes, The Art of Scientific Computing, 2nd ed.(Cambridge University Press Cambridge University Press (known colloquially as CUP) is a publisher given a Royal Charter by Henry VIII in 1534, and one of the two privileged presses (the other being Oxford University Press). ,1992).

[17] Peter D. Sulatycke and Kanad Ghosh, "Caching Efficient Multithreaded Fast Multiplication of Sparse Matrices," In Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing The first term used to describe the distribution of multiple computers throughout an organization in contrast to a centralized system. It started with the first minicomputers. Today, distributed processing is called "distributed computing." See also client/server. , 1998.

[18] James B. White James Bain White (June 26, 1835 - October 9, 1897) was a U.S. Representative from Indiana.

Born in Stirlingshire, Scotland, White attended the common schools.Immigrated to the United States in 1854. He settled in Fort Wayne, Indiana.Calico printer.Tailor.
, III and P.Sadayappan," On Improving the Performance of Sparse Matrix-Matrix Vector Multiplication", "IEEE Conference, 1997.

Satya Prakash Singh Prakash Singh VC (1 April 1913, Jammu & Kashmir - 17 February 1945) was an Indian recipient of the Victoria Cross, the highest and most prestigious award for gallantry in the face of the enemy that can be awarded to British and Commonwealth forces. 1, Anil K Ahlawat2 and PC Saxena3

(1) Sr. Lecturer, Computer Science & Engg. Birla Institute of Technology
, Mesra, Noida Campus, India

* Email: spsinghbit@yahoo.co.in, spsingh11in@yahoo.co.in

(2) Professor, Ajay Kumar Garg Engineering College Ajay Kumar Garg Engineering College, Ghaziabad is an Engineering and Management college located in Ghaziabad, a suburb of Delhi, India. The college is affiliated to Uttar Pradesh Technical University, Lucknow, Uttar Pradesh. , Ghaziabad, India

(3) Professor, JNU JNU Jawaharlal Nehru University (India)
JNU Juneau, AK, USA - Juneau (Airport Code)
, New Delhi New Delhi (dĕl`ē), city (1991 pop. 294,149), capital of India and of Delhi state, N central India, on the right bank of the Yamuna River. , India
```Table 1: Cost for partition procedure in three and four-dimensional
matrices.

No. of elements to be collected

TMR(3)      EKMR(3)       TMR(4)      EKMR(4)

Row Partition       [n.sup.3]        0        [n.sup.4]        0
(P processor)

Column Partition    [n.sup.3]    [n.sup.3]    [n.sup.4]    [n.sup.4]
(P processor)

2D Partition        [n.sup.3]    [n.sup.3]    [n.sup.4]    [n.sup.4]
(PxQ processor)

No. of non-continuous blocks

TMR(3)      EKMR(3)       TMR(4)      EKMR(4)

Row Partition           Pn           0        [Pn.sup.2]       0
(P processor)

Column Partition    [Pn.sup.2]       Pn       [Pn.sup.3]       Pn
(P processor)

2D Partition        [Qn.sup.2]       Qn       [Qn.sup.3]       Qn
(PxQ processor)
```
No portion of this article can be reproduced without the express written permission from the copyright holder.