Some P-RAM Algorithms for Sparse Linear Systems

Abstract: PRAM algorithms for symmetric Gaussian elimination are presented. We show the actual testing operations that are performed during symmetric Gaussian elimination and that cause symbolic factorization to occur for sparse linear systems. An array pattern of processing elements (PEs) in row-major order for specialized sparse matrices is formulated. We show that the access function in² + jn + k contains topological properties. We also prove that the cost of storage and the cost of retrieval of a matrix are proportional to each other in polylogarithmic parallel time using a P-RAM with a polynomial number of processors. We use symbolic factorization, which produces a data structure that is used to exploit the sparsity of the triangular factors. In these parallel algorithms, O(log³ n) multiplications/divisions, O(log³ n) additions/subtractions and O(log² n) storage may be achieved.


INTRODUCTION
In this research we explain a method of representing a sparse matrix in parallel using the P-RAM model. The P-RAM is a shared-memory model. According to Gibbons and Rytter [6], the PRAM model will survive as a theoretically convenient model of parallel computation and as a starting point for a methodology. A number of processors work synchronously and communicate through a common random access memory. The processors are indexed by the natural numbers and synchronously execute the same program (through a central main control). Although performing the same instructions, the processors can work on different data (located in different storage locations). Such a model is also called a single-instruction, multiple-data stream (SIMD) model. The symmetric factorization runs in logarithmic time on the hypercube SIMD model [2,10].
To develop these algorithms we use the method of symbolic factorization for sparse symmetric factorization. Symbolic factorization produces a data structure that exploits the sparsity of the triangular factors. We represent an array pattern of processing elements (PEs) for a sparse matrix on a hypercube. We have already developed P-RAM algorithms for linear systems (the dense case) [7,8,9]; those algorithms are implemented on the hypercube architecture.

BACKGROUND
We consider symmetric Gaussian elimination for the solution of the system of linear equations

A x = b
where A is an n×n symmetric, positive definite matrix. Symmetric Gaussian elimination is equivalent to the square-root-free Cholesky method, i.e., first factoring A into the product U^T D U and then forward- and back-solving to obtain x; we often talk interchangeably about these two methods of solving A x = b. Furthermore, since almost the entire cost of the solution lies in the factorization of A, we restrict most of our attention to that portion of the solution process.
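The factor-then-solve route described above can be sketched serially. This is a minimal illustration of the square-root-free Cholesky (U^T D U) approach, not a parallel implementation; the function names are ours.

```python
def utdu_factor(A):
    """Factor a symmetric positive definite matrix A (list of lists)
    as U^T D U, with U unit upper triangular and D diagonal
    (square-root-free Cholesky).  Dense case, serial sketch."""
    n = len(A)
    U = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for k in range(n):
        # d[k] is the pivot that remains after the first k elimination steps
        d[k] = A[k][k] - sum(d[i] * U[i][k] ** 2 for i in range(k))
        for j in range(k + 1, n):
            U[k][j] = (A[k][j] - sum(d[i] * U[i][k] * U[i][j] for i in range(k))) / d[k]
    return U, d

def utdu_solve(A, b):
    """Solve A x = b: forward solve U^T y = b, scale z = y / d,
    then back solve U x = z."""
    U, d = utdu_factor(A)
    n = len(b)
    y = [0.0] * n
    for i in range(n):                    # U^T is unit lower triangular
        y[i] = b[i] - sum(U[j][i] * y[j] for j in range(i))
    z = [y[i] / d[i] for i in range(n)]
    x = [0.0] * n
    for i in reversed(range(n)):          # back substitution with unit U
        x[i] = z[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))
    return x
```

Because U is unit triangular, no square roots are taken; the diagonal D absorbs the pivots, which is exactly what makes the method "square-root free".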
Sherman [13] developed a row-oriented U^T D U factorization algorithm for dense matrices A and considered two modifications of it which take advantage of sparseness in A and U. We also discuss the type of information about A and U which is required to allow sparse symmetric factorization to be implemented efficiently.
Here we present a parallel algorithm for sparse symmetric matrices, to be implemented on a hypercube. We have previously developed a matrix multiplication algorithm on the P-RAM model [7] and an implementation of back-substitution in Gauss elimination on the P-RAM model [8]. We now extend this idea to sparse linear systems of equations.

ANALYSIS OF SYMMETRIC GAUSS ELIMINATION
In this section we suggest the actual testing operations that are performed during symmetric Gaussian elimination. Gaussian elimination is applied to transform the matrix of a linear system to triangular form. This process is performed by creating zeros in the first column, then the second, and so forth. For k = 1, 2, ..., n-1 we use the formulae

a_ij^(k+1) = a_ij^(k) - (a_ik^(k) / a_kk^(k)) a_kj^(k),  i, j = k+1, ..., n,

where a_ij^(1) = a_ij, i, j = 1, 2, ..., n. The only assumption required is that the inequalities a_kk^(k) ≠ 0, k = 1, 2, ..., n hold; these entries are called the pivots of Gaussian elimination. It is convenient to write A^(k), b^(k) for the system obtained after (k-1) steps, k = 1, 2, ..., n, with A^(1) = A and b^(1) = b. The final matrix A^(n) is an upper triangular matrix [3].
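The elimination formulae can be sketched serially as follows; a minimal illustration assuming every pivot is nonzero (function name ours).

```python
def gaussian_elimination(A):
    """Serial sketch of the elimination formulae: for each pivot
    column k, apply a_ij <- a_ij - (a_ik / a_kk) * a_kj to the rows
    below, assuming every pivot a_kk is nonzero."""
    A = [row[:] for row in A]             # work on a copy of the matrix
    n = len(A)
    for k in range(n - 1):
        assert A[k][k] != 0, "pivot must be nonzero"
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]         # multiplier for row i
            for j in range(k, n):
                A[i][j] -= m * A[k][j]    # creates a zero in column k of row i
    return A
```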

In Gauss elimination on a symmetric matrix, a nonzero in position (j,k) implies that there is also a nonzero in position (k,j), which causes fill to occur in position (i,j) when row k is used to eliminate the element in position (i,k) [4].
Here we present the following analysis when the matrix is symmetric and positive definite.
When the quantity in Eq. 3 is less than zero, then exactly one of the elements a_ik^(k) or a_jk^(k) is negative before elimination. When the quantity in Eq. 3 is greater than zero, then a_ik^(k) and a_jk^(k) are either both negative or both positive, and the pivot element a_kk^(k) is always positive. When the quantity in Eq. 3 is equal to zero, no elimination takes place, and one of the following conditions is satisfied: (a) a_ik^(k) = 0, (b) a_jk^(k) = 0, or both (a) and (b) hold.
When i = j in the above explanation of Eq. 3, in the first case the pivot element would be eliminated; in the second case the element to be eliminated is greater than zero; and in the third case elimination does not take place.
We suggested testing operations for the variations occurring in the elements during symmetric Gaussian elimination. There are more testing operations than arithmetic operations, so the running time of the algorithm can be proportional to the amount of testing rather than the amount of arithmetic; this is what causes symbolic factorization to occur. To avoid this problem, Sherman precomputed the sets ra_k, ru_k, cu_k and showed that, in an implementation of this scheme, the total storage required is proportional to the number of nonzeros in A and U and the total running time is proportional to the number of arithmetic operations on nonzeros. In addition, the preprocessing required to compute the sets {ra_k}, {ru_k} and {cu_k} can be performed in time proportional to the total storage. To see how to avoid these problems, let us assume that for each k, 1 ≤ k ≤ N, we have precomputed:
• the set ra_k of columns j ≥ k for which a_kj ≠ 0;
• the set ru_k of columns j > k for which u_kj ≠ 0;
• the set cu_k of rows i < k for which u_ik ≠ 0.

For k ← 1 to N do
    For j ∈ ru_k do ...
    For j ∈ {n ∈ ra_k : n > k} do ...
    For i ∈ cu_k do
        For j ∈ {n ∈ ru_i : n > k} do ...

Algorithm 1
In Algorithm 1 the only entries of M which are used are those corresponding to nonzeros in A or U. For an implementation of Algorithm 1 it has already been shown that the total storage required is proportional to the number of nonzeros in A and U and that the total running time is proportional to the number of arithmetic operations on nonzeros. In addition, A. H. Sherman [13] showed that the pre-processing required to compute the sets {ra_k}, {ru_k}, {cu_k} can be performed in time proportional to the total storage.
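A serial sketch of a row-oriented sparse U^T D U factorization that touches only nonzeros, assuming the sets ra_k and cu_k are precomputed as above. The dict-of-dicts layout and the names are illustrative, not Sherman's exact data structure.

```python
def sparse_utdu(A, ra, cu, n):
    """Row-oriented sparse U^T D U sketch.  A is the upper triangle as a
    dict of dicts (A[k][j] for j in ra[k], j >= k); ra[k] lists the
    nonzero columns j >= k of row k of A; cu[k] lists the rows i < k
    with u_ik != 0.  Only nonzero entries are ever visited."""
    U = {k: {} for k in range(n)}         # strict upper triangle of U
    d = [0.0] * n                         # diagonal of D
    for k in range(n):
        m = {j: A[k][j] for j in ra[k]}   # scatter row k of A into a work vector
        for i in cu[k]:                   # rows i < k that update row k
            mult = d[i] * U[i][k]
            for j in U[i]:                # j ranges over ru_i
                if j >= k:
                    m[j] = m.get(j, 0.0) - mult * U[i][j]
        d[k] = m.pop(k)                   # the pivot
        for j in m:                       # remaining columns form ru_k
            U[k][j] = m[j] / d[k]
    return U, d
```

Note that `m.get(j, 0.0)` is where fill-in appears: a position absent from row k of A can still receive an update from an earlier row.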

SYMBOLIC FACTORIZATION
The sets {ra_k}, which describe the structure of A, are input parameters, and the symbolic factorization algorithm computes the sets {ru_k} from them. The sets {cu_k} can in turn be computed from the sets {ru_k}. At the k-th step of the symbolic factorization algorithm, ru_k is computed from ra_k and the sets ru_i for i < k. An examination of Algorithm 1 shows that for j > k, u_kj ≠ 0 if and only if either (i) a_kj ≠ 0 or (ii) u_ij ≠ 0 for some i ∈ cu_k. Thus, letting ru_k^i = {j ∈ ru_i : j > k}, we have j ∈ ru_k if and only if either (iii) j ∈ ra_k or (iv) j ∈ ru_k^i for some i ∈ cu_k. Algorithms 2 and 3 are symbolic factorization algorithms based directly on Algorithm 1. At the k-th step, ru_k is formed by combining ra_k with the sets {ru_k^i} for i ∈ cu_k. However, it is not necessary to examine ru_k^i for all rows i ∈ cu_k. Let l_k be the set of rows i ∈ cu_k for which k is the minimum column index in ru_i. Then we have the following result, which expresses a type of transitivity condition for the fill-in in symmetric Gaussian elimination.
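The rules (iii) and (iv), together with the l_k refinement, can be sketched as follows. This is a serial illustration with 0-based indices; the names are ours.

```python
def symbolic_factorization(ra, n):
    """Compute ru[k] (the structure of row k of U) from ra[k],
    merging only the rows i in l_k, i.e. rows whose smallest column
    index in ru[i] equals k (the transitivity refinement)."""
    ru = {}
    l = {k: [] for k in range(n)}            # l[k]: rows first active at column k
    for k in range(n):
        s = {j for j in ra[k] if j > k}      # rule (iii): structure of A
        for i in l[k]:
            s |= {j for j in ru[i] if j > k} # rule (iv): fill contributed by row i
        ru[k] = sorted(s)
        if ru[k]:
            l[ru[k][0]].append(k)            # row k first updates row min(ru_k)
    return ru
```

In the test below, a_12 = 0 but u_01 ≠ 0 and u_02 ≠ 0, so position (1,2) fills in, and the algorithm reports it without any numerical work.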
There are some important reasons why it is desirable to perform such a symbolic factorization.
• Since a symbolic factorization produces a data structure that exploits the sparsity of the triangular factors, the numerical decomposition can be performed using a static storage scheme. There is no need to perform storage allocation for the fill-in during the numerical computations. This reduces both the storage and execution-time overheads of the numerical phase, and the cost of the symbolic factorization itself is reasonably light [4,5].
For k ← 1 to N-1 do
    Comment: Form ru_k by set unions
    For i ∈ cu_k do
        For j ∈ ru_k^i do
            if j ∉ ru_k then ...

Algorithm 2

For k ← 1 to N-1 do
    Comment: Form ru_k by set unions
    For i ∈ l_k do
        For j ∈ ru_k^i do ...

Algorithm 3

PARALLEL MATRIX ALGORITHM
By an efficient parallel algorithm we mean one that takes polylogarithmic time using a polynomial number of processors. In practical terms, at most a polynomial number of processors is reckoned to be feasible [6]. A polylogarithmic-time algorithm takes O(log^k n) parallel time for some constant integer k, where n is the problem size. Problems which can be solved within these constraints are universally regarded as having efficient parallel solutions and are said to belong to the class NC (Nick Pippenger's class).

Representation of Array Pattern of Processing Elements (P.Es.):
Consider the case of a three-dimensional array pattern with n³ = 2^(3q) processing elements (PEs).
Conceptually these PEs may be regarded as arranged in an n×n×n array pattern. If we assume that the PEs are in row-major order, the PE in position (i,j,k) of this array has index in² + jn + k (note that array indices are in the range [0, n-1]). Hence, if r_(3q-1), ..., r_0 is the binary representation of the index of PE (i,j,k), then i = r_(3q-1) ... r_(2q), j = r_(2q-1) ... r_q and k = r_(q-1) ... r_0. Using A(i,j,k), B(i,j,k) and C(i,j,k) to represent memory locations in PE (i,j,k), we can describe the initial condition for matrix multiplication as A(0,j,k) = A_jk, B(0,j,k) = B_jk, 0 ≤ j < n, 0 ≤ k < n, where A_jk and B_jk are the elements of the two matrices to be multiplied. The desired final configuration is C(0,j,k) = C_jk, 0 ≤ j < n, 0 ≤ k < n, where C_jk = Σ_l A_jl B_lk (4). The algorithm computes the product matrix C by directly making use of (4). It has three distinct phases. In the first, the elements of A and B are distributed over the n³ PEs so that we have A(l,j,k) = A_jl and B(l,j,k) = B_lk. In the second phase the products C(l,j,k) = A(l,j,k) × B(l,j,k) = A_jl B_lk are computed. In the third phase these products are summed over l to form C_jk. The details are spelled out in Dekel, Nassimi and Sahni [2]. In this procedure all PE references are by PE index (recall that the index of PE (i,j,k) is in² + jn + k). The key to the algorithm of Dekel, Nassimi and Sahni [2] is the data-routing strategy: 5q = 5 log n routing steps are sufficient to broadcast the initial values through the processor array and to combine the results.
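The index arithmetic above can be illustrated directly. The helper names are ours, and the product routine is only a serial emulation of the three phases, not the O(log n) cube routing.

```python
def pe_index(i, j, k, n):
    """Row-major index of PE (i, j, k) in an n x n x n array:
    i*n^2 + j*n + k.  With n = 2**q this is the concatenation of the
    q-bit binary representations of i, j and k."""
    return i * n * n + j * n + k

def pe_position(idx, n):
    """Inverse mapping: recover (i, j, k) from the PE index."""
    return idx // (n * n), (idx // n) % n, idx % n

def dns_multiply(A, B, n):
    """Serial emulation of the Dekel-Nassimi-Sahni scheme: PE (l, j, k)
    holds the product A[j][l] * B[l][k]; summing over l (done by a
    log n fan-in on the cube) yields C[j][k]."""
    C = [[0] * n for _ in range(n)]
    for l in range(n):
        for j in range(n):
            for k in range(n):
                C[j][k] += A[j][l] * B[l][k]
    return C
```

For n = 4 (q = 2), PE (1, 2, 3) has index 1·16 + 2·4 + 3 = 27, whose binary form 01 10 11 is exactly the bit fields of i, j and k.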
The array pattern of processing elements (PE) in row-major order for the specialized sparse matrix (symmetric) [12] can be formulated in the following manner.
• Representation of a lower-triangular matrix:
Index of (a_ij) = total number of elements in the first i-1 rows + number of elements up to the j-th column in the i-th row = i(i-1)/2 + j  (1 ≤ j ≤ i ≤ n)

• Representation of an upper-triangular matrix:
Index of (a_ij) = number of elements up to the element a_ij = (i-1)×(n - i/2) + j  (1 ≤ i ≤ j ≤ n)
• Representation of a diagonal matrix: In sparse matrices having elements only on the diagonal, the following points are evident. The number of elements in an n×n square diagonal matrix is n, and any element a_ij can be referred to as a processing element using the formula Address(a_ij) = i (= j).
• Representation of a tri-diagonal matrix:
Index of (a_ij) = total number of elements in the first (i-1) rows + number of elements up to the j-th column in the i-th row = 2 + 2×(i-2) + j  (|i-j| ≤ 1, 1 ≤ i, j ≤ n)
• Representation of an αβ-band matrix:
Case 1: 1 ≤ i ≤ β:
Index of (a_ij) = number of elements in the first (i-1) rows + number of elements in the i-th row up to the j-th column = α×(i-1) + ((i-1)(i-2))/2 + j
Case 2: β < i ≤ n-α+1:
Index of (a_ij) = number of elements in the first β rows + number of elements between the (β+1)-th and (i-1)-th rows + number of elements in the i-th row up to the j-th column = αβ + (β(β-1))/2 + (α+β-1)(i-β-1) + j - i + β
Case 3: n-α+1 < i:
Index of (a_ij) = number of elements in the first (n-α+1) rows + number of elements after the (n-α+1)-th row and up to the (i-1)-th row + number of elements in the i-th row up to the j-th column = αβ + (β(β-1))/2 + (α+β-1)(n-α-β+1) + (α+β)(i-n+α-1) - ((i-n+α-1)(i-n+α-2))/2 + 1
The representation of the array pattern of processing elements (PEs) for a lower-triangular matrix is presented on a hypercube model. Hypercubes are loosely coupled parallel processors based on the binary n-cube network. An n-cube parallel processor consists of 2^n identical processors, each provided with its own sizable memory and interconnected with n neighbors [1,14,15]. This architecture consists of a large number of identical processors interconnected to one another according to some convenient pattern. In a shared-memory system, processors operate on data from the common memory: each processor reads the data it needs, performs some processing and writes the results back in memory.
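The triangular and tridiagonal mappings above can be checked with small helpers (1-based indices; for the lower-triangular case we take the count of the first i-1 rows to be i(i-1)/2).

```python
def lower_tri_index(i, j):
    """1-based row-major index of a_ij in a packed lower-triangular
    matrix (1 <= j <= i): elements in the first i-1 rows, i(i-1)/2,
    plus the offset j of the column within row i."""
    return (i - 1) * i // 2 + j

def upper_tri_index(i, j, n):
    """1-based index of a_ij in a packed upper-triangular matrix
    (1 <= i <= j <= n): (i-1)(n - i/2) + j, written in integer form."""
    return (i - 1) * (2 * n - i) // 2 + j

def tridiag_index(i, j):
    """1-based index of a_ij in a packed tridiagonal matrix
    (|i - j| <= 1): 2 + 2(i-2) + j, i.e. 2(i-1) + j."""
    return 2 * (i - 1) + j
```

For a 3×3 tridiagonal matrix the packed order is a11, a12, a21, a22, a23, a32, a33, and `tridiag_index(2, 3)` correctly gives position 5.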
In a distributed-memory system, inter-processor communication is achieved by message passing and computation is data-driven (although some designs incorporate a global bus, this does not constitute the main means of intercommunication). By message passing it is meant that data, or possibly code, are transferred from processor A to processor B by traveling across a sequence of nearest-neighbor nodes starting with node A and ending with node B; synchronization is driven by data in the sense that computation in a node is performed only when its necessary data are available. The main advantage of such architectures, often referred to as ensemble architectures, is the simplicity of their design. The nodes are identical, or of a few different kinds, and can therefore be fabricated at relatively low cost. The model can easily be made fault tolerant by shutting down failing nodes.
The most important advantage of this class of design is the ability to exploit particular topologies of problems or algorithms in order to minimize communication costs. A hypercube is a multidimensional mesh of nodes with exactly two nodes in each dimension. A d-dimensional hypercube consists of K nodes, where K = 2^d.
• A hypercube has n spatial dimensions, where n can be any positive integer (including zero) [1,15].
• A hypercube has 2^n vertices [11].
• There are n connections (lines) that meet at each vertex of a hypercube [11].
• All connections at a hypercube vertex meet at right angles with respect to each other [1,14].
• The hypercube can be constructed recursively from lower-dimensional cubes.
• In an architecture where the degree and the diameter of the graph are the same, a good balance is achieved between communication speed and the complexity of the network topology. Hypercubes achieve this equality, which explains why they are one of today's most popular designs (e.g. the iPSC of Intel Corp., the T-series of FPS, the nCUBE, the Connection Machine of Thinking Machines Corp.) [11,15].
When the lower-triangular matrix is presented in three dimensions, the PEs are indexed in the following manner.
Lemma: The access function in² + jn + k contains topological properties.
Proof: This access function is a polynomial on a 3-dimensional discrete space, where i, j, k are the 3 dimensions and n is fixed. It gives a relationship among the processing elements (i.e., there are 2³ connections that meet at each vertex of a hypercube), which means that the algorithm can be evaluated in polylogarithmic time using the polynomial in² + jn + k on the three-dimensional discrete space. We can easily construct a hypercube successively from lower-dimensional cubes by using the polynomial in² + jn + k. A discrete space is a topological space in which all points are isolated. We conclude that the access function by which we map matrix elements is piecewise continuous. It is shown in the implementation that, because of the Hamming distance between the processors, the hypercube is modeled as a discrete space with discrete time. This access function is also used to map a three-dimensional matrix into RAM sequentially.
Lemma 10: The cost of storage and the cost of retrieval of a matrix are proportional to each other in polylogarithmic parallel time using a P-RAM with a polynomial number of processors.
Proof: For the storage and retrieval of a matrix we use parallel iteration. Parallel iteration has the property of converging in log n steps in parallel; it converges geometrically, i.e., at the rate of a geometric progression, and therefore the two costs are proportional to each other for a single value. From the above facts we can write: cost of retrieval = k × cost of storage (where k is a constant). If k ≥ 1 then the matrix is dense, and if k < 1 then the matrix is sparse.
Here we present a PRAM-CREW algorithm.
Since we have developed parallel algorithms for sparse linear systems, it is not required to discuss storage for sequential algorithms; nevertheless, the storage requirements are discussed briefly here. The storage schemes used for sparse matrices consist of two parts: primary storage, used to hold the numerical values, and overhead storage, used for pointers, subscripts and other information needed to record the structure of the matrix. The data structures involved in the different strategies may be compared. The elements of the upper triangle (excluding the diagonal) are stored row by row in a single array, with a parallel array holding their column subscripts. A third array indicates the position of each row and a fourth array contains the diagonal elements. As the sophistication of the storage scheme increases, exploiting more and more zeros, the primary storage decreases, but the overhead usually increases. There is usually a point where it pays to ignore some zeros, because the overhead storage required to exploit them far exceeds the decrease in primary storage [5].
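The multi-array scheme just described can be sketched as follows; a minimal illustration with the diagonal kept in a separate array (our choice for the fourth array), and names that are ours.

```python
def to_upper_storage(A, n, tol=0.0):
    """Sketch of the sparse storage scheme described above: the strict
    upper triangle stored row by row in one array (primary storage),
    a parallel array of column subscripts, and a row-pointer array
    (overhead storage); the diagonal is held in its own array."""
    values, cols, row_ptr = [], [], [0]
    diag = []
    for i in range(n):
        diag.append(A[i][i])              # diagonal stored separately
        for j in range(i + 1, n):
            if abs(A[i][j]) > tol:        # primary storage: nonzeros only
                values.append(A[i][j])
                cols.append(j)            # overhead: column subscripts
        row_ptr.append(len(values))       # overhead: where each row starts/ends
    return values, cols, row_ptr, diag
```

The trade-off in the text is visible here: tightening `tol` to drop more near-zeros shrinks `values` (primary storage) but leaves `row_ptr` untouched and may not shrink `cols` enough to pay for more elaborate bookkeeping.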

CONCLUSION
The above discussion characterizes the behavior of the values of the elements of the matrix during symmetric Gaussian elimination. The testing operations that arise during elimination motivate the use of symbolic factorization, which produces a data structure exploiting the sparsity of the triangular factors. The array pattern of processing elements in row-major order for specialized sparse matrices, together with the access function in² + jn + k, allows the cost of storage and the cost of retrieval of a matrix to remain proportional to each other in polylogarithmic parallel time on a P-RAM with a polynomial number of processors. In these parallel algorithms, O(log³ n) multiplications/divisions, O(log³ n) additions/subtractions and O(log² n) storage may be achieved.