# Biclustering gene expression dataset using enhanced barycenter crossing minimization.

1. INTRODUCTIONDNA is the secret of life. It is the blue print for making new cells or even new organism. It contains the genetic information that move from generation to generation in what we called inheritance.

DNA contains the genes that carry the phenotypes that determine how the organism looks a like or the physical properties of the organism, the hair is black, and the eye is blue, and so on. Every gene can carry phenotype such as the color of eye. The phenotype property can be determined by two, or more genes not just one gene [1]

In humans, genes are arranged into twenty three pairs of chromosomes, or forty six chromosomes. Twenty three chromosomes come from the father, and the other twenty three come from the mother, so the child inherits its properties from the two parents. Everybody have two copies from each gene one from father and one from mother [1].

DNA takes a form of helix of double strands. Each strand connects to the other with a hydrogen bond. Each strand is a long chain of nucleotides; every nucleotide is made up of three components: base, deoxyribose sugar, and phosphate. These nucleotides come up in pairs to make the DNA.

DNA consists of thousands of just four bases, namely Adenine (A), Guanine (G), Thymine (T), and Cytosine (C). Each nucleotide connects to another one with phosphodiester bond. Phosphate and sugar work as the backbone of the DNA like stair and the bases like the steps of the stair.

RNA is more like DNA, it have some different changes than DNA. RNA is a chain of nucleotides, these nucleotides composed of ribose sugar instead of deoxyribose in DNA, phosphate, and base. RNA contains thousands of just four bases but with Uracil in place of Thymine.

Proteins are the basic material that makes anybody. It is responsible for all functions that occur in the body, body structure, and all the physical characteristic of the body. Proteins are the most important molecule in the body, because any mutations happened in the proteins may lead to the protein losing its functionality, and hence its role in the body, and hence diseases begin. Proteins are long chains of Amino Acids (AA). There are only twenty amino acids that made up all the proteins of the body

Gene expression is the process of converting genes into proteins in two basic steps: DNA is converted into mRNA in a process called transcription, and then mRNA converted to proteins in a process called translation

DNA microarray is a technology that it is used to measures the amount of gene expression in the cell. It is used in the study of behavior of genes under certain conditions, or at different time frames. This technology enables scientists to study what happens to genes in different disease conditions [2].

The result of microarray is a gene expression dataset which is a tow dimensional array--data table--where every row represents one gene and every column represents one condition or sample. Every cell in this table represent the expression level [L.sub.ij] of gene i in condition j, as in table 1.

There is many Data mining techniques used for analyzing gene expression datasets in order to extract information and finding relations between these data such as clustering and classification.

Clustering is used to assign data to groups, or classes like classification, but the difference is that in classification the classes are predefined, but in clustering the classes are not determined before applying the algorithm. Clustering algorithms can be applied on gene expression datasets in order to group genes that have the same expression together and this is helping in the study of diseases such as cancer. There are three ways of clustering:

1. Gene-based clustering in this approach genes are treated as objects and samples are treated as features.

2. Sample-based clustering is the opposite of gene-based clustering, samples are treated as objects and genes are treated as features.

3. Subspace clustering this technique combine the previous two approaches the genes and samples can be treated as objects and features so that gene may be object or features and samples are treated the same.

Biclustering or subspace clustering is the most efficient technique in clustering because clustering of genes depending on conditions or vice versa, and this is not what happens exactly in the real life, a subset of genes may have similar behavior on subset of conditions, and here is the Biclustering come to solve this problem.

Gene expression datasets can be represented in bipartite graph where:

1. Genes are represented by the top layer.

2. Conditions are represented by the bottom layer.

3. Edges that connect the genes in the top layer with its corresponding conditions in the bottom layer.

The edge weigh [w.sub.ij] is the expression value of gene i in condition j, where [w.sub.ij] = 0 if there is no edge between node i and j, and this means that the gene i does not express in condition j.

We can represent data in table 1 in a bipartite graph, as in figure 1, where top layer represents top conditions and bottom layer represents genes, then this Bipartite graph can be clustered directly, but the results will be less accurate, and in order to overcome this problem there is one more step before biclustering, which is crossing minimization. Crossing minimization aims to minimizing the number of crossings between lines that connect the two layers, the graph is reordered, and related nodes grouped together, as in figure 2, and this enhances the results of biclustering process.

2. RELATED WORK

Biclustering algorithms can be classified in two main categories: graphical and non-graphical biclustering. We focus our discussions on graphical biclustering which is the main interest point of our research for general discussion on biclustering algorithms can be found in [3, 4]. There is another good reference that focuses on Barycenter crossing minimization which the main goal of our work is to enhance Barycenter algorithm [5].

2.1 CHENG AND CHURCH

Cheng and Church [6] define bicluster as a submatrix for which the mean square residue H(I, J) of each bicluster is below than a predefined threshold. The algorithm is run in two main phases: first, looping through the gene expression matrix removing rows and columns until H(I, J) is less than the threshold; second, looping through the deleted rows and columns and adding to bicluster as long as H(I, J) is less than threshold. The H(I, J) of an element [a.sub.ij] in a submatrix [A.sub.IJ]

H(l'J) = 1/[absolute value of I] [absolute value of J] [[summation].sub.i [member of] I, j [member of]J] [([a.sub.ij] - [a.sub.iJ] - [a.sub.Ij] + [a.sub.Jj]).sup.2] (1)

Where

Sub Row Average

[a.sub.iJ] = 1/[absolute value of J] [summation over (j[member of]J)] [a.sub.ij] (2)

Sub Column Average

[a.sub.ij] = 1/[absolute value of I] [summation over (i[member of]I)] [a.sub.ij] (3)

16/

Sub Matrix Average

1/[absolute value of I] [absolute value of J] [summation over (i[member of]I, j[member of]J)] [a.sub.ij] (4)

2.2 SPHIER

SPHier [7] extract clusters from dataset using bipartite crossing minimization biclustering techniques but instead of serial Bigraph crossing minimization using parallel algorithm in order to increase the performance of the algorithm beside the enhancement made on Cheng and Church biclustering algorithm to enable the local search for clusters instead of global search because after the Bigraph reordering the related values arranged together in blocks and global search is useless.

2.3 TABU SEARCH

Ibrahim [8] using a Tabu search, which is a metaheuristics technique depending on guide local search technique to find global optimum solution. He adapts the Tabu search in Marti [9] to solve the crossing minimization problem and reach to better order of dataset. Tabu search run in two main steps: first, construction of initial solution and second, iterative improvement of this solution until reaching the optimum solution. Iterative improvement has two steps: Intensification and Diversification and each step of them has three steps: normal, influential, and opposite and the algorithm moves between these three steps as it insert or remove nodes depending on their Barycenter and if this move reduces the number of crossings in the graph and if there is no enhancement it stops after a specified number of iteration this process continue running in this way until reaching the optimum solution.

2.4 BIMAX

BiMax Algorithm was proposed by Prelic et al [10]; the main idea behind this algorithm is to transforming the gene expression dataset inputs into binary dataset (or 0's and 1's) by discretization process, where 0's means no change in expression level and 1,s from change in expression level.

A bicluster corresponds to a subset of genes that jointly respond across subset of samples. The algorithm uses divide-and-conquer mechanism. It add row by row to the cluster, when it adds a row the columns set is partitioned into [C.sub.U]--columns in which the added row has ones, and [C.sub.V]--columns in which the added row has ones.

The row set is split into [G.sub.U]-the rows that have only ones in [C.sub.U], [G.sub.V]--the rows that have ones in CV only, and [G.sub.W]--the rows that have ones in both. If U and V do not share any rows and columns of the matrix E, i.e. GW is empty; the two matrices can be processed independently from each other. If U and V have a set GW of rows in common as shown in Figure 3.6, special care is necessary to only generate those biclusters in V, that share at least one common column with CV.

2.1 SAMBA

SAMBA [11] stands for statistical algorithmic for bicluster analysis. The main goal of this algorithm is to make a fast biclustering method that would produce statistically significant results.

It defines a bicluster as a subset of genes that jointly respond across a subset of conditions. The gene is considered as normal if its expression level responding significantly in these conditions.

SMABA algorithm represents the gene expression dataset as bipartite graph where genes represent one layer of nodes, and conditions represents the other layer. The nodes in tow layers are connected with edges. The edges represent the weight (or expression) level of each gene under each condition. The weight of each subgraph (or bicluster) is the sum of the weights of gene-condition pairs in it, including edges and nonedges.

Hash table is created for the bipartite graph using the weights of bipartite graph and a probabilistic function. This hash table is used to find heavy subgraph around each vertex using local search depending on scoring function, and improves this cluster until no score improvements are available.

2.5 DISCUSSION

Cheng and Church was one of the leading algorithms of biclustering algorithms, but this algorithm using global search heuristics to find biclusters, and this technique consumes more time in running the algorithm beside the effect of global search that it gives results that it is not so accurate and efficient.

SPHeir algorithm uses the same technique behind Cheng and Church but using the bipartite graph along with crossing minimization techniques for solving the problem besides using local search heuristics and that overcomes the problem of global search heuristics, hence it gives more efficient and accurate results.

Tabu search algorithm used the same idea as SPHeir algorithm, as it used bipartite graph, crossing minimization, and local search heuristics, but the difference in this algorithm is the local search heuristics for finding the biclusters. This algorithm achieves better, more accurate and efficient results, but this algorithm takes more time for solving the problem, but as the processing power of computers increases rapidly, so we think the time is not factor as efficiency, because at the end the target of biclustering is to get accurate results.

SAMBA algorithm is one of famous algorithms that achieves good statistical results, runs in a good time, it also used the methodology of bipartite graph, and used the weight of edges and non-edges along with statically functions to give weight for each subgraph, so it differs from the previous the other algorithms that we discussed, and which using bipartite graph in that it used the weight of edges and non-edges in calculations, but other algorithm depend on edges only in the calculations, and also depend on local search heuristics to find biclusters.

BiMax algorithm looks like Cheng and Church in that the two algorithms do not use graph representation in solving biclustering problem, BiMax algorithm depended on transforming the dataset set into binary dataset in a process called discretization, and then work on this dataset by divide-and-conquer methodology by partitioning the dataset to rows and columns, and add them to clusters depending on the property they share, the idea of discretization used in other algorithms such as Tabu search algorithm to minimize the time taken in crossing minimization process, this technique decreases the running time.

We can conclude from this discussion that the most significant algorithms of the algorithms we discussed is Tabu search, and SMABA algorithm, because they achieve the best results, and used different valuable processing techniques.

3. PROPOSED ALGORITHM

There are many techniques used for Biclustering one of these techniques is constructing a bipartite graph from gene expression dataset, reordering nodes in this graph using crossing minimization technique then apply biclustering algorithm on the reordered bipartite graph. Most of the algorithms used to crossing minimization in bipartite graph divided into three basic steps: the first step is the construction of bipartite graph, the second step is the initial ordering, and the last step is iterative improvement.

we calculate the rank of each node in layer 0 using Barycenter heuristics, as in equation 5, then reordering these nodes depending on this rank while keeping the layer 1 static then reordering layer 0 by calculating the rank of each node in layer 1 using equation 5, and reordering nodes while keeping the layer 0 static, and we will continue iteratively in this manner until no more changes occur in the order of nodes in the two layers.

Bary Center = [[summation].sub.j [member of] Ni] [W.sub.i,j] x [r.sub.j]/[[summation].sub.j[member of]Ni] [W.sub.i,j] (5)

Where:

"Let [v.sub.i] represent the ith node in the non-static layer, and set Ni represents the set of neighbors of [v.sub.i], also let [r.sub.j] represent the rank of jth member of the set [N.sub.i] "[12].

We success to solve the problem of crossing minimization in a way that achieves better results in crossings number rather than Barycenter heuristics [13]. We use the same way of Barycenter heuristics used in [13] to minimize crossings in bipartite graph, but we make some changes to the calculations of weight of each node and using this weight as a new rank for reordering nodes. We calculate the Barycenter of each node as in algorithm 2, then calculating the new weight by adding the rank of each node to the sum of the ranks of its neighbors as in algorithm 3, and taking the position of each node into consideration of our calculations as in algorithm 4, then using this rank for reordering the nodes.

The algorithm of biclustering has three main basic

steps:

--Construct bipartite graph

--Bipartite graph crossing minimization as in algorithm 1

--Applying biclustering algorithm on bipartite graph as in algorithm 5

Algorithm 1: Bipartite Crossing Minimization 1. Construct Bipartite graph BG 2. Compute the Barycenter for each node in the two layers, as in algorithm 2 3. Compute the new weight for each node in the two layers, as in algorithm 3 4. Reorder nodes in the two layers 5. Compute crossing minimization for bipartite graph 6. Repeat steps from 2 to 5 until the number of crossing of the current bipartite graph is larger than the number of crossings of previous bipartite graph from last iteration Algorithm 2: Computing the Barycenter for each node in the two layers For all i such that [v.sub.i] is a vertex in top or bottom layer Compute weighted mean for [v.sub.i] IF [Node.sub.i] x Rank != WeightedMean Then [Node.sub.i] x Rank = WeightedMean PositionChanged = PositionChanged + 1 End if End for Algorithm 3: Compute the new weight for each node in the two layers FOR ALL i such that [v.sub.i] is a vertex in top - or bottom - layer vi.NewRank = [absolute value of [v.sub.i] x Rank] + get [v.sub.i] neighbors Rank as in algorithm 4 END FOR Algorithm 4: Get neighbors Rank of vertex FOR Each neighbors j of vertex i Sum + = ([absolute value of [v.sub.i] x Rank] - [[neighbor.sub.j] x Rank]) * ([absolute value of [v.sub.i] x Position] - [[neighbor.sub.j] x Position]) END FOR

After the process of crossing minimization we perform the Biclustering process as in algorithm 5, this algorithm find biclusters by performing local search because the fact that the crossing minimization process groups related rows and related columns together so the algorithm does not need global search to find the biclusters.

The algorithm uses mean square residue score MSR --defined in equation 1 where MSR of each biclusters is less than a predefined threshold [delta]

Algorithm 5: Biclustering Algorithm Comment: Identify Biclusters in Reordered Matrix representation of Bigraph BG Comment: Mean Squared Residue Score of Each Bicluster is less than [delta] Input: Bipartite graph after crossing minimization Output : Biclusters * startRow = 0; * startCol = 0; * Index = 0; * FOR i = 0 TO numCols DO * colSum[i] = 0; * blockSum[i] = 0; * Residue[i] = 0; * FOR r = startRow TO numRows DO * rowSum = 0; * FOR c = startCol TO numCols DO ** colSum[c]=colSum[c]+Matrix--BG[r][c]; ** rowSum = rowSum + Matrix BG[r][c]; ** blockSum[c]=blockSum[c-1]+colSum[c]; ** colCount = c - startCol + 1; ** rowCount = r - startRow + 1; ** blockCount = rowCount * colCount; ** rowMean = rowSum / colCount; ** colMean = colSum[c] / rowCount; ** blockMean = blockSum[c]/blockCount; ** Res = Matrix G[r][c] - rowMean colMean + blockMean ** Residue[c] = Residue[c-1] + Res*Res; ** msr = Residue[c]/ blockCount; ** IF msr > [delta] THEN --Blocked = true; --stopCol = c; --BREAK; * IF Blocked = TRUE THEN * Biclusterbix = Rows [startRow TO r], Columns[startCol TO stopCol - 1]; * Biclusters [Index] = bix; * Index = Index + 1; * startRow = r; * r = r - 1; * startCol = stopCol;

4. EXPERIMENTAL RESULTS

The proposed algorithm is written in C-sharp language and the experiments are running on i5 2.3 GHz processor and 4 GB of RAM, and under windows 7 operating system. We apply the algorithm on cancer gene expression datasets: lymphoma dataset [14], gastric cancer dataset [15], and AML prognosis dataset [16] to test the accuracy of the proposed algorithm relative to Barycenter used in SPHier algorithm.

We apply the proposed algorithm and Barycenter algorithm on the three cancer datasets, and compare the results of the two algorithms according the accuracy of crossing minimization.

After applying the proposed algorithm, and Barycenter algorithm of crossing minimization on the cancer datasets we get the results in table 2, that it is represented in figure 3.

In this experiment we are focusing mainly on the accuracy of the algorithm, because the most important thing in the gene expression datasets analysis is the accuracy of results more the time the algorithm takes, because of the revolution in computational power of processors solve many of problems of computations time.

Figure 3 shows enhancements that our algorithm made in the Barycenter algorithm, the proposed algorithm achieves better results especially in large size gene expression datasets with a small number of conditions.

Because of the decrease in crossings number of bipartite graph leads to grouping related node together, the enhancement in crossings number increases the efficiency of biclustering algorithm [7].

5. CONCLUSIONS AND FUTURE WORK

Many of researches used barycenter heuristics for crossing minimization. In this research we used also barycenter heuristics, but with a little modification in way of calculations of the weight that it is given to each node in the graph.

Bipartite graph crossing minimization algorithms enhance the order of gene expression dataset, and this leads to enhancement in the biclusters extracted by the biclustering algorithm.

We find a way to enhance the barycenter heuristics to give better arrangement of the gene expression dataset, and this is help us to get better results of biclustering algorithm.

In the proposed algorithm we add the rank of each node to the rank of its neighbors, and using the position of each node in the calculations, as we discussed in the proposed algorithm, to give a new rank to each node, which we depend on it to reorder the nodes in each layer.

When we analyze the results shown in figure 3, and by knowing that the number of genes, and conditions of each dataset as stated in table 3, we conclude that the best result of gene expression datasets is the result of lymphoma dataset, which has the minimum number of conditions, and maximum number of genes. The efficiency of the proposed algorithm affected by the number of nodes in each layer, as we increase the difference between then number of nodes in each layer, the efficiency of the proposed algorithm increase.

As we stated in section 4, we didn't focus on the performance of the algorithm, but generally the proposed algorithm takes much time than barycenter algorithm, because our algorithm depend on counting the crossings number for every single iteration of the algorithm, and this takes long time to compute.

The area of biclustering is very complex and all researches in this area is considered as a domain specific solution, and in some cases it's a problem specific solution, we can find algorithms that find good biclusters in some cases, and in other cases these algorithms do not perform well, so there is no perfect biclustering algorithm or magic algorithm that can solve all the problems in a prefect way.

We want to state that the proposed algorithm is an idea for finding the relations between numbers, and its positions in the dataset, and how these positions affect the process of crossing minimization. We take this trend to make it our work in the research, and we cannot say that it is the perfect solution of crossing minimization problem, but it is a new trend of research that we want to go through it.

We summarize the Future works for the proposed algorithm in the following directions: firstly, it can be enhanced to increase the efficiency of crossing minimization algorithm for minimum size gene expression datasets or for datasets with large number of conditions, and secondly, the proposed algorithm can be enhanced in a way that enables to implements it in a parallel way, finally, we can extended the proposed algorithm to be applied to different types of datasets other than gene expression datasets.

6. REFERENCES

[1.] Peter Clote, Rolf Backofen," Computational Molecular Biology, an Introduction", John Wiley & Sons Ltd, 2000.

[2.] Macgregor, P.F. and Squire, J.A. 2002. "Application of microarrays to the analysis of gene expression in cancer". Clin.Chem. 48: 1170-1177.

[3.] Amos Tanay, Roded Sharany, and Ron Shamir, "Biclustering Algorithms: A Survey", 2004.

[4.] S. C. Madeira and A. L. Oliveira, "Biclustering algorithms for biological data analysis: A survey." IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1, 2004.

[5.] A. Abdullah & A. Hussain, "A new Biclustering technique based on crossing minimization", Neurocomputing Journal 69, page 1882-1896, 2006.

[6.] Y. Cheng and G.M. Church. "Biclustering of expression data. In Proceedings of Intelligent Systems for Molecular Biology". 2000.

[7.] Waseem Ahmad, Ashfaq Khokhar, "SPHier: Scalable Parallel Biclustering Using Weighted Bigraph Crossing Minimization", 2007.

[8.] I. M. El Henawy, Ahmed Hussain Kamal and Ibrahim Hamed, "Using Tabu Search as a Crossing Minimization Technique toward Biclustering Gene Expression Datasets". In: Forth International Conference on Intelligent Computing and Information Systems, Cairo, Egypt, 2009.

[9.] Marti R. (1998), "A Tabu Search Algorithm for the Bipartite Drawing Problem", European Journal of Operational Research, Vol. 106, pp. 558-569.

[10.] A. Prelic, S. Bleuler, P. Zimmermann, A. Wille, et al. "A systematic comparison and evaluation of biclustering methods for gene expression data", Bioinformatics 22 (2006), 1122-1129.

[11.] Arifa Nisar, Waseem Ahmad, Wei-keng Liao, Alok N. Choudhary: "High Performance Parallel/Distributed Biclustering Using Barycenter Heuristic". SDM 2009: 1050-1062

[12.] A. Tanay, R. Sharan and R. Shamir, "Discovering statistically significant biclusters in gene expression data." Bioinformatics, 18 (2002), 136-144.

[13.] Ahmad, W., Khokhar, A.: chawk: "A highly efficient biclustering algorithm using Bigraph crossing minimization". In: Second International Workshop on Data Mining and Bioinformatics, VDMB 2007, Vienna, Austria (Held in Conjunction with VLDB2007). (2007).

[14.] Raetz EA, Perkins SL, Bhojwani D, Smock K et al. Gene expression profiling reveals intrinsic differences between T-cell acute lymphoblastic leukemia and T-cell lymphoblastic lymphoma. Pediatr Blood Cancer 2006 Aug; 47(2):130-40.

[15.] Hippo Y, Taniguchi H, Tsutsumi S, Machida N et al. Global gene expression analysis of gastric cancer by oligonucleotide microarrays. Cancer Res 2002 Jan 1; 62(1):233-40.

[16.] Yagi T, Morimoto A, Eguchi M, Hibi S et al. Identification of a gene expression signature associated with pediatric AML prognosis. Blood 2003 Sep 1; 102(5):1849-56.

Tamer Mohamed

Helwan University

Faculty of Computers and Information

Biomedical Informatics Department

Sniper_fci_cs@yahoo.com

Table 1. Gene Expression Data Matrix Cond 1 Cond 2 Cond 3 Cond 4 Gene 1 5 0 6 0 Gene 2 0 2 0 1 Gene 3 0 2 0 1 Gene 4 2 0 1 0 Gene 5 5 0 6 0 Table 2. Crossings number of original dataset, Barycenter algorithm and proposed algorithm Original Barycenter Proposed Dataset Algorithm Algorithm Gastric 11,048,714,141 11,048,713,498 11,048,654,635 AML Prognosis 131,451,569,724 131,451,501,126 131,450,516,864 Lymphoma 100,790,992,882 100,790,992,882 100,790,645,942 Table 3 number of genes and conditions of gene expression Datasets used in the experiment Number of genes Number of conditions Gastric 7,129 30 AML Prognosis 12,625 58 Lymphoma 22,283 29

Printer friendly Cite/link Email Feedback | |

Author: | Mohamed, Tamer |
---|---|

Publication: | International Journal of Digital Information and Wireless Communications |

Article Type: | Report |

Geographic Code: | 1USA |

Date: | Jan 1, 2013 |

Words: | 4381 |

Previous Article: | Impact of a form of online materials on the quality of education a case study. |

Next Article: | A new method to channel estimation in OFDM systems based on wavelet transform. |

Topics: |