
The TAO-Gen algorithm for identifying gene interaction networks with application to SOS repair in E. coli.


One major unresolved issue in the analysis of gene expression data is the identification and quantification of gene regulatory networks. Several methods have been proposed for identifying gene regulatory networks, but these methods predominantly focus on the use of multiple pairwise comparisons to identify the network structure. In this article, we describe a method for analyzing gene expression data to determine a regulatory structure consistent with an observed set of expression profiles. Unlike other methods, this method goes beyond pairwise evaluations by using likelihood-based statistical methods to obtain the network that is most consistent with the complete data set. The proposed algorithm performs accurately for moderate-sized networks, with most errors being minor additions of linkages. However, the analysis also indicates that sample sizes may need to be increased to uniquely identify even moderate-sized networks. The method is used to evaluate interactions between genes in the SOS signaling pathway in Escherichia coli using gene expression data where each gene in the network is overexpressed using plasmid inserts. Key words: gene networks, microarray, Bayesian model selection, SOS repair, toxicogenomics. Environ Health Perspect 112:1614-1621 (2004). doi:10.1289/txg.7105 available via http://dx.doi.org/ [Online 21 July 2004]

**********

Gene expression microarrays (gene chips) have revolutionized biology by generating vast amounts of data roughly quantifying the level of mRNA expression for thousands of genes in a single sample. The analysis of these data is extraordinarily complex, resulting in a shift in biology from predominantly qualitative evaluations to quantitative approaches. With microarray technologies, scientists are forming global views of the structural and dynamic changes in genome activity during different phases in a cell's development and following exposure to external stimulants such as environmental agents or growth factors. These views describe the molecular working of a complex information processing system: the living cell. Numerous methods have already been proposed for the analysis of gene expression data. The most commonly used methods rely on clustering (Eisen et al. 1995; Tamayo et al. 1999), significance testing (Kerr et al. 2000) and sequence motif identification (Pilpel et al. 2001). These methods do not readily reproduce gene expression networks but are more focused on the fundamental linkage between pairs of genes. Other investigators have proposed methods to identify gene regulatory networks using Boolean networks (Akutsu et al. 2000) where each gene has one of only two states (on and off), regression methods (Gardner et al. 2003), Bayesian network models (Friedman et al. 2000; Hartemink et al. 2002) and other methods (Johnson et al. 2004).

The use of genomics data in the evaluation of health hazards and risks has received considerable attention focusing on priority setting (Pesch et al. 2004), biomarker identification (Toraason et al. 2004), hazard identification (Surer et al. 2004), and dose-response analysis (Schonwalder and Olden 2003; Simmons and Portier 2002; Waters et al. 2003). If genomics is to play a direct role in dose-response assessment, there will be a need for methods that provide a direct, quantitative assessment of changes in gene expression as a function of dose and changes in toxicity as a function of changes in gene expression. Developing and modeling gene interaction networks can be quantitative and provide direct dose-response data for use in risk assessment. They also are an excellent means of identifying agents that provide identical changes in expression across a broad spectrum of genes and help link agents on the basis of similar mechanistic changes.

Bayesian networks are well suited for inferring genetic interactions because of their ability to model causal influence between genes linked as a network and because they are an effective method for modeling the joint density of all variables in a system. However, the approaches suggested to date have generally focused on conversion of gene expression data to discrete states and have avoided the use of formal statistical methods for quantifying the joint density of the resulting parameters.

In this article we describe a method for inferring an "optimal" gene interaction network from microarray-based gene expression data. Unlike other network identification methods, the analytical approach presented here uses the actual measured observations on gene expression (rather than discretized data) and incorporates prior distributions for all parameters in the gene interaction network model. The method encompasses model selection theory from Bayesian regression to find gene network structures suitable for given data sets. Computer simulations presented in this article demonstrate that the proposed method is capable of identifying networks, provided the sample size is sufficiently large. For small networks the limited number of replicates used for most microarray studies available today is adequate; for larger networks other options are discussed.

Materials and Methods

Figure 1 illustrates the general structure of a four-gene regulatory system where the linkage between expression of gene $i$ and expression of its parents (indirect regulators of gene $i$) is described by the weighting function $w_i(n_i)$, where the subscript $i$ denotes that this weighting function pertains to the control of gene $i$ expression by all genes linked to it and $n_i$ denotes the vector of parameters defining the functional relationship. Let $N$ be a directed acyclic graph which consists of $p$ vertices (genes). Each edge is also assumed to include information about the linkage between genes (i.e., activation, as in the case for the linkage between expression of gene 1 and expression of gene 4, or suppression, as for expression of genes 3 and 4). In essence, $N$ is a discrete random variable that takes on any of the different acyclic network structures that are possible for a set of $p$ genes. Define $X_i$ to be the random variable corresponding to the measured relative level of gene expression (the ratio of the expression level of a target gene for an "exposed" group to the expression level of the same gene in a "control" group) for gene $G_i$, $1 \le i \le p$. For a given network, $N = n$, and for each $X_i$, define the conditional density function $f_{X_i}(X_i \mid pa_n(X_i), n_i)$, where $pa_n(X_i)$ denotes the set of vertices corresponding to the parents of expression for gene $i$ in the network $n$ with parameters $n_i$. All networks in the support space for $N$ are assumed to satisfy the Markov property where expression of gene $i$ is independent of all genes not included in $pa_n(X_i)$. Application of the Markov property and imposition of the acyclic restriction allow decomposition of the joint density function into

$$f(X_1, \ldots, X_p \mid N = n, n_1, \ldots, n_p) = \prod_{i=1}^{p} f_{X_i}\big(X_i \mid pa_n(X_i), n_i\big) \tag{1}$$

where $(n_1, n_2, \ldots, n_p)$ is the set of all parameters in the network.
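The factorization in Equation 1 can be illustrated with a short Python sketch. It is a minimal example, not the authors' implementation: the four-gene parent lists and the stand-in conditional density `toy_cond_logpdf` (a normal density on the log scale with made-up coefficients) are assumptions chosen only to show that the log joint density is a sum of one conditional term per gene, and that the parent structure must be acyclic.

```python
import math

def topological_order(parents):
    """Order genes so that every parent precedes its children.

    `parents` maps each gene to a list of its parents; every gene must
    appear as a key. Raises ValueError if the graph contains a cycle.
    """
    remaining = {g: set(ps) for g, ps in parents.items()}
    order = []
    while remaining:
        roots = sorted(g for g, ps in remaining.items() if not ps)
        if not roots:
            raise ValueError("network contains a cycle")
        for g in roots:
            order.append(g)
            del remaining[g]
        for ps in remaining.values():
            ps.difference_update(roots)
    return order

def log_joint(values, parents, cond_logpdf):
    """Log of the product in Equation 1: one conditional term per gene."""
    topological_order(parents)  # only used here to reject cyclic networks
    return sum(
        cond_logpdf(g, values[g], {q: values[q] for q in parents[g]})
        for g in parents
    )

def toy_cond_logpdf(gene, x, parent_values):
    """Hypothetical conditional density: ln x ~ Normal(0.5 * sum ln parents, 1)."""
    mean = 0.5 * sum(math.log(v) for v in parent_values.values())
    return -0.5 * math.log(2 * math.pi) - 0.5 * (math.log(x) - mean) ** 2

# Illustrative four-gene network (structure assumed for the example).
network = {1: [], 2: [1], 3: [2], 4: [1, 3]}
levels = {1: 1.2, 2: 0.8, 3: 1.5, 4: 0.9}
print(topological_order(network))
print(log_joint(levels, network, toy_cond_logpdf))
```

Because each gene contributes its own term, changing the parent set of one gene changes only that gene's term, which is what makes gene-by-gene scoring of candidate networks practical.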

Gene expression data, for the purposes of this analysis, can be expressed as a $p \times m$ matrix of the form $\bar{x} = [x_{ik}]$, $i = 1, 2, \ldots, p$, $k = 1, 2, \ldots, m$, where $m$ is the number of observations (samples analyzed for gene expression) taken for each gene and $\bar{x}_i = [x_{ik}]$, $k = 1, 2, \ldots, m$, is the vector of all observations of expression for gene $i$. The observed gene expression levels for the parent set for gene $i$ in vector notation are $pa_n(x_i) = [x_{i_j k}]$, $j = 1, 2, \ldots, p_i$, $k = 1, 2, \ldots, m$, where $i_j$ indexes the $j$th parent of gene $i$ and $p_i$ is the number of parents for gene $i$. Similarly define the random vector $\bar{X}$. Then, conditional on the parameters and the model, the likelihood of the data $\bar{x}$ is given by

$$L(\bar{x} \mid N = n, n_1, \ldots, n_p) = \prod_{i=1}^{p} f_{X_i}\big(\bar{x}_i \mid pa_n(x_i), n_i\big) \tag{2}$$

The goal of our analysis is the identification of the "best" network structure using gene expression data. Our criterion for the best network is defined as the network, $n^{*}$, from the set of all acyclic networks that maximizes the posterior likelihood of the network,

$$n^{*} = \arg\max_{n} \Pr(N = n \mid \bar{x}). \tag{3}$$

The posterior probability $\Pr(N = n \mid \bar{x})$ is given by

$$\Pr(N = n \mid \bar{x}) \propto \Pr(N = n) \prod_{i=1}^{p} \int f_{X_i}\big(\bar{x}_i \mid pa_n(x_i), n_i\big)\, f_{n_i}(n_i)\, dn_i \tag{4}$$

where $\Pr(N = n)$ and $f_{n_i}(n_i)$ are derived from the prior distributions of $N$ and $n_i$, respectively, and the $n_i$ are assumed independent.

Several methods are available for assigning prior information to the distribution of countable networks for a given set of genes. One approach, which is used here, is to assume no prior knowledge by choosing $N$ to be uniformly distributed (equal probability) over the space of all possible acyclic networks. By this assumption the solution to Equation 3 is identical to finding the maximum of the log of the product term in Equation 4 over the parameter space; that is, the solution to Equation 3 is identical to

$$n^{*} = \arg\max_{n} \sum_{i=1}^{p} \ln \int f_{X_i}\big(\bar{x}_i \mid pa_n(x_i), n_i\big)\, f_{n_i}(n_i)\, dn_i. \tag{5}$$

This equation is similar to the maximum likelihood estimator in classical statistical theory, but weighted over the prior densities for the parameters in the model. A clear benefit of this approach is that one does not need to estimate the model parameters while finding the best network because the integration removes those parameters from the final solution. A possible criticism of this approach is that the assumption of a uniform prior for network structure fails to completely exploit the prior knowledge of which networks are of greatest interest. This is most certainly true, but in light of our limited understanding of gene interaction networks, this appears to be a reasonable choice for a first step in network identification. When available, prior knowledge can be incorporated into this algorithm or modified algorithms to limit the space of networks to be searched; this is the solution to a different problem and will be discussed in a subsequent report.
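As a small illustration of the role of the uniform prior, the Python sketch below turns a set of log integrated likelihoods into posterior probabilities and picks the maximizer. The network labels and the numerical values are invented for the example; the only point is that, with equal prior probability on every candidate, ranking by posterior probability and ranking by the log integrated likelihood give the same best network.

```python
import math

def posterior_over_networks(log_marginals):
    """Posterior probabilities under a uniform prior over the listed networks.

    `log_marginals` maps a network label to the log of its integrated
    likelihood. The maximum is subtracted before exponentiating to avoid
    numerical underflow.
    """
    peak = max(log_marginals.values())
    weights = {n: math.exp(v - peak) for n, v in log_marginals.items()}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}

def best_network(log_marginals):
    """Label of the network with the largest log integrated likelihood."""
    return max(log_marginals, key=log_marginals.get)

# Hypothetical scores for three candidate networks.
scores = {"A": -10.0, "B": -12.0, "C": -15.0}
posterior = posterior_over_networks(scores)
print(best_network(scores), posterior)
```

If an informative prior over networks were used instead, the log prior probability of each network would simply be added to its score before the comparison.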

Many possible weighting functions $w_i(n_i)$ can be used to relate the relative level of expression of gene $i$ to the relative levels of expression of its parents. The analysis presented here uses a log-linear model

$$\ln X_i = \sum_{j=1}^{p_i} \beta_{i i_j} \ln X_{i_j} + \varepsilon_i$$

where the notation $i_j$ refers to the $j$th parent of gene $i$, $\bar{\beta}_i = [\beta_{i i_j}]_{1 \times p_i}$ and $\varepsilon_i$ is a random variable with mean 0. From a mechanistic basis, using a model linear in the logarithms of the expression levels is equivalent to approximating the full nonlinear system by equations in power-law form (Kikuchi et al. 2003; Voit and Radivoyevitch 2000).
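The equivalence between the log-linear form and the power-law form can be checked numerically. The Python sketch below is illustrative only: the parent expression levels and the coefficients are made-up values, and the error term is omitted so that the two forms can be compared exactly.

```python
import math

def predict_log_linear(parent_levels, betas):
    """Noise-free value of ln X_i: sum over parents of beta * ln(parent level)."""
    return sum(b * math.log(x) for b, x in zip(betas, parent_levels))

def predict_power_law(parent_levels, betas):
    """Same prediction on the original scale: product of parent ** beta."""
    result = 1.0
    for b, x in zip(betas, parent_levels):
        result *= x ** b
    return result

# Hypothetical gene with two parents: a positive and a negative coefficient.
parents = [2.0, 0.5]
betas = [0.8, -0.3]
print(math.exp(predict_log_linear(parents, betas)))
print(predict_power_law(parents, betas))
```

In this parameterization a positive coefficient raises the predicted expression when the parent is overexpressed (relative level above 1) and a negative coefficient lowers it, which is one natural way to read activation and suppression off the fitted model.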

Given prior distributions for the $\varepsilon$'s and the $\beta$'s for all genes, the Markov-Chain Monte-Carlo (MCMC) method developed by Hastings (Hastings 1970) makes it possible to estimate a solution to Equation 5 and identify the "best" network. It is possible, under further restrictions, to obtain a closed form solution to the argument in Equation 5. The advantage of this approach in the framework of this article is that the entire network space can be searched exhaustively to find the best network for small networks like the ones in our simulation studies.

As is common in Bayesian linear regression theory (Gelman et al. 1995), we assume that $\varepsilon_i \mid \sigma_i^2 \sim \mathrm{Normal}(0, \sigma_i^2)$, $\bar{\beta}_i \mid \sigma_i^2 \sim \mathrm{Normal}(\bar{b}, \sigma_i^2 A_i^{-1})$ and $\sigma_i^2 \sim \mathrm{Gamma}(v_0/2, v_1/2)$, $v_0, v_1 \approx 0$. These priors do not assume additional or specific information (in Bayesian parlance these are uninformative priors) and thus would be applicable for many cases. Simple algebra then results in:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

where $\Gamma$ is the gamma function, $A_i = \ln[pa_n(x_i)]\, \ln[pa_n(x_i)]^{T}$ and $B_i = \ln[\bar{x}_i]\, \ln[pa_n(x_i)]^{T} A_i^{-1}$. Given $N = n$, this equation allows for the direct calculation of $\Pr(N = n \mid \bar{x})$. This formula is specific to these priors, but similar formulae might be derived for other cases.
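For orientation, a standard conjugate-prior calculation under these priors leads to a closed form of the following shape. This is a sketch under stated assumptions and may be arranged differently from the authors' Equation 6: it assumes the Gamma distribution is placed on the precision $1/\sigma_i^2$ with shape $v_0/2$ and rate $v_1/2$, it works with the density of the log-transformed data, and the shorthand $y_i = \ln[\bar{x}_i]$ (a $1 \times m$ row vector), $Z_i = \ln[pa_n(x_i)]$ (a $p_i \times m$ matrix) and $S_i$ is introduced here for convenience.

```latex
% Marginal density of y_i after integrating out \bar{\beta}_i and \sigma_i^2
\int f(y_i \mid Z_i, \bar{\beta}_i, \sigma_i^2)\,
     f(\bar{\beta}_i \mid \sigma_i^2)\, f(\sigma_i^2)\,
     d\bar{\beta}_i\, d\sigma_i^2
  = \pi^{-m/2}\, 2^{-p_i/2}\,
    \frac{\Gamma\!\left(\frac{v_0 + m}{2}\right)}{\Gamma\!\left(\frac{v_0}{2}\right)}\,
    v_1^{\,v_0/2}\,
    \left(v_1 + S_i\right)^{-(v_0 + m)/2},

% Residual sum of squares plus a penalty for departing from the prior mean
S_i = (y_i - B_i Z_i)(y_i - B_i Z_i)^{T}
      + \tfrac{1}{2}\,(B_i - \bar{b})\, A_i\, (B_i - \bar{b})^{T},
\qquad A_i = Z_i Z_i^{T}, \quad B_i = y_i Z_i^{T} A_i^{-1}.
```

The factor $2^{-p_i/2}$ arises because the prior precision matrix $A_i$ equals $Z_i Z_i^{T}$, so the posterior precision is $2A_i$ and the ratio of determinants is $2^{-p_i}$. For a gene with no parents ($p_i = 0$) the expression reduces to the same form with $S_i = y_i y_i^{T}$.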

Any single gene in a $p = 4$ gene network has 8 possible sets of parents (no parents, 3 single parents, 3 double parents, all other genes), hence the total number of networks including cyclic networks would be $8^4 = 4{,}096$ networks, of which 543 are acyclic. As $p$ increases, the total number of networks grows exponentially in the square of $p$, as $2^{p(p-1)}$, resulting in a very large network space to evaluate for larger networks (e.g., $4 \times 10^{469}$ for a 40-gene network). Many different types of searching algorithm could be used to limit the number of networks to be evaluated for Equation 6; through trial and error, the following modified simulated annealing algorithm (Press et al. 1989) appears to work. We will refer to this method as the TAO-Gen (Theoretical Algorithm for identifying Optimal GENe interaction networks) algorithm.
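These counts can be reproduced with a few lines of Python. The sketch below uses Robinson's recurrence for the number of labeled directed acyclic graphs, which is a standard combinatorial result rather than something taken from this article; the function names are illustrative.

```python
from math import comb

def total_networks(p):
    """Parent-set assignments when cycles are allowed: (2 ** (p - 1)) ** p."""
    return 2 ** (p * (p - 1))

def acyclic_networks(p):
    """Number of labeled directed acyclic graphs on p genes (Robinson's recurrence)."""
    counts = [1]  # exactly one (empty) network on zero genes
    for n in range(1, p + 1):
        counts.append(sum(
            (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * counts[n - k]
            for k in range(1, n + 1)
        ))
    return counts[p]

print(total_networks(4))              # 4096
print(acyclic_networks(4))            # 543
print(len(str(total_networks(40))))   # 470 digits, i.e. on the order of 10**469
```

Python integers have arbitrary precision, so the 40-gene count is computed exactly; converting it to a float for formatting would overflow, which is why the example reports its number of digits instead.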

The TAO-Gen algorithm has 7 basic steps:

1) Search conditions: Restrict to $\zeta < p$, the maximum number of parents for any one gene, and calculate the value of Equation 6 for all $\sum_{i=0}^{\zeta} {}_{p-1}C_i$ parent combinations, where ${}_{p-1}C_i$ is the binomial coefficient. (When $p$ is relatively small, $\zeta = p - 1$ can be chosen and the entire network space is evaluated in this step. When $p$ is even moderately large (> 10), assuming $\zeta = 4$ or 5 will substantially reduce the computational burden.) Specify a number $t$ ($0 \le t \le 1$) governing the probability of local versus global switching in step 4 ($t = 0$ implies only global switching, $t = 1$ implies only local switching).

2) For the initial step $k = 0$, randomly select an order in which genes enter the network, $G_k = (G_{k,1}, G_{k,2}, G_{k,3}, \ldots, G_{k,p})$, and build a starting network by choosing the parents for each gene that maximize Equation 6 while keeping the network acyclic (i.e., choose the parents for $G_{k,1}$ that are optimal first, then parents for $G_{k,2}$ that are optimal, etc.).

3) Calculate the posterior likelihood (Equation 4) for this network and denote it $L_k$.

4) Generate a uniform random number $u_1 \in \mathrm{uniform}(0,1)$ to determine the type of permutation. If $u_1 < t$, the permutation occurs between two randomly chosen genes, $j$ and $l$, switching the two genes for the next permutation ($G_{k+1,j} = G_{k,l}$ and $G_{k+1,l} = G_{k,j}$). Otherwise, make the second half of the set of genes, starting from randomly chosen gene $j$, appear first in the order ($G_{k+1,1} = G_{k,j+1}$, $G_{k+1,2} = G_{k,j+2}$, ..., $G_{k+1,p-j+1} = G_{k,1}$, ..., $G_{k+1,p} = G_{k,j}$). Thus form a new gene order, $G_{k+1}$.

5) Calculate a new posterior likelihood of the network $L_{k+1}$ associated with the order $G_{k+1}$, as in steps 2 and 3. If $L_{k+1} > L_k$ then keep $G_{k+1}$. Otherwise generate a uniform random number $u_2 \in \mathrm{uniform}(0,1)$, and if $u_2 \le L_{k+1}/L_k$, keep $G_{k+1}$; else set $G_{k+1} = G_k$.

6) Return to step 4 and iterate.

7) Choose the network with the highest posterior probability from the sequence ($G_0$, $G_1$, ...).
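The seven steps above can be sketched in Python as follows. This is a simplified reading of the procedure, not the authors' code. The callable `score(gene, parent_set)` is a hypothetical stand-in for the log of the per-gene term evaluated in step 1 (Equation 6), so network scores are sums of logs and the acceptance ratio of step 5 becomes the exponential of a difference. The toy score at the end, which rewards agreement with an invented three-gene network, exists only to make the example runnable.

```python
import itertools
import math
import random

def creates_cycle(parents, gene, candidate):
    """True if adding edges candidate -> gene would close a directed cycle."""
    stack, seen = list(candidate), set()
    while stack:  # walk upward through the ancestors of the candidate parents
        node = stack.pop()
        if node == gene:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return False

def build_network(order, score, zeta):
    """Steps 2-3: pick the best acyclic parent set for each gene, in order."""
    parents, total = {}, 0.0
    for gene in order:
        others = [g for g in order if g != gene]
        options = [frozenset(c) for r in range(zeta + 1)
                   for c in itertools.combinations(others, r)]
        options.sort(key=lambda c: score(gene, c), reverse=True)
        best = next(c for c in options if not creates_cycle(parents, gene, c))
        parents[gene] = best
        total += score(gene, best)
    return parents, total

def propose(order, t, rng):
    """Step 4: local swap with probability t, otherwise a global rotation."""
    new = list(order)
    if rng.random() < t:
        j, l = rng.sample(range(len(new)), 2)
        new[j], new[l] = new[l], new[j]
    else:
        j = rng.randrange(1, len(new))
        new = new[j:] + new[:j]
    return new

def tao_gen_sketch(genes, score, zeta, t, iterations, seed=0):
    rng = random.Random(seed)
    order = list(genes)
    rng.shuffle(order)
    net, log_l = build_network(order, score, zeta)
    best_net, best_log_l = net, log_l
    for _ in range(iterations):  # steps 4-6
        cand = propose(order, t, rng)
        cand_net, cand_log_l = build_network(cand, score, zeta)
        if cand_log_l > log_l or rng.random() <= math.exp(cand_log_l - log_l):
            order, net, log_l = cand, cand_net, cand_log_l
            if log_l > best_log_l:
                best_net, best_log_l = net, log_l
    return best_net, best_log_l  # step 7

# Toy usage: the "true" network and the score are invented for illustration.
TRUE_NET = {1: frozenset(), 2: frozenset({1}), 3: frozenset({2})}
def toy_score(gene, parent_set):
    return -len(parent_set ^ TRUE_NET[gene])

print(tao_gen_sketch([1, 2, 3], toy_score, zeta=2, t=0.5, iterations=20, seed=1))
```

In a real application the scores for every gene and parent combination would be computed once and cached, as step 1 describes, since the sketch above re-evaluates `score` on every call.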

This algorithm combines aspects of the Metropolis algorithm used for Markov-Chain Monte-Carlo sampling (Hastings 1970), with the simulated annealing algorithm used for optimization (Press et al. 1989). In essence it represents a new form of genetic algorithm aimed at networks in which mutations occur in each cycle as either base-pair switches or large translocations. It may be possible under certain fixed conditions to analytically determine the degree to which the TAO-Gen algorithm reduces the number of networks to be evaluated and the efficiency with which it finds the correct solution. This is left as a separate exercise; instead, simulation studies were used to address these issues as discussed in "Results."

Gene Expression Data Set

Gardner et al. (2003) developed a gene-regulatory network for a nine-gene subnetwork of the SOS pathway in Escherichia coli. The nine genes they focused on (all gene names and locators, in parentheses following gene name, are from the EcoGene database; http://bmb.med.miami.edu/EcoGene/EcoWeb) were the principal mediators of the SOS response, recA (recombinase gene A, locator EC10823) and lexA (lambda excision gene A, locator EC10533); genes with known involvement in the SOS response, ssb (single strand binding gene, locator EC10976), recF (recombinase gene F, locator EC10828), dinI (damage inducible gene I, locator EC12670), umuDC (UV mutator gene, locator EC11057); and three sigma factor genes whose function in SOS response is not clearly identified, rpoD (RNA polymerase factor subunit D, locator EC10896), rpoH (RNA polymerase factor subunit H, locator EC10897), and rpoS (RNA polymerase factor subunit S, locator EC10510). To quantify the subnetwork, they applied a set of nine transcriptional perturbations to E. coli cells in which each perturbation overexpressed a different one of the nine genes in the SOS network. Using an arabinose-controlled episomal expression plasmid, they grew the cells in batch cultures for 5.5 hr after the addition of arabinose, then measured relative change in message for their nine target genes using quantitative real-time polymerase chain reaction. In addition to the nine perturbed cultures, they also produced two additional cultures, one in which a double plasmid (lexA/recA) was incorporated into the cells and another in which 0.75 μg/mL of mitomycin C (MMC) was added to the culture to stimulate gene expression of recA. The resulting data set with 11 samples of relative changes in gene expression for the nine target genes is given in Table S1 in Gardner et al. (2003). In addition to the nine target genes, the nine plasmid constructs were added to the modeling as fixed stimulators of each of their respective genes to mimic changes in gene expression induced by insertion of the ten plasmid constructs. A separate stimulation by MMC was also included but with links to all genes in the network to determine if the predominant linkage to recA assumed by Gardner et al. (2003) was evident in the data. The exact model linking genes for sample $k$ ($k = 1, 2, \ldots, 11$) is given by

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

where $\beta_{i i_j}$ is as described previously, $I_{ik}$ is an indicator variable equal to 1 if gene $i$ has an inserted plasmid in sample $k$ and is equal to 0 otherwise, $\alpha_i$ is the magnitude of increase in gene expression induced in the $i$th gene by the plasmid when it is present, $M_k$ is the relative change (relative to the standard of 0.5 μg/mL) in MMC exposure for sample $k$, and $\gamma_i$ is the magnitude of change in gene expression for gene $i$ as a function of the relative change in MMC.

Simulation Results

Data were simulated for a given network by sampling from the assumed error distributions and priors for a given model situation. To simulate a network, genes highest on the parental list were simulated first and the simulated values were used to simulate daughters, etc. Different starting points and different priors were used to estimate parameters in both the simulated data and the SOS data; these had no impact on the final results provided the priors chosen were uninformative.
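The parents-first simulation order described here can be sketched in Python. This is a minimal illustration under simplifying assumptions: the coefficients are fixed by the caller rather than drawn from their priors, all genes share a single error standard deviation `sigma`, and the three-gene network in the usage lines is invented for the example.

```python
import math
import random

def simulate_network(parents, betas, sigma, m, seed=0):
    """Simulate m samples of relative expression from a log-linear network.

    parents: gene -> list of parent genes (every gene must be a key)
    betas:   gene -> {parent: coefficient on the log scale}
    sigma:   standard deviation of the normal error on the log scale
    """
    rng = random.Random(seed)
    # Order the genes so that parents are always simulated before daughters.
    order, placed = [], set()
    while len(order) < len(parents):
        ready = [g for g in parents
                 if g not in placed and set(parents[g]) <= placed]
        if not ready:
            raise ValueError("network contains a cycle")
        order.extend(ready)
        placed.update(ready)
    data = {g: [] for g in parents}
    for _ in range(m):
        log_x = {}
        for g in order:
            mean = sum(betas[g][q] * log_x[q] for q in parents[g])
            log_x[g] = mean + rng.gauss(0.0, sigma)
        for g in order:
            data[g].append(math.exp(log_x[g]))
    return data

# Hypothetical chain: gene 1 activates gene 2, gene 2 suppresses gene 3.
net = {1: [], 2: [1], 3: [2]}
coef = {1: {}, 2: {1: 0.9}, 3: {2: -0.6}}
sample = simulate_network(net, coef, sigma=0.2, m=5, seed=42)
print({g: [round(v, 3) for v in vals] for g, vals in sample.items()})
```

Feeding data simulated this way into a network search and comparing the recovered parent sets with `net` is one way to check how many samples are needed before the true structure is identified.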

Results

The TAO-Gen algorithm was applied to real-time PCR (polymerase chain reaction) data on nine genes (recA, lexA, ssb, recF, dinI, umuDC, rpoD, rpoH, and rpoS) from the SOS pathway in E. coli as described above. Data consisted of 11 separate relative changes in gene expression: 9 samples for which a plasmid was inserted for one of the nine genes, a single construct for a combination of two genes (lexA and recA), and a modification of the culture (a 1.5-fold increase in mitomycin C) in wild-type cells. Figure 2 illustrates the optimal gene interaction network identified by the TAO-Gen algorithm for these data. It is generally believed that the SOS regulon in E. coli is predominantly under the control of the products of the genes lexA and recA. Figure 3 illustrates a literature-based linkage map between genes in the SOS response for the repair of DNA damage. When genotoxins, such as ultraviolet radiation and MMC, damage DNA base nucleotides, the replication process is activated and a region of single-stranded DNA (ssDNA) is formed. RecA (the product of recA) coats ssDNA, signaling the SOS response. RecA/ssDNA stimulates degradation of LexA (the product of lexA), which is a repressor of RecA in the normal repair process. This inactivation of LexA affects other genes involved directly in the SOS response, such as dinI, and downstream genes involved in DNA replication, cell division, and mutagenesis, such as rpoS (Beuning 2004; Janion 2001; Lindner 2004; Lusetti 2002; McKenzie 2000; Rangarajan 2002). The results from the TAO-Gen algorithm are given in Figure 2 and support this role for lexA, with significant repressor activity on umuDC, dinI, and ssb. In contrast, RecA, the gene product of recA, is expected to serve as an activator of the SOS regulon. Figure 2 indicates that recA serves as a central node in the regulation of genes in the SOS pathway, showing significant activation of lexA, recF, umuDC, rpoH, and ssb and significant repression of rpoD. There are four remaining significant linkages: ssb and rpoS repress and activate rpoD, respectively; recF activates umuDC; and rpoH activates ssb. Table 1 provides summary information on the parameter estimates obtained by treating the identified network (Figure 2) as known and quantifying the linkages between genes by the method of Toyoshiba et al. (2004). With the exception of the plasmid-induced change in recF, all linkages in Figure 2 are statistically significant (p < 0.05).

[FIGURE 2 OMITTED]

An indicator variable was used to separate data with and without plasmid insertion for each gene. For all nine genes, plasmid inserts increased mRNA levels, ranging from a nonsignificant (p = 0.31) 1.06-fold increase for recF to a significant (p < 0.01) 28-fold increase for rpoH. Changes in the level of MMC had significant effects on eight of the nine genes, the sole exception being lexA, which did not appear to be directly affected by changes in MMC. This finding contrasts with the view that recA is the presumed transcriptional target of MMC. It was previously suggested that all other MMC-induced changes in transcription are mediated through recA. In this analysis the largest impacts of MMC on transcription were for rpoH and rpoS (an ~12.3-fold increase in activity for each doubling of the MMC level), followed by effects on recA, dinI, and umuDC (approximately a 1.9-fold increase in activity for each doubling of the MMC level).
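The plasmid fold changes quoted above can be checked against Table 1 with a short sketch. It assumes that the plasmid-insert coefficients in Table 1 are on the natural-log scale, so that the fold change is the exponential of the posterior mean; the function name fold_change is hypothetical.

```python
import math

# Posterior means for the plasmid-insert indicator, copied from Table 1.
plasmid_effect = {"recF": 0.062, "rpoH": 3.319}


def fold_change(coefficient):
    """Fold change implied by a coefficient on the natural-log scale (assumed)."""
    return math.exp(coefficient)


for gene, coef in plasmid_effect.items():
    print(f"{gene}: {fold_change(coef):.2f}-fold")
```

Under this assumption the recF coefficient gives about a 1.06-fold change and the rpoH coefficient a change between 27- and 28-fold, in line with the values stated in the text.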

Our best network (Figure 2) and the literature-based network (Figure 3) support the notion that the activation of the SOS system is through activation of recA. Increases in recA result in activation of umuDC and ssb, critical components in the activation of repair of single-strand DNA damage. An increase in recA also induces an increase in lexA, which serves to suppress the activity induced by recA in umuDC and ssb. rpoH appears to serve as an independent activator of ssb with signaling from recA and possibly other genes not included in the network. Finally, while rpoS and rpoD seem to be linked to the network, they appear to be under control of other genes in the network rather than exerting control over the SOS response. Recent articles hypothesized possible roles for RpoS, LexA, and RecA in global stress gene regulation, but clear conclusions are not yet available (Gerard et al. 1999; Gill et al. 2000).

With such a small number of samples (11) relative to the number of genes involved (9), it is likely that the resulting model is overly sensitive to any one data set. To evaluate this, we applied the TAO-Gen analysis to 11 data sets in which one sample from the original data was eliminated. Generally, removing a sample resulted in deletion of a connection rather than inclusion of new connections. Removing the dinI plasmid insert had no impact on the resulting network; removing the double plasmid insert only added a single additional connection between rpoH and rpoS; and removing the MMC sample (no plasmid insert) removed only one linkage (rpoH-rpoS). All other sample removals resulted in two to five changes in the network, with no more than one additional linkage in any case. Three linkages (recA to lexA, lexA to umuDC, and recF to umuDC) remained unchanged for all sample deletions; all others were simply eliminated once or twice for specific sample deletions, with the exceptions of recA to rpoH, which was removed in four sample deletions, and rpoS to rpoD, which was removed in one sample deletion and switched direction for three sample deletions. All additional linkages (there were six sample deletions with one additional linkage in each case) included at least one of the stationary phase regulators (rpoH, rpoS, rpoD), suggesting the linkage between this class of genes and the SOS pathway may be too distant to quantify. Generally, with the exception of linkages to and between the stationary phase regulators, the model was fairly stable across deletions of single samples from the data set.
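The leave-one-sample-out check described above amounts to rerunning the network search on each reduced data set and classifying how the edge set changes. The sketch below shows only that bookkeeping. The search itself is replaced by a placeholder, toy_search, and the gene and sample names (g1, s1, and so on) are invented for illustration; none of this reproduces the TAO-Gen search or the SOS data.

```python
def compare_networks(reference, candidate):
    """Classify differences between two sets of directed (parent, child) edges."""
    reversed_edges = {(p, c) for (p, c) in reference
                      if (c, p) in candidate and (p, c) not in candidate}
    removed = (reference - candidate) - reversed_edges
    added = {(p, c) for (p, c) in candidate - reference
             if (c, p) not in reversed_edges}
    return {"removed": removed, "added": added, "reversed": reversed_edges}


def leave_one_out(samples, infer_network):
    """Re-run a network search with each sample removed in turn."""
    full = infer_network(samples)
    report = {}
    for name in samples:
        reduced = {k: v for k, v in samples.items() if k != name}
        report[name] = compare_networks(full, infer_network(reduced))
    return report


def toy_search(samples):
    """Placeholder search (illustration only): one edge depends on sample s2."""
    edges = {("g1", "g2"), ("g2", "g3")}
    if "s2" not in samples:
        edges.discard(("g2", "g3"))
    return edges


report = leave_one_out({"s1": None, "s2": None, "s3": None}, toy_search)
```

Counting the removed, added, and reversed edges per deleted sample gives the kind of stability summary reported in the paragraph above.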

[FIGURE 3 OMITTED]

Discussion

The network presented in Figure 2 is substantially smaller than that proposed by Gardner et al. (2003). Using their NIR (network identification by multiple regression) algorithm, they identified a network with 45 linkages (excluding changes due to MMC or the plasmids) compared with our network with only 13 gene linkages. There are significant differences between the NIR and TAO-Gen algorithms that directly affect these findings. In the NIR algorithm, parents for each gene are discovered independently of the other genes by finding the five parents that maximize the usual likelihood of the data given the model. The choice of five parents is somewhat arbitrary, and the use of the data multiple times for each gene overstates the information available. In addition, each gene is allowed to be a parent of itself, creating a singularity in the model that results in most other parents having no significant impact on any given gene expression level. Of the 36 linkages (six parents were chosen for recF) identified by the NIR algorithm, all nine genes have significant linkages with themselves as parents. Of the remaining 27 linkages, only 9 are significant (p < 0.05 by a Wald test), as follows: ssb activates recA and recF, recA suppresses lexA and rpoH, dinI activates recA, umuDC, and rpoS, rpoH suppresses rpoD, and rpoS suppresses recF. The TAO-Gen algorithm, in contrast, restricts the network to acyclic linkages and uses the full likelihood (all of the data simultaneously) to find the best network. Of the 9 significant linkages identified by the NIR algorithm, the TAO-Gen algorithm identified only the two linkages from recA, to lexA and to rpoH (both estimated as activation in Table 1). The significant findings by the NIR algorithm do not identify recA as a key controlling gene in the network, whereas the TAO-Gen algorithm does.

Mathematically, the data obtained by Gardner et al. (2003) do not have sufficient statistical support to identify a cyclical network. The data required to estimate parameters in a cyclical network must contain observations at different time points to estimate the dynamic characteristics of a cyclic network. To directly compare the Gardner et al. network to the one shown in Figure 2, the Gardner et al. network was made acyclic by removing the linkages for genes as their own parents and by removing the linkage between dinI and lexA. When the Bayesian estimation algorithm was applied (Toyoshiba et al. 2004), the posterior log-likelihood for this model had a mean value of 329.2 compared with 354.7 from the model identified by the TAO-Gen algorithm, suggesting a considerably better fit of the model in Figure 2 to the data. Using the "known model" suggested by Gardner et al. (2003), the resulting mean of the posterior log-likelihood was 311.0, also suggesting a serious lack of fit.
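Making a network acyclic, as described above, involves dropping self-linkages and then confirming that no cycle remains. The sketch below illustrates both steps on an invented toy edge set; the gene names geneA, geneB and geneC are hypothetical and the edges are not those of the Gardner et al. network. The cycle check uses Kahn's algorithm, which is a standard technique and not something taken from the source.

```python
def drop_self_links(edges):
    """Remove linkages in which a gene is its own parent."""
    return {(p, c) for (p, c) in edges if p != c}


def is_acyclic(edges):
    """Kahn's algorithm: a graph is acyclic if every node can be removed
    in turn once it has no remaining incoming edges."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    for _, child in edges:
        indegree[child] += 1
    queue = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        for parent, child in edges:
            if parent == node:
                indegree[child] -= 1
                if indegree[child] == 0:
                    queue.append(child)
    return visited == len(nodes)


# Invented toy network with one self-link and one three-gene cycle.
toy = {("geneA", "geneA"), ("geneA", "geneB"),
       ("geneB", "geneC"), ("geneC", "geneA")}
no_self = drop_self_links(toy)
repaired = no_self - {("geneC", "geneA")}  # break the remaining cycle by hand
```

In this toy case removing the self-link alone is not enough, which mirrors the need in the text to remove one further linkage after the self-parent linkages were dropped.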

So is the model presented in Figure 2 a better representation of the gene interaction network for the SOS pathway in E. coli? The resulting network has identified the significant gene linkages seen in the data. It correctly identifies recA as playing the major role in control of this pathway and provides estimates of the steady-state linkage between these genes. The interpretation of the values estimated for the parameters linking genes in Figure 2 does not preclude that the network could be dynamic with substantial feedback; such a possibility is likely. But given the data available, this network identifies the key linkages that exist as the network changes from one steady state to another. What this means can be explained by example. The activation of recF by recA has a mean value of 0.393. This implies that, if the steady-state expression of recA doubles, then the steady-state expression of recF would increase by the exponential of 0.393 x ln(2), or 1.32-fold. Singular changes in any gene in the network can easily be used to calculate new steady-state conditions for the network.
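The steady-state calculation in the example above can be extended to the whole network with a short sketch. It assumes that log-expression of each gene is a linear function of the log-expression of its parents, so that a change in one gene propagates downstream in parents-first order. The linkage values are the posterior means from Table 1; the names LINKS, ORDER and propagate are hypothetical.

```python
import math

# Posterior mean linkage strengths from Table 1: (parent, child) -> mean.
LINKS = {
    ("recA", "lexA"): 0.435, ("recA", "ssb"): 0.137, ("recA", "recF"): 0.393,
    ("recA", "umuDC"): 0.365, ("recA", "rpoD"): -0.356, ("recA", "rpoH"): 0.193,
    ("lexA", "ssb"): -0.158, ("lexA", "dinI"): -0.287, ("lexA", "umuDC"): -0.550,
    ("ssb", "rpoD"): -0.077, ("recF", "umuDC"): 0.512, ("rpoH", "ssb"): 0.031,
    ("rpoS", "rpoD"): 0.496,
}
# One ordering in which every parent precedes its children.
ORDER = ["recA", "rpoS", "lexA", "recF", "rpoH", "dinI", "umuDC", "ssb", "rpoD"]


def propagate(gene, fold):
    """Steady-state fold changes after `gene` is changed by `fold`,
    assuming log-expression is linear in parental log-expression."""
    log_change = {g: 0.0 for g in ORDER}
    log_change[gene] = math.log(fold)
    for child in ORDER:
        if child == gene:
            continue  # the perturbed gene is held at its imposed level
        log_change[child] = sum(beta * log_change[parent]
                                for (parent, c), beta in LINKS.items()
                                if c == child)
    return {g: math.exp(v) for g, v in log_change.items()}


doubling = propagate("recA", 2.0)
print(f"recF: {doubling['recF']:.2f}-fold")
```

For a doubling of recA this gives roughly a 1.3-fold change in recF through the direct linkage, matching the worked example to rounding. Genes such as umuDC receive contributions along several paths (directly, through recF, and through lexA), which the loop sums on the log scale.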

Illustrating that one can achieve a network from a given data set does not assess the reliability of a new algorithm. A better method is to evaluate the probability of choosing the correct network using data from a known network. Monte Carlo simulation was used to generate 100,000 artificial gene expression arrays from the network in Figure 1 using four different sets of model parameters as defined in Table 2. When the algorithm is applied to these data, the resulting optimal network is identical to the network shown in Figure 1 in all four cases. This illustrates that the algorithm is consistent for extremely large data sets. To assess the behavior of the algorithm for small samples, the four sets of 100,000 artificial arrays were subdivided into 1,000 data sets of 100 arrays, 2,000 data sets of 50 arrays, 4,000 data sets of 25 arrays, and 10,000 data sets of 10 arrays. For each data set, the algorithm was applied and an optimal network chosen; the results appear in Table 2.
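The subdivision of the 100,000 simulated arrays into smaller data sets can be sketched as a simple chunking step. The sketch assumes consecutive, non-overlapping blocks, which the text does not specify; the function name split_into_datasets is hypothetical and integers stand in for the simulated arrays.

```python
def split_into_datasets(arrays, size):
    """Split a list of arrays into consecutive, non-overlapping data sets."""
    if len(arrays) % size:
        raise ValueError("size must divide the number of arrays")
    return [arrays[i:i + size] for i in range(0, len(arrays), size)]


arrays = list(range(100_000))  # stand-ins for simulated expression arrays
counts = {size: len(split_into_datasets(arrays, size))
          for size in (100, 50, 25, 10)}
```

The resulting counts reproduce the 1,000, 2,000, 4,000, and 10,000 data sets listed in the text.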

[FIGURE 1 OMITTED]

There are 543 possible acyclic networks that can arise from a combination of four genes. Table 2 summarizes the frequency (from 543 total networks) seen for various network structures (column 3 is the correct structure). For example, with 100 arrays in the sample, the correct network is chosen 922/1,000 = 92% of the time for parameter set A (row 1 of Table 2). Generally, with 100 replicate arrays, the search algorithm is better than 92% effective in finding the right network. The most common error in finding a network for this sample size is to add an additional linkage between gene 2 and gene 4 (column 8 in Table 2, 1-8%). When the sample size is halved to 50 arrays, accuracy drops to between 86 and 93%, with the same additional linkage being the most common mistake (2-9%). With only 25 arrays, accuracy is still between 70 and 80%, with most of the errors occurring for the same additional linkage (4-8%), single deletions of linkages (3-4%), or reversals of individual linkages (2-3%). Replicate samples consisting of just 10 arrays surprisingly find the correct network 32-38% of the time, with 30-40% of the errors being additional linkages, single linkage removal, or single linkage reversals. The simulations suggest the algorithm generally detects networks having very close topologies to the correct one even if the sample number is severely diminished.
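The figure of 543 acyclic networks on four genes can be confirmed by brute force: enumerate every subset of the 12 possible directed linkages and keep those without a cycle. The sketch below is only a check of that count, not part of the TAO-Gen search; the function names are hypothetical.

```python
from itertools import combinations, permutations


def is_acyclic(edges, n):
    """Repeatedly strip genes that have no remaining parents; a cycle
    exists if genes remain but none of them can be stripped."""
    remaining = set(range(n))
    while remaining:
        roots = {g for g in remaining
                 if not any(c == g and p in remaining for p, c in edges)}
        if not roots:
            return False
        remaining -= roots
    return True


def count_acyclic_networks(n):
    """Count directed acyclic networks on n labeled genes by enumeration."""
    possible = list(permutations(range(n), 2))  # ordered (parent, child) pairs
    total = 0
    for k in range(len(possible) + 1):
        for edges in combinations(possible, k):
            total += is_acyclic(edges, n)
    return total


four_gene_count = count_acyclic_networks(4)
```

Enumeration is feasible here because four genes give only 4,096 candidate edge sets; the number of candidates grows far too quickly for this approach to scale to larger networks.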

As noted in "Materials and Methods," the algorithm being used to find the best network is intended as an approximation for using the posterior likelihood to identify the best network. In the last four columns of Table 2, the correct network has the best posterior likelihood in every case for which it is the optimal network. In addition, the algorithm works well at placing the correct network into the top three networks, ranging from about 99% for samples involving 100 arrays to 58% for samples consisting of 10 arrays. These simulations suggest that the best directed acyclic network does not necessarily mean that all the links are real or that they are causal. Conversely, they do suggest that the limitations inherent to small sample sizes could be reduced by considering not only the best network, but several of the best networks and using other resources, such as knowledge of the existing pathways, to decide which makes the most sense.
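The top-three percentages can be recomputed from the rank columns of Table 2. The sketch below copies the counts for ranks 1, 2, and 3 at the largest and smallest sample sizes and sums them; the names RANKS, N_DATASETS and top_three_percent are hypothetical.

```python
# Counts of data sets in which the true network ranked 1st, 2nd, and 3rd by
# posterior likelihood (Table 2), keyed by (arrays per data set, parameter set).
RANKS = {
    (100, "A"): (922, 52, 10), (100, "B"): (977, 17, 4),
    (100, "C"): (929, 50, 8), (100, "D"): (980, 13, 5),
    (10, "A"): (3198, 1027, 781), (10, "B"): (3768, 966, 821),
    (10, "C"): (3177, 1232, 769), (10, "D"): (3768, 1146, 871),
}
N_DATASETS = {100: 1000, 10: 10000}


def top_three_percent(arrays, param_set):
    """Percentage of data sets placing the true network in the top three."""
    return 100.0 * sum(RANKS[(arrays, param_set)]) / N_DATASETS[arrays]


for key in sorted(RANKS):
    print(key, f"{top_three_percent(*key):.1f}%")
```

By this calculation the top-three rate is between roughly 98% and 100% with 100 arrays and between roughly 50% and 58% with 10 arrays, depending on the parameter set.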

These results were expanded to look at an eight-gene network, effectively a combination of two four-gene networks similar to that in Figure 1, where gene 2 activates gene 5 and gene 3 activates gene 8 (Figure 4). In this case it is computationally impossible to conduct the exhaustive search as in the four-gene case because the number of acyclic networks is approximately 7.8 x 10^11. Instead, 1,000 data sets were randomly generated for each sample case (100, 50, 25, 10) and the TAO-Gen algorithm was applied to identify a best network for each data set. Table 3 shows the numbers of connections detected by the algorithm, where the rows and columns correspond to parent and child genes, respectively. For example, the algorithm detected the incorrect path from gene 1 to gene 2 only three times in 1,000 data sets with 100 samples. The entries marked (a) show the true connections. For 100 replicate samples (microarrays), the TAO-Gen algorithm identified the correct network in 95% of the cases. As before, the deviations from the correct model were all cases of adding an additional linkage or removing a single linkage. As the sample size dropped to 50, 25, and 10, the correct network was identified 76, 30, and 1% of the time, respectively. Even though the performance in finding the fully correct network became poor, the linkages in the correct network were generally properly identified with high frequency, again indicating that the cases where the network was incorrect generally involved single or double alterations in the pathways of the network. The simulation using eight genes accentuates the importance of study design and prior knowledge about gene linkages in trying to find the best network to explain the data.
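The size of the search space for larger networks can be computed without enumeration using Robinson's recurrence for the number of labeled directed acyclic graphs. The recurrence is a standard combinatorial result and is not taken from the article; the function name count_dags is hypothetical.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled directed acyclic graphs on n nodes
    (Robinson's recurrence, by inclusion-exclusion on parentless nodes)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
               for k in range(1, n + 1))


for genes in (4, 8):
    print(genes, count_dags(genes))
```

The recurrence returns 543 for four genes and 783,702,329,343 for eight genes, which shows why an exhaustive search that is trivial for four genes is out of reach for eight.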

[FIGURE 4 OMITTED]

Many issues remain to be studied. It is unclear whether the TAO-Gen algorithm works better or worse than other algorithms in identifying gene interaction networks. The main problem arises because other algorithms have not used computer simulations to examine model specificity to directly address this issue. Also, the use of acyclic models to develop gene interaction networks is somewhat limited. A fully dynamic model using time-dependent differential equations could be used with the TAO-Gen algorithm provided data at multiple time points were available; the method would simply need to link models across time as suggested elsewhere (Toyoshiba et al. 2004) or use dynamic Bayesian networks. Here we assume samples are independent; in time-course data, that would not necessarily be the case and the error structure between samples would need to be altered (in Equation 4 and subsequent derivations) to account for the longitudinal nature of such data. In any case the analysis would certainly require more data than are generally available. Perhaps the biggest advantage of using a Bayesian-linked analysis algorithm would occur when prior knowledge, based on known biologic linkages such as those derived from bioinformatic evaluations of transcription sequences, is used to limit the range of networks to be explored. The TAO-Gen algorithm could work in these situations but would need to be modified to use a prior different from the uniform prior used in this case.

Conclusion

In this article we have presented the TAO-Gen algorithm for identifying gene interaction networks. The algorithm was applied to data on the SOS pathway in E. coli to identify gene linkages. The resulting network is shown to be superior, both biologically and statistically, to a network derived by the NIR algorithm of Gardner et al. (2003). Unlike the NIR algorithm, this algorithm identified a statistically significant role of recA in controlling the SOS pathway; the linkages from recA in the NIR-derived network were generally not significant. To demonstrate the accuracy of the algorithm for varying sample sizes, a simulation study was performed. It was found that for moderate-size networks the algorithm performs accurately, with most errors being minor additions or deletions of a single linkage. However, the simulations do suggest that sample sizes need to be increased if large networks are to be identified and quantified using gene expression data.

Table 1. Estimated means, standard deviations and percentage below 0
for all interactions in SOS response genes for E. coli identified as
linked by the TAO-Gen algorithm (see Figure 2).

From              To       Type      Mean     SD     % < 0

recA             lexA    Activate    0.435   0.065    0.00
                 ssb     Activate    0.137   0.056    0.99
                 recF    Activate    0.393   0.161    0.93
                 umuDC   Activate    0.365   0.129    0.42
                 rpoD    Repress    -0.356   0.091   99.97
                 rpoH    Activate    0.193   0.093    2.06
lexA             ssb     Repress    -0.158   0.065   98.86
                 dinI    Repress    -0.287   0.156   96.61
                 umuDC   Repress    -0.550   0.169   99.85
ssb              rpoD    Repress    -0.077   0.029   99.46
recF             umuDC   Activate    0.512   0.204    0.81
rpoH             ssb     Activate    0.031   0.012    0.55
rpoS             rpoD    Activate    0.496   0.108    0.02
Plasmid insert   recA    Activate    0.458   0.080    0.00
                 lexA    Activate    0.396   0.041    0.00
                 ssb     Activate    2.443   0.039    0.00
                 recF    Activate    0.062   0.130   30.95
                 dinI    Activate    1.188   0.110    0.00
                 umuDC   Activate    1.007   0.093    0.00
                 rpoD    Activate    1.409   0.069    0.00
                 rpoH    Activate    3.319   0.074    0.00
                 rpoS    Activate    0.513   0.100    0.00
MMC              recA    Activate    0.979   0.282    0.06
                 ssb     Activate    0.479   0.108    0.05
                 recF    Activate    0.637   0.345    3.28
                 dinI    Activate    0.896   0.282    0.07
                 umuDC   Activate    0.969   0.252    0.05
                 rpoD    Activate    0.460   0.221    2.12
                 rpoH    Activate    1.233   0.204    0.00
                 rpoS    Activate    1.255   0.248    0.00

Table 2. Results from 100,000 Monte Carlo simulations of four
hypothetical four-gene networks (A, B, C, D) (a) describing the
ability of the TAO-Gen algorithm to specify the correct network.

                          Frequency (%) of resulting optimal
                                  network structure
Samples      True
size         model     [??]        [??]        [??]      [??]

100 arrays     A      922 (92)      0 (0)        0 (0)     0 (0)
1,000 sims     B      977 (98)      0 (0)        0 (0)     0 (0)
               C      929 (93)      0 (0)        0 (0)     0 (0)
               D      980 (98)      0 (0)        0 (0)     0 (0)

50 arrays      A    1,716 (86)      4 (0.2)      3 (0.2)   6 (0.3)
2,000 sims     B    1,841 (92)      8 (0.4)      0 (0)     4 (0.2)
               C    1,745 (87)      6 (0.3)      4 (0.2)   3 (0.2)
               D    1,860 (93)      4 (0.2)      0 (0)     2 (0.1)

25 arrays      A    2,920 (73)     76 (2)       72 (2)    56 (1)
4,000 sims     B    3,179 (80)     92 (2)       55 (1)    48 (1)
               C    2,891 (72)     60 (1)      100 (2)    56 (1)
               D    3,086 (77)     76 (2)       96 (2)    48 (1)

10 arrays      A    3,198 (32)    909 (9)      741 (7)   230 (2)
10,000 sims    B    3,768 (38)  1,002 (10)   1,051 (10)  220 (2)
               C    3,177 (32)    892 (9)      691 (7)   230 (2)
               D    3,768 (38)  1,052 (10)   1,031 (10)  280 (3)

                    Frequency (%) of resulting
                    optimal network structure
Samples      True
size         model    [??]      [??]     [??]

100 arrays     A      0 (0)     68 (7)  0 (0)
1,000 sims     B      0 (0)      6 (1)  0 (0)
               C      0 (0)     71 (7)  0 (0)
               D      0 (0)      6 (1)  0 (0)

50 arrays      A      4 (0.2)  165 (8)  0 (0)
2,000 sims     B      8 (0.4)   41 (2)  0 (0)
               C      6 (0.3)  175 (9)  0 (0)
               D      0 (0)     46 (2)  0 (0)

25 arrays      A     77 (2)    328 (8)  3 (0.1)
4,000 sims     B     47 (1)    192 (5)  8 (0.2)
               C     76 (2)    296 (7)  4 (0.1)
               D     48 (1)    164 (4)  8 (0.2)

10 arrays      A    149 (2)    328 (3)  497 (5)
10,000 sims    B    309 (3)    378 (4)  567 (6)
               C    151 (2)    398 (4)  457 (5)
               D    259 (3)    538 (5)  477 (5)

                     Rank (%) of the posterior likelihood for
                      the true network over all possible 543
                                acyclic networks
Samples      True
size         model      1           2          3        4-10

100 arrays     A      922 (92)     52 (5)    10 (1)       16 (2)
1,000 sims     B      977 (98)     17 (2)     4 (0.4)      2 (0.2)
               C      929 (93)     50 (5)     8 (1)       13 (1)
               D      980 (98)     13 (1)     5 (0.5)      2 (0.2)

50 arrays      A    1,716 (87)    157 (8)    34 (2)       70 (4)
2,000 sims     B    1,841 (92)     82 (4)    20 (1)       55 (3)
               C    1,745 (88)    128 (6)    41 (2)       62 (3)
               D    1,860 (93)     68 (3)    30 (2)       42 (2)

25 arrays      A    2,920 (73)    423 (10)  112 (3)      387 (10)
4,000 sims     B    3,179 (79)    348 (9)   133 (3)      249 (6)
               C    2,891 (72)    404 (10)  114 (3)      444 (11)
               D    3,086 (77)    328 (8)   149 (4)      365 (9)

10 arrays      A    3,198 (32)  1,027 (10)  781 (8)    2,389 (24)
10,000 sims    B    3,768 (38)    966 (10)  821 (8)    2,519 (25)
               C    3,177 (32)  1,232 (12)  769 (8)    2,347 (23)
               D    3,768 (38)  1,146 (11)  871 (9)    2,371 (24)

(a) (A) [[beta].sub.14] = 2.0, [[beta].sub.13] = 0.8, [[beta].sub.23]
= 0.8, [[beta].sub.34] = -1.3, [[sigma].sub.1] = [[sigma].sub.2]
= [[sigma].sub.3] = [[sigma].sub.4] = 1.0

(B) [[beta].sub.14] = 2.0, [[beta].sub.13] = 0.8, [[beta].sub.23] =
0.8, [[beta].sub.34] = -5.0, [[sigma].sub.1] = [[sigma].sub.2] =
[[sigma].sub.3] = [[sigma].sub.4] = 1.0

(C) [[beta].sub.14] = 2.0, [[beta].sub.13] = 0.8, [[beta].sub.23] =
0.8, [[beta].sub.34] = -1.3, [[sigma].sub.1] = [[sigma].sub.2] =
[[sigma].sub.3] = [[sigma].sub.4] = 1/3

(D) [[beta].sub.14] = 2.0, [[beta].sub.13] = 0.8, [[beta].sub.23] =
0.8, [[beta].sub.34] = -5.0, [[sigma].sub.1] = [[sigma].sub.2] =
[[sigma].sub.3] = [[sigma].sub.4] = 1/3

Table 3. Number (percent) of linkages between two genes identified by
the TAO-Gen algorithm in 1,000 Monte Carlo simulations of the
hypothetical eight-gene network shown in Figure 4.

            From            To cell number
            gene
           number     1           2             3

100 Chips    1         --      3 (0.3)   1,000 (100) (a)
             2      0 (0)          --      999 (99.9) (a)
             3      0 (0)      1 (0.1)            --
             4      0 (0)      0 (0)         0 (0)
             5      0 (0)      0 (0)         0 (0)
             6      2 (0.2)    0 (0)         2 (0.2)
             7      0 (0)      0 (0)         0 (0)
             8      0 (0)      0 (0)         0 (0)

50 Chips     1          --     4 (0.4)     980 (98) (a)
             2      8 (0.8)        --      977 (97.7) (a)
             3     14 (1.4)    2 (0.2)            --
             4      0 (0)      0 (0)         5 (0.5)
             5      2 (0.2)    9 (0.9)      14 (1.4)
             6     10 (1)      4 (0.4)      15 (1.5)
             7      1 (0.1)    0 (0)         0 (0)
             8      0 (0)      0 (0)         0 (0)

25 Chips     1          --    33 (3.3)     832 (83.2) (a)
             2     20 (2)          --      751 (75.1) (a)
             3     37 (3.7)   46 (4.6)            --
             4      1 (0.1)    0 (0)        63 (6.3)
             5      5 (0.5)   50 (5)        59 (5.9)
             6      9 (0.9)   10 (1)        19 (1.9)
             7      2 (0.2)    0 (0)        21 (2.1)
             8      2 (0.2)    0 (0)        13 (1.3)

10 Chips     1          --    51 (5.1)     516 (51.6) (a)
             2     49 (4.9)        --      335 (33.5) (a)
             3     73 (7.3)   84 (8.4)            --
             4     23 (2.3)   15 (1.5)     227 (22.7)
             5     16 (1.6)  106 (10.6)     79 (7.9)
             6     35 (3.5)   30 (3)        73 (7.3)
             7      9 (0.9)   18 (1.8)      74 (7.4)
             8      3 (0.3)    2 (0.2)      68 (6.8)

            From                 To cell number
            gene
           number         4                 5             6

100 Chips    1     1,000 (100) (a)       4 (0.4)        1 (0.1)
             2         9 (0.9)       1,000 (100) (a)    1 (0.1)
             3     1,000 (100) (a)       0 (0)          0 (0)
             4                  --       0 (0)          0 (0)
             5         3 (0.3)                    --    0 (0)
             6         2 (0.2)           2 (0.2)            --
             7         1 (0.1)           0 (0)          0 (0)
             8         0 (0)             0 (0)          0 (0)

50 Chips     1     1,000 (100) (a)      23 (2.3)       11 (1.1)
             2        19 (1.9)         989 (98.9) (a)   6 (0.6)
             3       995 (99.5) (a)      3 (0.3)        3 (0.3)
             4              --           0 (0)          0 (0)
             5         7 (0.7)                --        4 (0.4)
             6        13 (1.3)          15 (1.5)            --
             7         7 (0.7)           7 (0.7)        2 (0.2)
             8         5 (0.5)           0 (0)          0 (0)

25 Chips     1       960 (96) (a)       26 (2.6)       18 (1.8)
             2        63 (6.3)         912 (91.2) (a)  14 (1.4)
             3       933 (93.3) (a)     10 (1)          5 (0.5)
             4              --           2 (0.2)        0 (0)
             5        34 (3.4)               --         9 (0.9)
             6        38 (3.8)          64 (6.4)            --
             7        24 (2.4)          60 (6)         19 (1.9)
             8         9 (0.9)           0 (0)          0 (0)

10 Chips     1       702 (70.2) (a)     63 (6.3)       30 (3)
             2       155 (15.5)        590 (59) (a)    35 (3.5)
             3       596 (59.6) (a)     67 (6.7)       16 (1.6)
             4              --          11 (1.1)        8 (0.8)
             5        87 (8.7)               --        33 (3.3)
             6        93 (9.3)          95 (9.5)            --
             7        79 (7.9)         168 (16.8)      51 (5.1)
             8        51 (5.1)          24 (2.4)        8 (0.8)

            From          To cell number
            gene
           number        7                 8

100 Chips    1         4 (0.4)           5 (0.5)
             2         3 (0.3)           7 (0.7)
             3         0 (0)         1,000 (100) (a)
             4         0 (0)             0 (0)
             5     1,000 (100) (a)     999 (99.9) (a)
             6     1,000 (100) (a)       8 (0.8)
             7             --        1,000 (100) (a)
             8         0 (0)                 --

50 Chips     1        23 (2.3)           8 (0.8)
             2        13 (1.3)          24 (2.4)
             3         9 (0.9)       1,000 (100) (a)
             4         1 (0.1)           0 (0)
             5       991 (99.1) (a)    973 (97.3) (a)
             6       989 (98.9) (a)     11 (1.1)
             7              --         998 (99.8) (a)
             8         2 (0.2)                --

25 Chips     1        26 (2.6)          50 (5)
             2        57 (5.7)          94 (9.4)
             3        46 (4.6)         962 (96.2) (a)
             4         2 (0.2)          11 (1.1)
             5       905 (90.5) (a)    811 (81.1) (a)
             6       857 (85.7) (a)     69 (6.9)
             7              --         964 (96.4) (a)
             8        33 (3.3)                --

10 Chips     1        73 (7.3)         141 (14.1)
             2       171 (17.1)        166 (16.6)
             3       126 (12.6)        641 (64.1) (a)
             4        22 (2.2)          71 (7.1)
             5       519 (51.9) (a)    375 (37.5) (a)
             6       408 (40.8) (a)    187 (18.7)
             7              --         693 (69.3) (a)
             8       135 (13.5)               --

(a) Linkage that exists in the original simulated model.
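
The percentages in parentheses can be reproduced from the raw counts. The sketch below does this for one column of the table (linkages to gene 7 in the 10-chip block) and separates linkages flagged (a) from spurious ones. It assumes 1,000 simulated data sets per block, which is inferred from entries such as "1,000 (100)" and not stated in this section; the names `N_REPLICATES`, `percent`, and `summarize` are illustrative only.

```python
# Counts copied from the "To gene 7" column of the 10-chip block above.
# None marks the diagonal entry ("--"), where no linkage is defined.
N_REPLICATES = 1000  # assumption: inferred from entries such as "1,000 (100)"

counts_to_gene_7 = {1: 73, 2: 171, 3: 126, 4: 22, 5: 519, 6: 408, 7: None, 8: 135}
true_parents = {5, 6}  # rows flagged (a) in this column


def percent(count, n=N_REPLICATES):
    """Convert a detection count into a percentage of simulated data sets."""
    return round(100.0 * count / n, 1)


def summarize(counts, true_links):
    """Split one column into detected true linkages and spurious linkages."""
    detected, spurious = {}, {}
    for gene, count in counts.items():
        if count is None:  # skip the diagonal
            continue
        target = detected if gene in true_links else spurious
        target[gene] = percent(count)
    return detected, spurious


detected, spurious = summarize(counts_to_gene_7, true_parents)
print("true linkages recovered (%):", detected)
print("largest spurious rate (%):", max(spurious.values()))
```

With only 10 chips, the two true linkages into gene 7 are each recovered in roughly half of the simulations or fewer, while the most frequent spurious linkage (from gene 2) appears in about one simulation in six.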


REFERENCES

Akutsu T, Miyano S, Kuhara S. 2000. Algorithms for inferring qualitative models of biological networks. Pac Symp Biocomput 293 304.

Eisen MB, Spellman PT, Brown PO and DB. 1995. Cluster analysis Cluster analysis

A statistical technique that identifies clusters of stocks whose returns are highly correlated within each cluster and relatively uncorrelated across clusters. Cluster analysis has identified groupings such as growth, cyclical, stable, and energy stocks.
 and display of genome-wide expression pattens. Proc Natl Acad Sci USA 25:14863-14868.

Friedman N, Linial M, Nachman I, Pe'er D. 2000. Using Bayesian networks to analyze expression data. J Comput Biol 7:601-620.

Gardner TS, di Bernardo D, Lorenz D, Collins JJ. 2003. Inferring genetic networks and identifying compound mode of action via expression profiling Microarray technology is often used for gene expression profiling. It makes use of the sequence resources created by the genome sequencing projects and other sequencing efforts to answer the question, . Science 301:102-115.

Gelman A, Carlin car·line or car·lin  
n. Scots
A woman, especially an old one.



[Middle English kerling, from Old Norse, from karl, man.]
 J, Stern H, Rubin D. 1995. Bayesian Data Analysis. London: Chapman & Ball.

Gerard F, Dri AM, Moreau PL. 1999. Role of Escherichia coil RpoS, LexA and H-NS global regulators in metabolism and survival under aerobic aerobic /aer·o·bic/ (ar-o´bik)
1. having molecular oxygen present.

2. growing, living, or occurring in the presence of molecular oxygen.

3. requiring oxygen for respiration.

4.
, phosphate-starvation conditions. Microbiology microbiology: see biology.
microbiology

Scientific study of microorganisms, a diverse group of simple life-forms including protozoans, algae, molds, bacteria, and viruses.
 145:1547-1562.

Gill RT, Valdes J J, Bentley WE. 2000. A comparative study of global stress gene regulation in response to overexpression of recombinant proteins Since human recombinants have replaced the animal version in human therapeutics, the prefix of "rh" for "human recombinant" appears less and less in the literature Human recombinants that replaced animal or harvested from human types
 in Escherichia coli Escherichia coli (ĕsh'ərĭk`ēə kō`lī), common bacterium that normally inhabits the intestinal tracts of humans and animals, but can cause infection in other parts of the body, especially the urinary tract. . Metab Eng 2:178-189.

Hartemink A, Gifford D, Jaakkola T, Young R. 2002. Bayesian methods for elucidating genetic regulatory networks. IEEE (Institute of Electrical and Electronics Engineers, New York, www.ieee.org) A membership organization that includes engineers, scientists and students in electronics and allied fields.  Intell Sys 17:37-43.

Hastings WK. 1970. Monte Carlo Monte Carlo (môNtā` kärlō`), town (1982 pop. 13,150), principality of Monaco, on the Mediterranean Sea and the French Riviera.  sampling methods using Markov chains (probability) Markov chain - (Named after Andrei Markov) A model of sequences of events where the probability of an event occurring depends upon the fact that a preceding event occurred.

A Markov process is governed by a Markov chain.
 and their applications. Biometrika 57:97-109.

Johnson C, Balagurunathan Y, Mahlet T, Falahatpisheh H, Brun M, Walker M, et al. 2004. Unraveling gene-gene interactions regulated by ligands of the aryl hydrocarbon receptor The Aryl hydrocarbon receptor (AhR) is member of the family of basic-helix-loop-helix transcription factors. AhR is a cytosolic transcription factor that is normally inactive, bound to several co-chaperones. . Environ Health Perspect 112:403-412.

Kerr MK, Martin M, Churchill GA. 2000. Analysis of variance for gene expression microarray data. J Comput Biol 7:819-837.

Kikuchi S, Tominaga D, Arita M, Takahashi K, Tomita M. 2003. Dynamic modeling of genetic networks using genetic algorithm and S-system. Bioinformatics 19:643-650.

Pesch B, Bruning T, Frentzel-Beyme R, Johnen G, Harth V, Hoffmann W, et al 2004. Challenges to environmental toxicology toxicology, study of poisons, or toxins, from the standpoint of detection, isolation, identification, and determination of their effects on the human body. Toxicology may be considered the branch of pharmacology devoted to the study of the poisonous effects of drugs.  and epidemiology: where do we stand and which way do we go? Toxicol Lett 151:255-266.

Pilpel Y, Sudarsanam P, Church GM. 2001. Identifying regulatory networks by combinatorial analysis of promoter elements. Nat Genet genet: see civet.  29:153-159.

Press WH, Brian BP, Teukolsky SA, Vetterling WT. 1989. Numerical Recipes--The Art of Scientific Computing (FORTRAN Version). New York New York, state, United States
New York, Middle Atlantic state of the United States. It is bordered by Vermont, Massachusetts, Connecticut, and the Atlantic Ocean (E), New Jersey and Pennsylvania (S), Lakes Erie and Ontario and the Canadian province of
: Cambridge University Press Cambridge University Press (known colloquially as CUP) is a publisher given a Royal Charter by Henry VIII in 1534, and one of the two privileged presses (the other being Oxford University Press). .

Schonwalder C, Olden K. 2003. Environmental health moves into the 21st century. Int J Hyg Environ Health 206:263-267.

Simmons PT, Portier CJ. 2002. Toxicogenomics: the new frontier New Frontier

President John F. Kennedy’s legislative program, encompassing such areas as civil rights, the economy, and foreign relations. [Am. Hist.: WB, K:212]

See : Aid, Governmental
 in risk analysis. Carcinogenesis car·ci·no·gen·e·sis
n.
The production of cancer.



carcinogenesis

production of cancer.


biological carcinogenesis
viruses and some parasites are capable of initiating neoplasia.
 23:903-905.

Suter L, Babiss LE, Wheeldon EB. 2004. Toxicogenomics in predictive toxicology in drug development. Chem Biol 11:161-171.

Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, et al. 1999. Interpreting patterns of gene expression with self-organizing maps This article appears to contradict another article. Please see discussion on the linked talk page.
A self-organizing map (SOM) is a type of artificial neural network that is trained using unsupervised learning to produce low-dimensional representation of the training
: methods and application to hematopoietic hematopoietic /he·ma·to·poi·et·ic/ (-poi-et´ik)
1. pertaining to hematopoiesis.

2. an agent that promotes hematopoiesis.


hematopoietic

1. pertaining to or affecting the formation of blood cells.
 differentiation. Proc Natl Acad Sci USA 96:2907-2912.

Toraason M, Albertini R, Bayard S Bayard, horse, in chivalric romance
Bayard (bā`ərd), Ital. Baiardo (bäyär`dō), in chivalric romance, a bay horse, remarkable for his spirit and for his unique ability to fit his size to his rider.
, Bigbee W, Blair A, Boffetta P, et al. 2004. Applying new biotechnologies to the study of occupational cancer--a workshop summary. Environ Health Perspect 112:413-416.

Toyoshiba H, Yamanaka T, Sone sone  
n.
A subjective unit of loudness, as perceived by a person with normal hearing, equal to the loudness of a pure tone having a frequency of 1,000 hertz at 40 decibels.
 H, Parham F, Walker N, Martinez J, et ah 2004. Gene interaction network suggests dioxin dioxin

Aromatic compound, any of a group of contaminants produced in making herbicides (e.g., Agent Orange), disinfectants, and other agents. Their basic chemical structure consists of two benzene rings connected by a pair of oxygen atoms; when substituents on the rings are
 induces a significant linkage between Ah-receptor and retinoic acid receptor The retinoic acid receptor (RAR) is a type of nuclear receptor[1] which is activated by both all-trans retinoic acid and 9-cis retinoic acid.[2] There are three retinoic acid receptors (RAR), RAR-alpha, RAR-beta, and RAR-gamma encoded by the RARA  beta. Environ Health Perspect 112:1217-1224.

Voit EO, Radivoyevitch T. 2000. Biochemical systems analysis of genome-wide expression data. Bioinformatics 16:1023-1037.

Waters MD, Selkirk JK, Olden K, 2003. The impact of new technologies on human population studies. Mutat Res 544:349-360.

Address correspondence to C. J. Portier, Laboratory of Computational Biology and Risk Analysis, NIEHS, P.O. Box 12233, MD A3-06, 111 Alexander Dr., Research Triangle Park, NC 27709. Telephone: (919) 541-3802. Fax: (919) 541-3647. E-mail: portier@niehs.nih.gov

We would like to thank T. Darden, S. Rod, and N. Walker for helpful comments and advice.

The authors declare they have no competing financial interests. Received 19 March 2004; accepted 21 July 2004.
COPYRIGHT 2004 National Institute of Environmental Health Sciences
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2004, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Title Annotation:Toxicogenomics
Author:Portier, Christopher J.
Publication:Environmental Health Perspectives
Date:Nov 15, 2004
Words:8672