Cross-site comparison of gene expression data reveals high similarity.


Consistency and coherence of gene expression data across multiple sites depend on several factors such as platform (oligo, cDNA, etc.), environmental conditions at each laboratory, and data quality. The Hepatotoxicity Working Group of the International Life Sciences Institute Health and Environmental Sciences Institute consortium on the application of genomics to mechanism-based risk assessment is investigating these factors by comparing high-density gene expression data sets generated on two sets of RNA from methapyrilene (MP) experiments conducted at Abbott Laboratories and Boehringer-Ingelheim Pharmaceuticals, Inc. using a single platform (Affymetrix Rat Genome U34A GeneChip) at seven different sites. This article focuses on the evaluation of data quality and statistical models that facilitate the comparison of such data sets at the probe level. We present methods for exploring and quantitatively assessing differences in the data, with the principal goal being the generation of lists of site-insensitive genes responsive to low and high doses of MP. A combination of numerical and graphical techniques reveals important patterns and partitions of variability in the data, including the magnitude of the site effects. Although the site effects are significantly large in the analysis results, they appear to be primarily additive and therefore can be adjusted in the statistical calculations in a way that does not bias conclusions regarding treatment differences. Key words: cross-site comparison, gene expression, hepatotoxicity, ILSI, toxicogenomics. Environ Health Perspect 112:449-455 (2004). doi:10.1289/txg.6787 available via http://dx.doi.org/ [Online 15 January 2004]

**********

Advancing the general knowledge base of mechanisms and markers of hepatotoxicity is of great interest to all parties involved in this consortium. The Hepatotoxicity Working Group of the International Life Sciences Institute (ILSI) Health and Environmental Sciences Institute (HESI) Committee on the Application of Genomics to Mechanism-Based Risk Assessment is investigating these factors by comparing high-density gene expression data sets generated on two sets of RNA from two independent in vivo experiments where rats were dosed with methapyrilene (MP) conducted at either Abbott Laboratories (site A; Abbott Park, IL) or Boehringer-Ingelheim Pharmaceuticals, Inc. (BIPI; site B; Ridgefield, CT).

Most microarray studies are designed with large "p" (number of genes) and small "n" (number of arrays) characteristics. Two issues of concern arise when investigators work with data having such characteristics. The first issue is of statistical inference power where the aim is to minimize both false-positive and false-negative rates. Increasing sample size (i.e., n) can afford better statistical inference power; however, this remedy is often cost prohibitive. The second issue arises in the attempt to address the first concern by increasing sample size by incorporating data sets generated at disparate sites and times. Thus, the second concern is about the consistency of such data sets generated across multiple sites and whether the same or similar conclusions can be drawn. An across-site microarray study can be useful for addressing this issue, which is another way to increase sample size. Conceptually, the complexities among data generated across different sites are higher than those of data generated within one site. The above two issues are related; however, the second one may be more general and of increasing concern as microarray data sets become increasingly available and the desire to compare and contrast across studies increases.

Male Sprague-Dawley rats [CRL:CD(SD)IGS VAF/Plus] (Charles River Laboratories, Kingston, NY) approximately 6-7 weeks of age were assigned to nine study groups (four rats/group) and dosed by gavage for 1, 3, or 7 days with water (vehicle), 10 mg/kg/day MP, or 100 mg/kg/day MP (Figure 1). Dose selection was based on published and unpublished studies; the high dose of MP was chosen to yield hepatotoxicity, and a nontoxic low dose was selected. In general, the BIPI study yielded more hepatotoxicity than the study conducted at Abbott Laboratories, as defined by clinical pathology parameters and microscopic examinations of hematoxylin and eosin-stained liver sections. No significant histopathological alterations were observed in livers of rats treated with 10 mg/kg/day MP at the 1- and 3-day time points compared with alterations in livers of the control groups. In comparison, MP treatment with 10 mg/kg/day for 7 days resulted in minimal portal mononuclear infiltrates, minimal hepatocellular periportal necrosis, and minimal microvesicular hepatocellular vacuolization. At the 100-mg/kg/day dose, all rats showed early minimal mononuclear portal infiltrates, minimal hepatocellular periportal necrosis, and mild to moderate periportal microvesicular vacuolization at 1 and 3 days of exposure. The severity of the lesions increased at day 7, and mild hyperplasia became evident. In addition, in the 100-mg/kg/day MP dose group at 7 days of exposure, moderate mononuclear portal infiltrates were noted, and the number of enlarged periportal hepatocytes with microvesicular vacuolization increased. The severity of hepatocellular periportal necrosis at the 7-day time point also increased, accompanied by increased numbers of hepatocellular mitotic figures. Bile duct hyperplasia was observed in animals in the 100-mg/kg/day dose group at 3 and 7 days. Minimal bile duct hyperplasia was seen at the 3-day time point and increased in severity to mild by 7 days. Levels of alanine aminotransferase, aspartate aminotransferase, and sorbitol dehydrogenase increased in high-dose animals in a time-dependent manner. Total bilirubin tended to be elevated in the high-dose group with continued dosing. All the above parameters were reflective of liver toxicity.

[FIGURE 1 OMITTED]

An initial amount of 5-20 µg total RNA derived from livers of rats used in those studies was used for the synthesis of double-stranded cDNA with a commercially available kit (Superscript Choice System; Invitrogen Life Technologies, Carlsbad, CA, or Roche Molecular Biochemicals, Mannheim, Germany) in the presence of a T7-$(\mathrm{dT})_{24}$ DNA oligonucleotide primer. After synthesis, the cDNA was purified by phenol/chloroform/isoamyl alcohol extraction and ethanol precipitation. The purified cDNA was then transcribed in vitro (ENZO Life Sciences, Farmingdale, NY or Ambion Diagnostics, Austin, TX) in the presence of biotinylated ribonucleotides to form biotin-labeled cRNA. The labeled cRNA was then purified on an affinity resin [RNeasy, Qiagen (Valencia, CA)], quantified, and fragmented. Ten to 20 µg labeled cRNA was hybridized for approximately 16 hr at 45°C to an expression probe array. The array was then washed and stained with streptavidin-P-phycoerythrin (SAPE; Molecular Probes, Inc., Eugene, OR). The signal was amplified using a biotinylated goat antistreptavidin antibody (Vector Laboratories, Burlingame, CA), and the array received a final staining with SAPE. The GeneChip Fluidics Workstation 400 (Affymetrix, Inc., Santa Clara, CA) was used to stain the arrays. The array was then scanned twice using a confocal laser scanner (GeneArray Scanner 2500, Hewlett Packard, or Agilent, Foster City, CA), which resulted in one scanned image.

Many factors can contribute to the heterogeneity of data sets, including but not limited to differences in platform (oligo, cDNA, etc.), environmental conditions at each laboratory, and data quality. Meta-analysis performed by statistically integrating results from different data sets (Choi et al. 2003; Ghosh et al. 2003) is one solution for across-site microarray studies. When the experimental design settings are homogeneous across sites, pooling data sets for a comprehensive analysis provides direct comparisons across sites, with better statistical inference power. The major challenges to this end are how to normalize data sets and how to define and use variation across sites.

Five well-adapted normalization methods, including cyclic loess (modified from Dudoit et al. 2002), contrast-based method (Astrand, unpublished data), quantile normalization (Irizarry et al. 2003), the scaling method in Affymetrix MAS 5.0 (Affymetrix, Inc. 2002), and the nonlinear method (Li and Wong 2001; Schadt et al. 2001) are reviewed by Bolstad et al. (2003). Quantile normalization seemed slightly better than the others in the three types of comparisons performed. Quantile normalization makes the distribution of each chip the same by aggressively removing specific sources of variation (such as those associated with chip-to-chip differences) and artificially generating ideal distributions. Another less aggressive way to deal with this is to normalize the location and scaling of the distribution. We prefer the second approach and have implemented that in this article. In our opinion, quantile normalization has a potential drawback in that it can reduce the consistency of the expression profile of the same probe set across arrays. This consistent expression profile is one of the essential characteristics of the GeneChip probe data (Li and Wong 2001). Lowering this consistency may increase the variation in downstream statistical modeling. An interquartile range normalization is applied in this article toward the same goal as quantile normalization (i.e., to obtain consistent but not identical data distribution among chips). This approach makes data comparable across sites and preserves a certain level of site effects when combining the data. The factor of site can be easily adapted in an analysis of variance (ANOVA) model. In using variation components naturally involved in the data, the mixed-model approach provides flexibility (Chu et al. 2002) and robustness (Chu et al., in press) for this task. Tan and coworkers (Tan et al. 2003) conducted a similar comparison study that demonstrated mildly positive concordance between Affymetrix and Amersham (Piscataway, NJ) short oligo arrays and Agilent cDNA arrays.

Data Consistency and Normalization

RNA samples were analyzed independently by seven different Affymetrix platform sites using RGU34A expression probe arrays (Affymetrix) containing 8,799 probe sets interrogating primarily annotated genes, for a total of 99 chips. Each probe set consisted of 16 probes; thus, each chip resulted in 140,784 probes. The rat sequences used for the design of the RGU34A expression probe array were derived from the UniGene Database build #34 (created from Genbank 107/dbEST 11/18/98) and supplemented with additional annotated gene sequences from Genbank 110 (http://www.ncbi.nih.gov/GenBank). UniGene clusters are represented by a sample sequence that is the most complete and most 3' sequence in the cluster. The oligonucleotide probes are 25mers, and 16 probe pairs per sequence are used. The detection sensitivity is 1:100,000, measured by the detection in a comparative analysis between a complex RNA containing spiked control transcripts and a complex RNA with no spikes [Anonymous. GeneChip Rat Genome U34 Set data sheet (Affymetrix 2002)]; detection is quantitative over more than three orders of magnitude (Lockhart et al. 1996). Each site is identified as a user in the HESI consortium in Table 1. Sites 2, 3, 6, and 8 analyzed RNA samples from both in vivo studies performed at sites A and B, whereas the remaining sites analyzed only the RNA samples from site B. Each RNA set was analyzed using nine Affymetrix chips, one for each of the three dose levels and three time points, with the exception of site 3, where all nine chips were used in three technical replicates of the RNA samples from three doses from the third time point from the in vivo study performed at site A only. Data for perfect match probe intensities from CEL files were used for analysis. In this article we focus on the point that site effects exist but can be statistically accounted for and adjusted. It would be interesting to investigate whether using different normalization methods or outcome variables (such as perfect match and mismatch) results in different site-effect magnitude, but this was not a major goal of this article. Data used in this report may be further analyzed by interested scientists and can be accessed via the Internet (http://dir.niehs.nih.gov/microarray/ilsi-datasets/home.htm or the European Bioinformatics Institute ArrayExpress database at http://www.ebi.ac.uk/arrayexpress/).
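
The chip and probe totals quoted above follow directly from the study layout. The short Python sketch below only re-derives that arithmetic; the per-site chip counts are read off the description in the text (site 3 is taken to have run nine chips on site B RNA plus nine replicate chips on site A RNA), and the variable names are illustrative.

```python
# Chips run at each analysis site, as read from the text:
# sites 2, 6, and 8 ran both RNA sets (9 chips each, so 18);
# site 3 ran 9 chips on site B RNA plus 9 technical-replicate chips on site A RNA;
# sites 4, 5, and 7 ran only the site B RNA set (9 chips).
chips_per_site = {2: 18, 3: 18, 4: 9, 5: 9, 6: 18, 7: 9, 8: 18}

PROBE_SETS_PER_CHIP = 8799   # probe sets on the RGU34A array
PROBES_PER_SET = 16          # perfect match probes per probe set

total_chips = sum(chips_per_site.values())
probes_per_chip = PROBE_SETS_PER_CHIP * PROBES_PER_SET

print(total_chips)      # 99
print(probes_per_chip)  # 140784
```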

A $\log_2$ transformation was applied to all data before any analysis process. As a first attempt to inspect data consistency across sites, box plots of all chips were generated in Figure 2A for comparison of the distribution of each chip. As shown in Figure 2A, the within-site chip-to-chip variation is higher in sites 3 and 5. In addition, the range of data varies significantly from site to site. To further inspect the correlations among chips, a subgroup of 10 chips corresponding to control animals (the first level of dose factor, i.e., dosed with vehicle alone or the 0 mg/kg dose) at day 1 (the first level of time factor) was chosen to compute the interchip correlation coefficients. Figure 3A shows the scatterplots of the $\log_2$ perfect-match probe intensities among the 10 chips. The red ellipse within each plot indicates the 95% density curve based on a bivariate normal distribution (i.e., 95% of data is inside the ellipse). The pairwise correlation coefficients ranged from 0.92 to 0.98 except for those from site 3 (B01_3) or site 5 (B01_5), which ranged from 0.82 to 0.90.

[FIGURES 2-3 OMITTED]

Another method used to inspect the consistency of these chips is examination of Oligo B2 (Affymetrix 2002). The Oligo B2 contains the Poly-A Controls (dap, lys, phe, thr, and trp) and the Hybridization Controls (bioB, bioC, bioD, and cre) as part of the GeneChip Eukaryotic Hybridization Control Kit. Details on Oligo B2 can be found in the reference provided. Briefly, Oligo B2 serves as spike-in controls. The Poly-A Controls can be spiked into a complex RNA sample and carried through the sample preparation process. The Hybridization Controls are prepared in staggered concentrations (1.5, 5, 25, and 100 pM for bioB, bioC, bioD, and cre, respectively) independent of RNA sample preparation and are spiked into the hybridization cocktail. Although the Affymetrix reference states that variation in B2 hybridization intensities across the array is normal and does not indicate a variation in hybridization efficiency, we have often observed that these controls are expressed consistently across chips (unpublished data). Figure 3B presents the scatterplots of Oligo B2 controls among the 10 chips in Figure 3A. A small percentage of the Hybridization Controls on the chips were saturated. The linearity (correlation) between chips was higher when there were no saturated intensities. The Poly-A Controls have a better consistency than Hybridization Controls. Because the Poly-A Controls are carried in RNA samples through the preparation process, they are better candidates to indicate the consistency of data. The correlation coefficients among the Poly-A Controls were calculated and listed in Table 2. These correlation coefficients revealed that sites 3, 4, and 5 have less consistency with other chips. However, the inspection here was based on assuming all experiments followed the same protocol.

For a global view of the correlations among all chips, a matrix with each entry as one minus the correlation coefficient of the corresponding pair of chips was calculated and used as the distance matrix for multidimensional scaling (MDS) analysis with the MDS procedure (SAS Institute 1999). The results are presented in Figure 4. This two-dimensional representation gives the relative location of each chip based on the distance matrix of a multidimensional space (a dimension of 99 in this case). The plotted points are their relative location on a two-dimensional map. The closer two points are to each other, the more similar they are. The chips from sites 3 and 5 are spotted apart from the others except for the three chips from site 3 on the margin.
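
As a rough illustration of the distance matrix described above, the following Python sketch computes one minus the Pearson correlation for every pair of chips. It covers only the construction of the matrix, not the MDS step itself, which the authors ran with the SAS MDS procedure. The function names and the toy intensity vectors are hypothetical and are not taken from the study data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_distance_matrix(chips):
    """Symmetric matrix whose (i, j) entry is 1 - corr(chip i, chip j)."""
    n = len(chips)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - pearson(chips[i], chips[j])
            dist[i][j] = dist[j][i] = d
    return dist

# Toy log2 intensities for three chips (made-up numbers for illustration only).
toy_chips = [
    [8.0, 9.5, 11.0, 7.2, 10.1],
    [8.1, 9.4, 11.2, 7.0, 10.3],
    [9.0, 8.0, 10.0, 9.5, 8.5],
]
dist = correlation_distance_matrix(toy_chips)
```

In this toy input the first two chips track each other closely, so their distance is near 0, while the third chip is only weakly correlated with the first and sits at a distance near 1. A matrix of this form is what an MDS routine would take as input.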

[FIGURE 4 OMITTED]

For a quick summary of the consistency, the 99 chips were separated into 18 categories based on the unique combination of in vivo study site, dose, and time, and the average within-category correlation coefficients were calculated. The results are shown in Figure 5. The average correlation coefficients in all categories were higher than 0.9. Categories that did not involve sites 3, 4, and 5 (A01, A02, A11, A12, A21, and A22 in Figure 5), because the site A samples were not analyzed at those sites, had correlation coefficients higher than 0.95.

[FIGURE 5 OMITTED]

Three major characteristics were revealed on inspection of the aforementioned probe data, namely, different within-site chip-to-chip variation, different site-to-site variation, and high within-treatment-group correlation across sites. The first characteristic can be handled with the mixed-model approach and will be discussed later in this article. The site-to-site variation can be reduced by normalization. As seen on examination of the box plots in Figure 2A, there are two factors that need normalization: the range of data and the within-chip variance (size of the box). An interquartile normalization with the median as the location parameter and interquartile range as the scaling parameter was applied. Figure 2B presents the box plots after normalization. The observation of a high within-treatment-group correlation across sites provides an incentive to pool data across sites for more powerful statistical inferences.
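
A minimal Python sketch of a location-and-scale normalization of this kind is given below. The text does not spell out the exact formula, so the form used here (subtract the chip median, divide by the chip interquartile range) is an assumption, as are the function name, the choice of quartile method, and the toy values.

```python
import statistics

def iqr_normalize(log2_intensities):
    """Center one chip at its median and scale it by its interquartile range.

    Assumed form: (x - median) / IQR, applied chip by chip to log2 intensities.
    """
    q1, median, q3 = statistics.quantiles(log2_intensities, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero; chip cannot be scaled")
    return [(value - median) / iqr for value in log2_intensities]

# Toy log2 intensities for a single chip (made-up numbers for illustration only).
chip = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
normalized = iqr_normalize(chip)
```

After this step every chip has median 0 and interquartile range 1, so the boxes in a box plot line up across chips, while the shape of each chip's distribution is left untouched. That is the sense in which the approach is less aggressive than quantile normalization.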

Mixed-Model Analysis

The mixed-model approach provides flexible model specification for the ANOVA type of analysis with the ability to accommodate different correlation structures in the data. Chu et al. (2002) give more details on applying the mixed model to GeneChip probe data. The mixed model for the MP data applied here is as follows:

$$Y_{ijklp} = R_i + D_j + T_k + S_l + RD_{ij} + RT_{ik} + RS_{il} + DT_{jk} + DS_{jl} + TS_{kl} + P_p + RP_{ip} + DP_{jp} + TP_{kp} + SP_{lp} + A_{ijk(l)} + \varepsilon_{ijklp}, \tag{1}$$

$$A_{ijk(l)} \sim N(0, \sigma_l^2), \qquad \varepsilon_{ijklp} \sim N(0, \sigma^2).$$

The indices i, j, k, l, and p indicate site of in vivo study, dose, time point, site, and probe number, respectively. The index of the gene was omitted in the model, as the model will be run on a per-gene basis. The dependent variable Y is the normalized perfect match probe intensity. The symbols R, D, T, S, and P represent RNA samples, dose, time, site, and probe main effects, respectively. The symbols with two letters are the interactions of the two effects associated with the letters. The $A_{ijk(l)}$ is the lth within-site array random effect and is assumed to be normally distributed with mean 0 and variance $\sigma_l^2$. Specifying the array random effect induces a correlation across all observations (probes) on the same chip (probe set). The $\varepsilon_{ijklp}$ is a stochastic error and is assumed to be normally distributed with mean 0 and variance $\sigma^2$. The two random terms are assumed to be independent.
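
The correlation induced by the array random effect can be written out explicitly. The short derivation below is a sketch that uses only the assumptions stated for Model 1 (a shared array effect per chip and independence of the two random terms); the primed index p' is introduced here to denote a second probe on the same chip.

```latex
% Two different probes p \neq p' on the same chip share i, j, k, l and hence
% the same array effect A_{ijk(l)}. Fixed effects do not contribute to the
% covariance, and A and \varepsilon are independent, so
\operatorname{Cov}\!\left(Y_{ijklp},\, Y_{ijklp'}\right)
  = \operatorname{Var}\!\left(A_{ijk(l)}\right) = \sigma_l^2,
\qquad
\operatorname{Var}\!\left(Y_{ijklp}\right) = \sigma_l^2 + \sigma^2 .
% The implied within-chip correlation for site l is therefore
\operatorname{Corr}\!\left(Y_{ijklp},\, Y_{ijklp'}\right)
  = \frac{\sigma_l^2}{\sigma_l^2 + \sigma^2} .
```

Because the array variance is allowed to differ by site, a site with noisier chips gets a larger within-chip correlation, which is how the model accommodates different within-site chip-to-chip variation.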

Interactions involving more than two effects can be included in Model 1. However, after fitting those higher-order interactions in the model for several genes, we observed that they were not significant; therefore, they are not included in the model. The error term can be partitioned to associate with each site in a similar fashion to the array random effect. Because all the within-treatment-group correlations were higher than 0.9, partitioning the error term becomes a minor issue. In addition, assuming that the errors are identically distributed can enhance the strength of pooling data to more accurately estimate the associated variance.

A desired outcome of this exercise was to find genes responding to different doses at different time points by performing statistical testing on dose, time, and dose-by-time interaction effects. Whether the significant genes selected are consistent across sites was also of particular interest to parties involved in this across-site microarray study. This can be achieved by testing dose-by-site and time-by-site interactions.

For comparison purposes, a subset of 18 chips from site 8 was extracted and fitted in a similar fashion as in Model 1 but without any of the effects involving site. This single-site model is as follows:

$$Y_{ijklp} = R_i + D_j + T_k + RD_{ij} + RT_{ik} + DT_{jk} + P_p + RP_{ip} + DP_{jp} + A_{ijk} + \varepsilon_{ijklp}, \tag{2}$$

$$A_{ijk} \sim N(0, \sigma_a^2), \qquad \varepsilon_{ijklp} \sim N(0, \sigma^2).$$

The significance of the tests of some effects in this case was compared with the results from Model 1. For fitting Models 1 and 2, standard maximum likelihood approaches are usually best and can be accessed through software like the MIXED procedure (SAS Institute 1999).

Results

The statistical testing results of the 10 fixed effects that did not involve probe in Model 1 are presented in Figure 6A. Two plots, a histogram and a box plot, are drawn on the negative $\log_{10}$ p-values for each effect among all genes. Table 3 presents the number of significant genes selected when controlling the false-positive rate by Bonferroni's approach and when controlling three different false discovery rates (FDR). The cutoff of the negative $\log_{10}$ p-value for a 0.05 family-wide false-positive rate with Bonferroni's adjustment in this case is 6.245, which is indicated as the red horizontal line on each plot in Figure 6A. As expected, the site effect is highly significant for most of the genes (95.7%). The percentages of significant genes for the dose-by-site and time-by-site interactions were 0.11% and 0.19%, respectively. This implies that only 27 of 8,799 genes showed a differential response to dose or time across sites. Therefore, the genes selected as differentially responsive to dose or time from the pooled data sets are consistent across sites except for a very few. However, there were seven genes showing a very highly significant dose effect, with a negative logarithm p-value larger than 10, that also showed a significant dose-by-site interaction. An explanation for this is that an extremely significant main effect often causes significant interactions with other effects. The results of testing the fixed effects of Model 2 with data from site 8 only are presented in Figure 6B. All the negative logarithm p-values are < 6 in this case. This implies that there were no genes showing any significant effect with Bonferroni's criterion when data from site 8 only were used.
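
The reported cutoff of 6.245 can be reproduced with a short calculation. The Python sketch below assumes that the Bonferroni family consists of every gene-by-effect test, i.e., 8,799 genes times the 10 fixed effects not involving probe; that reading is an inference from the numbers in the text rather than something stated outright, and the constant names are illustrative.

```python
import math

N_GENES = 8799      # probe sets on the array, one model fit per gene
N_EFFECTS = 10      # fixed effects not involving probe that are tested per gene
ALPHA = 0.05        # family-wide false-positive rate

# Assumed family size: one test per gene and effect.
n_tests = N_GENES * N_EFFECTS

# Bonferroni: a test is significant if p <= ALPHA / n_tests,
# i.e., if -log10(p) >= -log10(ALPHA / n_tests).
cutoff = -math.log10(ALPHA / n_tests)

print(n_tests)           # 87990
print(round(cutoff, 3))  # 6.245
```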

[FIGURE 6 OMITTED]

Compared with Bonferroni's approach, which conservatively guarantees that the probability of one or more false positives is less than 0.05 across all of the tests, FDR allows a certain proportion (the cutoff rate) of significant genes to be false discoveries. More significant genes were selected with FDR approaches. However, the proportion of genes with significant dose-by-site or time-by-site effects was still quite low: 1.3 and 2.2%, respectively, with the cutoff set at 0.005 (1 in 200).
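
The contrast between the two criteria can be illustrated on a toy list of p-values. The article does not say which FDR procedure was used; the sketch below assumes the Benjamini-Hochberg step-up rule purely for illustration, and the p-values are made up.

```python
def bonferroni_select(pvalues, alpha=0.05):
    """Flag p-values that pass the Bonferroni threshold alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg_select(pvalues, q=0.05):
    """Flag p-values selected by the Benjamini-Hochberg step-up rule at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k whose ordered p-value satisfies p_(k) <= (k / m) * q
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * q:
            k_max = rank
    # Everything ranked at or below k_max is declared significant
    selected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            selected[idx] = True
    return selected


# Hypothetical p-values, not taken from the study
toy_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_bonferroni = sum(bonferroni_select(toy_pvalues))
n_fdr = sum(benjamini_hochberg_select(toy_pvalues))
print(n_bonferroni, n_fdr)
```

On this toy input the FDR rule selects at least as many tests as Bonferroni, which mirrors the pattern across the columns of Table 3.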

The results in Table 3 reveal that site-to-site effects, although significant in a large number of genes, appear to be only additive, as Model 1 fit the data well, with a median [r.sup.2] equal to 0.97, and only a few genes showed significant site-by-treatment interaction. Therefore, the site effects can be adjusted for in the statistical calculations in a way that does not bias conclusions regarding treatment differences. In other words, each site ends up relaying a similar story regarding significantly differentially expressed genes.
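
Why a purely additive site effect leaves treatment contrasts untouched can be seen in a small numerical sketch. The site offsets, dose effects, and baseline below are hypothetical values on a log scale, and the model is a deliberately simplified stand-in with no noise or interaction terms, not the mixed model fitted in the article.

```python
from statistics import mean

baseline = 8.0
site_offsets = {"site_a": 0.0, "site_b": 1.5, "site_c": -0.8}  # hypothetical
dose_effects = {"control": 0.0, "high": 0.6}                    # hypothetical

# Purely additive model: value = baseline + site offset + dose effect
data = {
    site: {dose: baseline + s + d for dose, d in dose_effects.items()}
    for site, s in site_offsets.items()
}

# Dose contrast computed separately within each site: the site offset cancels
within_site_contrast = {
    site: values["high"] - values["control"] for site, values in data.items()
}

# Pooled contrast after subtracting each site's own mean (a simple additive adjustment)
adjusted = {
    site: {dose: v - mean(list(values.values())) for dose, v in values.items()}
    for site, values in data.items()
}
pooled_contrast = (
    mean([a["high"] for a in adjusted.values()])
    - mean([a["control"] for a in adjusted.values()])
)

print(within_site_contrast)
print(pooled_contrast)
```

Every site reports the same dose contrast, and the pooled, site-adjusted contrast matches it. A site-by-dose interaction would break this equality, which is why the small number of significant interaction genes matters.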

Figure 7 presents the histograms and box plots of the standard deviations of the random components in Model 1: the seven within-site array random factors, [A.sub.ijk(l)], and the stochastic errors, [[epsilon].sub.ijklp]. These plots provide global comparisons among the site-specific array variations. The data from sites 6 and 7 show less array variation, whereas the data from sites 3 and 5 show larger array variation. The median standard deviations of sites 2-8 and stochastic errors are 0.029, 0.089, 0.037, 0.087, 0.018, 0.018, 0.024, and 0.096, respectively. Judging from comparison of these medians, the array variations from sites 3 and 5 are about 23-fold ([0.089.sup.2]/[0.018.sup.2]) larger than the variations from sites 6 and 7.
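
The fold comparison is a ratio of variances, that is, of squared standard deviations. A minimal check using the rounded medians quoted above gives a ratio of roughly 24, close to the 23-fold figure in the text; the small gap presumably reflects rounding of the published medians. The dictionary layout is illustrative only.

```python
# Median standard deviations by site, as quoted in the text (rounded values)
median_sd = {2: 0.029, 3: 0.089, 4: 0.037, 5: 0.087, 6: 0.018, 7: 0.018, 8: 0.024}

high_sd = max(median_sd[3], median_sd[5])  # sites with larger array variation
low_sd = min(median_sd[6], median_sd[7])   # sites with smaller array variation

# A ratio of standard deviations becomes a ratio of variances when squared
variance_ratio = (high_sd / low_sd) ** 2
print(f"variance ratio: {variance_ratio:.1f}")
```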

[FIGURE 7 OMITTED]

Figure 8 presents the comparison of significance of dose and time effects from Models 1 and 2. The red lines on the plots are regression-fitted lines with slopes 0.14 and 0.16 in Figure 8A (comparing significance of dose effect) and Figure 8B (comparing significance of time effect), respectively. Pooling data across sites increases statistical power substantially. Judging from the inverse of the slopes, the negative [log.sub.10] p-values increase 7.14- and 6.25-fold for dose and time effects, respectively, when pooling data across seven sites, with the number of chips increasing from 18 to 99.
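
The fold increases follow directly from the fitted slopes. The sketch below assumes each line regresses the single-site negative log p-values on the pooled ones, so that the inverse slope is the gain from pooling; the figure itself is not available here, so that orientation is an assumption.

```python
slopes = {"dose": 0.14, "time": 0.16}  # regression slopes quoted for Figure 8A and 8B

# Inverse slope = fold increase in -log10(p) when moving from one site to pooled data
fold_increase = {effect: 1 / slope for effect, slope in slopes.items()}

chips_single_site, chips_pooled = 18, 99
chip_ratio = chips_pooled / chips_single_site

for effect, fold in fold_increase.items():
    print(f"{effect}: {fold:.2f}-fold")
print(f"chips: {chip_ratio:.1f}-fold more arrays")
```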

[FIGURE 8 OMITTED]

Discussion

Combining data across sites typically provides more powerful statistical inference. However, consistency of data sets is an essential issue when analyzing data pooled across sites. A robust normalization method is desirable to make data sets across sites more comparable. The interquartile range normalization is suitable for data with high correlation but inconsistent data range across chips. This normalization was applied to the data used here to achieve consistent ranges across the majority of the data while preserving the significant site variation.
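
As a rough illustration of the idea, the sketch below centers each chip at its median and rescales it so that every chip has the same interquartile range. This is one common reading of interquartile range normalization, not necessarily the exact procedure used in the study; the function name, target values, and chip data are all hypothetical.

```python
import statistics


def iqr_normalize(chip, target_median=0.0, target_iqr=1.0):
    """Shift a chip to a common median and scale it to a common interquartile range."""
    q1, median, q3 = statistics.quantiles(chip, n=4, method="inclusive")
    scale = target_iqr / (q3 - q1)
    return [target_median + (value - median) * scale for value in chip]


# Two hypothetical chips: perfectly correlated but on very different ranges
chip_a = [1.0, 2.0, 3.0, 4.0, 5.0]
chip_b = [10.0, 20.0, 30.0, 40.0, 50.0]

norm_a = iqr_normalize(chip_a)
norm_b = iqr_normalize(chip_b)
print(norm_a)
print(norm_b)
```

Because only the middle half of each chip's distribution sets the scale, a handful of extreme probes has little influence on the adjustment.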

An alternative means for normalizing data across sites is to use a universal reference sample. The universal reference is typically a type of mRNA pool from all the mice in the experiment and is distributed to each site of the consortium to serve as baseline for normalization. Data can be normalized within each site to the universal reference array, using linear or nonlinear methods. The advantage of a universal reference is that it serves as a bridge that makes the data from all chips across sites comparable; however, extra costs are involved in preparation, distribution, and maintenance of the pooled reference as well as the expense of running more arrays. In addition, data from the reference are subject to nonconstant within-site sources of variability and do not provide a gold standard for comparison. This concern will be more serious for sites with high within-site array-to-array variation.

Rather than using the data from the whole array for normalization, another alternative is to use a portion of the data considered to be invariant across arrays to generate a scoring function for normalization (Li and Wong 2001; Schadt et al. 2001). Those probes in Affymetrix provided as Oligo B2 can be considered invariant with known concentrations. Again, the quality of those controls is key to the success of this approach, and we are currently investigating ways of implementing this approach.

The mixed model provides a flexible method to adjust for site effects and to accommodate different array variations between sites. Significant site effects were revealed by this analysis as expected; however, only a few genes showed significant interaction effects between site and treatment (dose or time). In other words, each site tends to tell the same story regarding the list of significantly differentially expressed genes. This is a primarily positive result from this study and lends hope to the prospect of gaining power by combining study results. Similar studies are needed to extend this type of analysis to investigation of cross-platform data sets.
Table 1. User identifiers of seven sites in HESI consortium.

User ID                   Site                   RNA sample applied

2      Novartis AG                                      A, B
3      Roche Molecular Biochemicals                     A, B
4      Wyeth Research                                    B
5      AstraZeneca Pharmaceuticals, Inc.                 B
6      Schering-Plough Research Institute               A, B
7      Boehringer-Ingelheim Pharmaceuticals, Inc.        B
8      Pfizer Inc                                       A, B

Table 2. Correlation coefficients of Poly-A Controls of the 10 chips selected.

        A01_2  B01_2  B01_3  B01_4  B01_5  A01_6  B01_6  B01_7  A01_8  B01_8

A01_2   1.00   0.90   0.71   0.66   0.78   0.84   0.83   0.80   0.89   0.90
B01_2          1.00   0.78   0.71   0.80   0.94   0.95   0.86   0.95   0.96
B01_3                 1.00   0.59   0.69   0.82   0.77   0.73   0.78   0.78
B01_4                        1.00   0.62   0.73   0.74   0.80   0.72   0.68
B01_5                               1.00   0.79   0.77   0.72   0.80   0.78
A01_6                                      1.00   0.96   0.87   0.94   0.92
B01_6                                             1.00   0.89   0.94   0.94
B01_7                                                    1.00   0.87   0.86
A01_8                                                           1.00   0.95
B01_8                                                                  1.00

Table 3. Number of significant genes for each of the 10 effects
that do not involve probe.

                     Number of significant genes (a)

Effect   Bonferroni      FDR (0.005) (b)   FDR (0.01) (b)   FDR (0.05) (b)

R          466 (5.3)     1,620 (18.4)      1,851 (21.0)     2,605 (29.6)
D          787 (8.9)     1,900 (21.6)      2,159 (24.5)     3,067 (34.9)
T          173 (2.0)       735 (8.4)         890 (10.1)     1,539 (17.5)
S        8,422 (95.7)    8,746 (99.4)      8,762 (99.6)     8,792 (99.9)
RD          33 (0.38)      274 (3.1)         352 (4.0)        714 (8.1)
RT         361 (4.1)       972 (11.0)      1,144 (13.0)     1,738 (19.8)
RS       1,921 (21.8)    5,031 (57.2)      5,443 (61.9)     6,509 (73.3)
DT         507 (5.8)     1,504 (17.1)      1,735 (19.7)     2,690 (30.6)
DS          10 (0.11)      112 (1.3)         149 (1.7)        421 (4.8)
TS          17 (0.19)      190 (2.2)         308 (3.5)        970 (11.0)

Abbreviations: D, dose; DS, dose-site; DT, dose-time; R, RNA
sample; RD, RNA sample-dose; RS, RNA sample-site; RT, RNA
sample-time; S, site; T, time; TS, time-site.

(a) The percentage of significant genes for each effect is
listed in parentheses after the count of significant genes.
(b) Results of controlling the false discovery rate (FDR), with
0.005, 0.01, and 0.05 as the cutoffs.


REFERENCES

Affymetrix, Inc. 2002. GeneChip Expression Analysis: Data Analysis Fundamentals. Santa Clara, CA:Affymetrix, Inc. Available: http://www.affymetrix.com/support/downloads/manuals/data_analysis_fundamentals_manual.pdf [accessed 7 October 2003].

Bolstad BM, Irizarry RA, Astrand M, Speed TP. 2003. A comparison of normalization methods for high-density oligonucleotide array data based on bias and variance. Bioinformatics 19:185-193.

Choi JK, Yu U, Kim S, Yoo OJ. 2003. Combining multiple microarray studies and modeling interstudy variation. Bioinformatics 19 (suppl 1):184-190.

Chu T, Weir B, Wolfinger RD. 2002. A systematic statistical linear modeling approach to oligonucleotide array experiments. Math Biosci 176:35-51.

Chu T, Weir BS, Wolfinger RD. In press. Comparison of Li-Wong and loglinear mixed models for the statistical analysis of oligonucleotide arrays. Bioinformatics.

Dudoit S, Yang YH, Callow MJ, Speed TP. 2002. Statistical methods for identifying genes with differential expression in replicated cDNA microarray experiments. Stat Sin 12:111-139.

Ghosh D, Barette TR, Rhodes D, Chinnaiyan AM. 2003. Statistical issues and methods for meta-analysis of microarray data: a case study in prostate cancer. Funct Integr Genomics 3:180-188.

Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, et al. 2003. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4:249-264.

Li C, Wong WH. 2001. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA 98:31-36.

Lockhart D, Dong H, Byrne M, Follettie M, Gallo M, Chee M, et al. 1996. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 14:1675-1680.

SAS Institute Inc. 1999. SAS/STAT Software, Version 8. Cary, NC:SAS Institute, Inc.

Schadt E, Li C, Ellis B, Wong WH. 2001. Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. J Cell Biochem (suppl 37):120-125.

Tan PK, Downey TJ, Spitznagel EL, Xu P, Fu D, Dimitrov DS, et al. 2003. Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res 31:5676-5684.

Tzu-Ming Chu, (1) Shibing Deng, (1) Russ Wolfinger, (1) Richard S. Paules, (2) and Hisham K. Hamadeh (3)

(1) SAS Institute Inc., Cary, North Carolina, USA; (2) National Institute of Environmental Health Sciences, National Institutes of Health, Department of Health and Human Services, Research Triangle Park, North Carolina, USA; (3) Amgen Inc., Thousand Oaks, California, USA

This article is part of the mini-monograph "Application of Genomics to Mechanism-Based Risk Assessment."

Address correspondence to H.K. Hamadeh, Amgen Inc., One Amgen Center Dr., Mail Drop 5-1-A, Thousand Oaks, CA 91320 USA. Telephone: (805) 447-4818. Fax: (805) 499-2936. E-mail: hhamadeh@amgen.com

We thank our colleague H. Chen, at SAS Institute, Inc., for creating some of the figures in this article. We also thank the Hepatotoxicity Working Group of the ILSI Health and Environmental Sciences Institute's Committee on the Application of Genomics to Mechanism-Based Risk Assessment testing program, a scientific consortium organized to facilitate further development and advances in genomics and proteomic methodologies to increase the utility of gene expression data for mechanism-based risk assessment.

The authors declare they have no competing financial interests.

Received 6 October 2003; accepted 15 December 2003.
COPYRIGHT 2004 National Institute of Environmental Health Sciences
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2004, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Title Annotation:Genomics and Risk Assessment: Mini-Monograph
Author:Hamadeh, Hisham K.
Publication:Environmental Health Perspectives
Geographic Code:1USA
Date:Mar 15, 2004
Words:5441