Using DNA Microarrays to Study Host-Microbe Interactions.
The complex interaction between a microbial pathogen and a host is the underlying basis of infectious disease. By understanding the molecular details of this interaction, we can identify virulence-associated microbial genes and host-defense strategies and characterize the cues to which they respond and mechanisms by which they are regulated. This information will guide the design of a new generation of medical tools.
Genomic sequencing will provide the data needed to unravel the complexities of the host-pathogen interaction. As of August 10, 2000, draft sequence was available for 87% of the human genome (http://www.ncbi.nlm.nih.gov/genome/seq/), and at least 39 prokaryotic genomes, including those of more than a dozen human pathogens, had been completely sequenced (http://www.tigr.org/tdb/mdb/mdbcomplete.html). The pace of gene discovery is accelerating rapidly, but its potential for explaining life at the molecular level remains largely unrealized because our understanding of gene function lags increasingly far behind. For example, even in the heavily studied Escherichia coli, no function has been assigned to more than one third of its genes (1). High-throughput methods for assessment of function are clearly required if this wealth of primary sequence information is to be used.
Global profiling of gene expression is one attractive approach to assessing function. Because a gene is usually transcribed only when and where its function is required, determining the locations and conditions under which a gene is expressed allows inferences about its function. Several independent high-throughput methods for differential gene expression (including SAGE and differential display) may enable function annotation of sequenced genomes (2). DNA microarray hybridization analysis stands out for its simplicity, comprehensiveness, data consistency, and high throughput.
Transcription control plays a key role in host-pathogen interaction (3,4); thus, genomewide transcription profiling seems particularly appropriate for the study of this process. This review focuses on microarray-based approaches for studying transcription response because they hold exceptional promise for the study of infectious disease. Microarray-based genotyping applications, although expected to make substantial contributions in this field, are covered only briefly here.
High-Density DNA Microarrays: Basic Tools
First described in 1995 (5), high-density DNA microarray methods have already made a marked impact on many fields, including cellular physiology (6-11), cancer biology (12-17), and pharmacology (18,19). The first results of gene expression profiling of the host-pathogen interaction have just begun to emerge. Before exploring these results, we briefly review the methods.
The key unifying principle of all microarray experiments is that labeled nucleic acid molecules in solution hybridize, with high sensitivity and specificity, to complementary sequences immobilized on a solid substrate, thus facilitating parallel quantitative measurement of many different sequences in a complex mixture (20,21). Although several methods for building microarrays have been developed (22,23), two have prevailed. In one, DNA microarrays are constructed by physically attaching DNA fragments such as library clones or polymerase chain reaction (PCR) products to a solid substrate (5) (Figure 1). By using a robotic arrayer and capillary printing tips, we can print at least 23,000 elements on a microscope slide (P. Brown, pers. comm.; Figure 2). In the other method, arrays are constructed by synthesizing single-stranded oligonucleotides in situ by use of photolithographic techniques (24,25). Advantages of the former method include relatively low cost and substantial flexibility (which explain its wide implementation in the academic setting); in addition, primary sequence information is not needed to print a DNA element. Advantages of the latter method include higher density (>280,000 features on a 1.28 × 1.28-cm array) and elimination of the need to collect and store cloned DNA or PCR products. Continued commercial interest in microarray technology promises increasing array element density, better detection sensitivity, and cheaper, faster methods. Technical descriptions of microarray construction methods and hybridization protocols are available (26-28; and http://cmgm.stanford.edu/pbrown/mguide/index.html).
[Figures 1-2 ILLUSTRATION OMITTED]
Messenger RNA from eukaryotic cells is usually specifically labeled by affinity purification of mRNA with an oligo-dT resin, followed by incorporation of dye-labeled nucleotides into cDNA molecules by reverse transcriptase (RT) with random or oligo-dT oligonucleotide primers (Figure 1). In prokaryotes, the absence of polyadenylation on transcripts makes labeling of mRNA more difficult. One method is labeling of total RNA, either by covalent linkage (29) or by incorporating dye-labeled nucleotides into complementary DNA through RT and random oligonucleotide primers (30). In spite of the high copy number of labeled ribosomal and tRNA molecules in the hybridization reactions, specific hybridization of mRNA to the array can be achieved under appropriate stringency. An alternative method is to prime reverse transcription with a mixture of reverse-strand oligonucleotides specific for open reading frames (ORFs), either those used to construct the microarray (M. Laub and L. Shapiro, pers. comm.) or a minimally complex mixture of octamers sufficient to hybridize to the 3' end of every ORF (31). This method results in higher signal-to-noise ratios by preferentially synthesizing cDNA from coding regions.
For printed DNA microarrays, relative transcript abundance is measured by labeling two samples with different fluorescent dyes (e.g., Cy3 and Cy5), hybridizing them simultaneously, and determining the fluorescence ratio for each spot on the array (Figure 1). On oligonucleotide arrays, multiple probes from the same gene, each paired with a corresponding mismatch probe that serves as an internal control, together with labeled transcripts of known amounts for standard genes, make quantitative measurement of transcript abundance possible after hybridizing a single labeled sample (25). For both techniques, use of fluorescent labeling enhances sensitivity and the dynamic range of measurement.
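The two-color ratio measurement described above can be sketched as follows. This is a minimal illustration, not a published protocol: the function name `log2_ratios`, the `floor` parameter, the median-centering step, and the intensity values are all assumptions introduced here.

```python
import math

def log2_ratios(cy5, cy3, floor=1.0):
    """Return per-spot log2(Cy5/Cy3), shifted so the median log ratio is 0.

    cy5, cy3: dicts mapping spot name -> background-subtracted intensity.
    floor: hypothetical minimum intensity, used only to avoid dividing by zero.
    """
    raw = {}
    for spot in cy5:
        red = max(cy5[spot], floor)
        green = max(cy3[spot], floor)
        raw[spot] = math.log2(red / green)
    # Global normalization (an assumption): center the median log ratio on 0,
    # which presumes that most genes are not differentially expressed.
    ordered = sorted(raw.values())
    mid = len(ordered) // 2
    if len(ordered) % 2:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2
    return {spot: value - median for spot, value in raw.items()}

# Made-up intensities for three spots.
cy5 = {"geneA": 4000.0, "geneB": 1000.0, "geneC": 500.0}
cy3 = {"geneA": 1000.0, "geneB": 1000.0, "geneC": 1000.0}
ratios = log2_ratios(cy5, cy3)
```

Working in log space makes induction and repression symmetric: a fourfold increase and a fourfold decrease become +2 and -2.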
Gene expression array experiments can also be performed by hybridizing a single labeled mRNA sample to "macroarrays" of DNA elements on positively charged filters (10,11,32-34). Because this format does not require any special arraying or scanning equipment, specialty arrays can be made and analyzed relatively cheaply. Human, mouse, and microbial macroarrays are also commercially available (SigmaGenosys, The Woodlands, TX; Research Genetics, Huntsville, AL; Clontech Laboratories, Palo Alto, CA; Genome Systems, St. Louis, MO). The major disadvantages of this format are reduced sensitivity (32), a limited number of elements per array, and the need for higher concentrations of labeled cDNA.
Microarray Data Analysis
Microarrays are likely to become a standard tool of the microbiology laboratory. However, because genomewide datasets are large and comprehensive, analysis of an experiment can become daunting. Careful experimental design can simplify analysis and interpretation of the dataset by minimizing the number of variables that affect gene expression. For example, strain differences can be minimized by using isogenic mutants, tissue complexity can be reduced by studying clonal cell lines, and complex regulatory pathways can be tamed by experimental modulation of transgene expression (6).
Because microarray experiments result in such large amounts of data, false-positive results are likely. Analyzing multiple independent experiments may eliminate spurious results (32). Also important is validation of differentially expressed genes by independent methods. When checked by a number of methods including quantitative RT-PCR (6, 35), Northern blotting (33, 34, 36), and protein expression (33, 34), most differentially expressed genes have been confirmed. For example, 72 of 72 mRNAs found to be regulated in response to cytomegalovirus (CMV) infection were confirmed by either prior reports or Northern blotting (37). Future challenges for microarray researchers will include developing databases and algorithms to manage and analyze vast genomic-scale datasets.
Image Analysis Software
The first step after hybridization is capturing an image of the array and from it, extracting numerical data for each element (Figure 1). Several software applications, including those packaged with most commercial scanners, can perform this task. However, not all programs use the same algorithms to calculate signal intensity, and each of the programs exports a different constellation of signal quality measurements, complicating comparisons between data acquired with different applications (38). If gene expression datasets are to be compared, these measurements must be standardized. Furthermore, standard, robust statistical methods must be developed for assigning significance values to gene expression measurements.
Although many laboratories are now capable of collecting microarray data, few have access to a database that can effectively meet their data requirements. With considerable investment of resources, a few full-featured, relational gene expression databases have been developed, but these are not available for public deposition of data (e.g., http://genome-www4.stanford.edu/MicroArray/MDEV/index.html; http://www.nhgri.nih.gov/DIR/LCG/15K/HTML/dbase.html). Recently released, the freely available AMAD software package (http://www.microarrays.org/software.html) provides basic microarray data storage and retrieval capabilities to the average laboratory.
A grander goal for the community is establishing a consolidated resource for public distribution of microarray data (39-41). Again, the lack of a standard format for microarray data interferes with creating such a resource (38,39). The European Bioinformatics Institute, recognizing this obstacle, has proposed defining a standard based upon XML, a computer markup language that combines data and formatting in a single file for distribution over the World-Wide Web (40; http://www.ebi.ac.uk/arrayexpress/).
Inferring biologically meaningful information from microarray data requires sophisticated data exploration. Most global gene expression analyses have used some form of unsupervised clustering algorithm (16,42-44) to find genes coregulated across the dataset (Figure 1). A primary justification for this approach is that shared expression often implies shared function (38,43). In datasets containing many experiments, clustering can also group experiments on the basis of gene expression profiles, an approach that has been successful in classifying tumor-derived cell lines (19, 45) and tumor subtypes (12-17).
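As a rough illustration of unsupervised clustering, the sketch below groups genes by average-linkage agglomerative clustering with Pearson correlation as the similarity measure. Both choices, the `min_similarity` cutoff, and the expression values are assumptions made for this example; the cited studies may have used other algorithms and parameters.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster(profiles, min_similarity=0.8):
    """Merge the most similar pair of clusters until no pair reaches the cutoff.

    profiles: dict mapping gene name -> list of log ratios across experiments.
    min_similarity: hypothetical stopping threshold on mean pairwise correlation.
    """
    clusters = [[name] for name in profiles]

    def similarity(c1, c2):
        # Average linkage: mean correlation over all cross-cluster gene pairs.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = similarity(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        if best[0] < min_similarity:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Made-up profiles: two genes rising and two genes falling over four conditions.
profiles = {
    "gene1": [0.1, 1.0, 2.1, 3.0],
    "gene2": [0.0, 1.1, 1.9, 3.2],
    "gene3": [3.0, 2.0, 1.1, 0.0],
    "gene4": [2.9, 2.2, 0.9, 0.1],
}
groups = cluster(profiles)
```

The same function could group experiments rather than genes if the input is transposed so that each key is an experiment and each list holds its values across genes. Profiles with zero variance would need to be filtered out first, because the correlation is undefined for them.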
When a coregulated class of genes is known, supervised clustering algorithms, which are trained to recognize known members of the class, can assign uncharacterized genes to that class. For example, a machine-learning method known as a support vector machine has been used to classify yeast genes by function on the basis of shared regulation (46). Robust determination of coregulated gene clusters may be achieved by using a tiered approach: unsupervised clustering to identify coregulated genes followed by testing and refinement with supervised algorithms (47).
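The idea of training on known class members can be illustrated with a much simpler stand-in for a support vector machine: a nearest-centroid classifier. This is explicitly a different and cruder technique than the one cited, shown only to make the train-then-assign workflow concrete. The class labels, function names, and profiles are hypothetical.

```python
def centroid(profiles):
    """Mean expression profile of a list of equal-length profiles."""
    n = len(profiles)
    return [sum(values) / n for values in zip(*profiles)]

def squared_distance(x, y):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def classify(profile, training):
    """Assign a profile to the class whose centroid is closest.

    training: dict mapping class label -> list of profiles of known members.
    """
    centroids = {label: centroid(members) for label, members in training.items()}
    return min(centroids, key=lambda label: squared_distance(profile, centroids[label]))

# Made-up training data: two known classes, three experiments each.
training = {
    "ribosomal": [[2.0, 1.8, -0.5], [2.2, 1.6, -0.3]],
    "other": [[0.0, 0.1, 0.2], [-0.2, 0.0, 0.1]],
}
label = classify([1.9, 1.5, -0.4], training)
```

A real analysis would also hold out some known class members to estimate how often the classifier errs before trusting its assignments for uncharacterized genes.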
Although clustering algorithms will continue to be a mainstay in the analysis of gene expression datasets, a wealth of other data-mining techniques have yet to be applied (38,48). Preliminary reports indicate that many algorithms and visualization methods are being developed, but their ability to extract biologic insight has yet to be established (49-51).
The study of microbial pathogens, and prokaryotes in general, will require the development of some specialized analysis tools. First, the compact and modular structure of prokaryotic genomes--and in particular, the presence of operons and pathogenicity islands--suggests that important insights may be gained by mapping gene expression information onto genomic structure. In addition, because gene expression will be measured in many different pathogens, often under the same environmental conditions, tools for cross-species comparison of gene expression data will permit the detection of conserved transcription responses.
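One way to map expression onto genome structure is to scan genes in chromosomal order for runs of adjacent induced genes, which may hint at operons or islands. The sketch below assumes a simple linear gene order and ignores strand and intergenic distance; the function name, thresholds, and values are illustrative only.

```python
def induced_runs(genes_in_order, log_ratio, threshold=1.0, min_length=2):
    """Return runs of adjacent genes whose log ratio meets the threshold.

    genes_in_order: gene names sorted by chromosomal position.
    log_ratio: dict mapping gene name -> log2 expression ratio.
    threshold, min_length: hypothetical cutoffs chosen for illustration.
    """
    runs, current = [], []
    for gene in genes_in_order:
        if log_ratio.get(gene, 0.0) >= threshold:
            current.append(gene)
        else:
            if len(current) >= min_length:
                runs.append(current)
            current = []
    # Close a run that extends to the last gene in the list.
    if len(current) >= min_length:
        runs.append(current)
    return runs

# Made-up data: g2-g4 are adjacent and induced; g6 is induced but isolated.
order = ["g1", "g2", "g3", "g4", "g5", "g6"]
expression = {"g1": 0.1, "g2": 2.3, "g3": 1.9, "g4": 2.8, "g5": -0.2, "g6": 1.5}
runs = induced_runs(order, expression)
```

A circular chromosome would additionally require joining a run at the end of the list to one at the start.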
Examining a Microorganism: Application of DNA Microarrays
Microarray technology promises to speed the study of uncharacterized or poorly characterized microbes by contributing to annotation of the microbial genome, enabling exploration of microbial physiology, and identifying candidate virulence factors.
Designing a Microbial Genome Microarray
Designing a whole-genome DNA microarray for a fully sequenced microbe is conceptually straightforward. Several sensitive microbial gene-finding programs can quickly and accurately predict most ORFs (52-57). DNA fragments representing each of the ORFs can be obtained by PCR amplification that uses ORF-specific oligonucleotides, the design of which can be automated with primer design software such as Primer3 (58). Homology-searching algorithms should be used to choose regions of genes that will not cross-hybridize with other regions of the genome. After a simple purification step, PCR fragments can be arrayed by a robotic arrayer (5). This basic approach has been used to construct a 4,290-ORF E. coli microarray (10, 11) and a 3,834-ORF Mycobacterium tuberculosis microarray (30) as well as full-genome arrays for Helicobacter pylori (S. Fallow, pers. comm.) and Caulobacter crescentus (L. Shapiro, pers. comm.). Microarray fabrication based on photolithographic synthesis of oligonucleotides in situ is also a viable approach and has been successfully used for the production of an E. coli complete ORF chip (E. coli Genome Array, Affymetrix, Santa Clara, CA).
The utility of microarrays is not restricted to fully sequenced organisms. A powerful screening tool can be obtained by arraying DNA libraries, as has been done for the eukaryotic pathogen Plasmodium falciparum (59). A DNA microarray of 3,648 random genomic clones was used to identify >50 genes for which expression differed significantly between the trophozoite and gametocyte stages. The major limitation of this approach is that the identity of any element of interest must be determined after the experiment.
Annotating the Function of a Microbial Genome
For many pathogens, function information is available for only a small number of genes. Moreover, the relative insufficiency of genetic tools can make obtaining such information difficult. However, because >70% of bacterial proteins have orthologs in other organisms (60,61), one can leverage extensive knowledge of function from the model organisms to infer function for a pathogen's genome. Similarity searches alone will predict functions of many genes.
We expect the study of genomewide expression patterns to contribute even further to annotation of function. The rationale for this belief follows from the observation that shared expression often implies shared function (38). As suggested by Brown and Botstein (21), the inclusion of a gene with a characterized ortholog in a coregulated gene cluster can predict the function of the remaining genes in that cluster, thus bootstrapping the function annotation of the pathogen's genome. This assertion is borne out in a study of global gene expression in Saccharomyces cerevisiae. Clustering of 2,467 gene expression profiles across a series of 78 experiments representing eight cellular processes demonstrated coregulation of genes that participated in shared cellular function (43). Therefore, the acquisition of a pathogen's gene expression data from even a modest number of experimental conditions may lead to testable hypotheses about function for a substantial number of genes, even those lacking sequence similarity to genes whose function has been characterized.
Probing a Microbe's Physiologic State
The assumption that genes are preferentially expressed when their function is required allows inference of gene function directly from physiologic gene response. For example, genes preferentially transcribed during the diauxic shift in yeast are predicted to contribute in the metabolic transition to respiration (9). Thus, gene expression studies will contribute to function annotation by identifying the specific environmental and physiologic conditions in which each gene is expressed. Furthermore, as annotation improves, the direction of this inference may be reversed, i.e., if information on function is known for many genes, genomic expression profiling may reveal the physiologic state of the organism.
Two studies have used whole-genome DNA arrays to explore gene expression response to environmental stimuli in E. coli. In the first, treatment with isopropyl-β-D-thiogalactopyranoside (IPTG) was shown to induce only the lac operon and, to a lesser extent, the melibiose operon (11). In the second, comparison of strains grown in minimal versus rich media revealed 344 genes that were differentially expressed between the two conditions: preferential expression of the translation apparatus in rich media and the amino acid biosynthetic pathways in minimal media were entirely consistent with prior data (10). The first study also examined gene expression during heat shock, revealing 119 genes with altered expression levels, all but 35 of which were previously recognized as heat shock genes (11). These studies confirm that the physiologic state of bacteria can be inferred from gene expression data.
In the first report of global gene expression monitoring in a bacterial pathogen, oligonucleotide microarrays were used to measure the relative transcript levels of 100 Streptococcus pneumoniae genes during the development of natural competence and during stationary phase (29). The results confirmed induction of the cin operon and identified 11 genes differentially regulated in stationary versus exponential phase. Of course, gene expression monitoring is not restricted to the study of bacterial pathogens. Transcription of the CMV genome was measured during infection by using an array of 75-mer oligonucleotides representing each of the 226 predicted CMV ORFs (62). By blocking translation or DNA replication, the researchers revealed a detailed classification of CMV genes into four kinetic classes, in agreement with previous reports, and assigned many ORFs, for which expression data were not previously available, into these groups.
Identifying Candidate Virulence Factors
Because expression of virulence-associated genes is tightly regulated (4), measuring a pathogen's gene expression in microenvironments specific to the pathogen and germane to the disease process is critical. Exploration of pathogen gene expression in the host environment may be technically challenging because of the relatively small number of pathogens present in an infected animal (29). Until more sensitive detection protocols are developed, examining global gene expression will be more practical in environmental conditions that mimic aspects of the host environment, such as elevated temperature, iron limitation, and changes in pH (4, 63) and in cell culture models. In fact, a microarray has been used to monitor gene expression in M. tuberculosis while it infects cultured monocytes (64). Even after measurement of bacterial gene expression from infected hosts becomes feasible, the ex vivo datasets will facilitate deconstruction of the in vivo gene expression response into component responses, leading to detailed understanding of the pathways of virulence factor regulation.
Identifying candidate virulence factors through a global gene expression method relies on two assumptions. First, because virulence-associated genes are often coordinately regulated (4), new virulence factors are likely to be coregulated with known ones. By clustering gene expression profiles across a large number of conditions, we can precisely monitor coregulation, thus revealing subtleties of regulation and leading to the identification of bona fide regulons. Second, because virulence-associated genes are tightly regulated (4), genes that are specifically expressed during infection or under conditions mimicking infection are candidate virulence factors. This assumption has been justified by numerous studies using in vivo expression technology (IVET) and differential fluorescence induction (DFI), in which genes induced during infection are often required for virulence (4, 65). When RNA from in vivo microbial samples can be efficiently isolated and labeled, microarrays will provide substantial advantages over IVET and DFI technologies for identifying putative virulence factors, including immediate identification of differentially expressed genes and detection of temporal profiles of transcription induction and repression. As is demanded for candidate genes identified by any expression screening approach, a role in pathogenesis must be confirmed by mutation and subsequent assays of virulence.
By identifying factors expressed in the host, microarray methods may also identify potential vaccine targets. Furthermore, one could identify candidate epitopes for vaccine development for intracellular pathogens by predicting whether genes that are preferentially expressed inside host leukocytes will encode promiscuous human leukocyte antigen class II ligands (66).
Gene expression studies may also reveal key regulatory differences that lead to differing virulence between closely related pathogen strains. For example, variations in virulence of Listeria monocytogenes serotypes have been correlated with differential transcription of PrfA-regulated virulence genes (67, 68). However, because microarrays cannot measure expression of genes that are absent from the reference strain, genotypic differences such as horizontal transfer of virulence factors will not be detectable by this method.
Yet another application for microarrays is the study of drug effects on microbial cellular physiology, as revealed by global gene expression patterns (69). This approach has been used to identify drug-specific gene expression signatures in yeast and human cells (18,19,70). Correlation of gene expression with drug activity may suggest molecular details of drug action, and correlation of transcription profiles in untreated cells with drug response may reveal mechanisms for sensitivity and resistance (19).
This approach has recently been used to characterize gene expression response in M. tuberculosis exposed to known inhibitors of the mycolic acid biosynthesis pathway, isoniazid and ethionamide (30). Both of these compounds elicited a similar gene expression response profile, characterized by pronounced transcription induction of five adjacent genes encoding fatty acid biosynthesis enzymes. Because a proven isoniazid target, KasA, was among these genes, the authors proposed that the adjacent, coregulated loci might be targets for new anti-tuberculosis drugs. Finally, these results suggested that the mode of action of a novel compound may be inferred from gene expression response to that compound.
Using microarrays to detect microbial polymorphisms linked to known drug-resistance phenotypes will also influence diagnosis and subsequent drug treatment. For example, an oligonucleotide array was used to detect mutant alleles of the M. tuberculosis rpoB gene, which are known to confer resistance to rifampicin (71).
One microarray application that interrogates DNA rather than RNA is the identification of genomic deletions in mutant strains and environmental isolates by measuring the number of DNA copies at each locus, a technique termed array-based comparative genome hybridization (72). This technique was used to identify several large deletions in a number of BCG vaccine strains and reconstruct their phylogeny (73).
Oligonucleotide arrays have also been used for fine-scale genotyping of polymorphisms in related pathogens. Accurate identification of Mycobacterium species using a GeneChip containing a set of 82 polymorphic oligonucleotides from the 16S ribosomal RNA gene demonstrated the potential power of this approach for molecular diagnostics (71). As additional microbial genome ORF microarrays become available, molecular surveys of the genomic structure of multiple strains will become far more precise and feasible. Two caveats should be mentioned: the ability to characterize genome insertions relative to the reference sequence is lacking, and the degree to which sequence variability can be characterized on the basis of microarray hybridization is unknown.
Examining a Host: Application of DNA Microarrays
Designing Microarrays for Host Organisms
The currently described human DNA microarrays are largely composed of expressed sequence tags (ESTs). Culling ESTs from many different tissue sources and limiting representation of any single Unigene cluster (see http://www.ncbi.nlm.nih.gov/UniGene/Hs.stats.shtml) have resulted in better than 50% representation of the predicted 80,000-100,000 human coding regions (28). A variety of human DNA and oligonucleotide microarrays are available commercially (e.g., Incyte, Palo Alto, CA; Affymetrix; NEN Life Science Products, Boston, MA).
For in vivo studies of host response, infection of animal models will often be necessary. If the animal is a primate, human DNA microarrays might be used to monitor host gene expression because of the high level of primary sequence similarity between species. Sequence similarity is too low to permit reliable cross-hybridization with nonprimate vertebrates, but microarrays composed of mouse and rat sequences have been described (74) and are available (e.g., Incyte, Affymetrix).
Microarrays promise to accelerate our understanding of the host side of the host-pathogen interaction. A large fraction of the genome can be simultaneously interrogated, and clustering of the data may identify groups of genes that implicate activation or repression of key regulatory pathways. Microarrays also allow the temporal sequence of transcription induction and repression to be followed, a prerequisite for determining the order of events following an encounter. Finally, ascertainment of the host cell's physiologic state, particularly apoptosis and necrosis, by genomewide profiling will facilitate separation of primary and secondary effects.
One important caveat of studying transcription in any system is that post-transcription regulatory events cannot be detected. This is particularly important in the case of host response because many important host cell events, such as cytoskeletal rearrangements, occur after transcription (75). Therefore, some key aspects of the molecular program may not be easily characterized by gene expression profiling. Eventually, it may be possible to monitor simultaneously the levels, activities, and interactions of all proteins in the cell (76).
Although analyzing gene expression of infected tissues is feasible, cellular heterogeneity may make analysis of host response complicated. Examining the response in infected cultured cells by using cell types most likely to encounter the pathogen may reduce the complexity of the system being examined. Results obtained in cell culture systems will be instrumental in interpreting gene expression profiles of specific cell types from whole tissue datasets.
The first application of global gene expression methods to pathogenesis used oligonucleotide arrays to monitor gene expression in primary human fibroblasts infected by human CMV (37). The transcript abundance of 258 out of 6,600 human genes changed by more than fourfold compared to uninfected cells at either 8 or 24 hours after infection. Some of these changes, such as induction of cytokines, stress-inducible proteins, and many interferon-inducible genes, were consistent with induction of cellular immune responses.
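The selection rule described above (a change of more than fourfold at either time point, relative to uninfected cells) can be sketched as a simple filter. The function name, data layout, and abundance values are assumptions for illustration and do not reproduce the cited study's data or software.

```python
def changed_genes(infected, uninfected, fold=4.0):
    """Return genes whose abundance changed more than `fold` at any time point.

    infected: dict mapping gene -> {hours after infection: abundance}.
    uninfected: dict mapping gene -> baseline abundance (assumed positive).
    """
    hits = []
    for gene, series in infected.items():
        base = uninfected[gene]
        for abundance in series.values():
            ratio = abundance / base
            # Count both induction (> fold) and repression (< 1/fold).
            if ratio > fold or ratio < 1.0 / fold:
                hits.append(gene)
                break
    return sorted(hits)

# Made-up abundances at 8 and 24 hours after infection.
uninfected = {"geneA": 50.0, "geneB": 1000.0, "geneC": 800.0}
infected = {
    "geneA": {8: 100.0, 24: 400.0},   # induced eightfold by 24 hours
    "geneB": {8: 950.0, 24: 1100.0},  # essentially unchanged
    "geneC": {8: 700.0, 24: 150.0},   # repressed more than fivefold by 24 hours
}
hits = changed_genes(infected, uninfected)
```

Because a fixed fold-change cutoff is least reliable for weakly expressed genes, a fuller treatment would also apply a minimum-intensity filter before computing ratios.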
A similar experimental design has been used to examine the global effects of HIV-1 infection on cultured CD4-positive T cells. One study concluded that HIV-1 infection resulted in differential expression of 20 of the 1,506 human genes monitored and that most of these changes occurred only after 3 days in culture (36). In contrast, the preliminary results of an independent study using a similar design indicated that substantial HIV-induced transcription changes began very early after inoculation (77). The latter study confirmed activation of nuclear factor-κB (NF-κB), p68 kinase, and RNase L.
DNA expression arrays have recently been used to examine the response of host cells to infection by bacterial pathogens. Transcription profiling of macrophages and epithelial cells infected by Salmonella confirmed increased expression of many proinflammatory cytokines and chemokines, signaling molecules, and transcription activators and identified several genes not previously known to be regulated by infection (33,34). The macrophage study demonstrated that exposure to purified Salmonella lipopolysaccharide resulted in a response profile very similar to that elicited by whole cells and that activation of macrophages with gamma interferon before infection modified the response (34). In epithelial cells, overexpression of IκB (an inhibitor of NF-κB) blocked induction of gene expression for a number of regulated genes, underscoring the importance of NF-κB in the proinflammatory response (33).
Similarly, the transcription response of human promyelocytic cells to L. monocytogenes infection has been determined by both oligonucleotide arrays and filter-based arrays (32). Comparison of these data with the Salmonella infection data suggests that the proinflammatory response is grossly conserved: in both cases
Address for correspondence: Craig Cummings, VAPAHCS 154T, Building 100, Room D4-123, 3801 Miranda Ave., Palo Alto, CA 94304, USA; fax: 650-852-3291; e-mail: firstname.lastname@example.org.
Craig A. Cummings* and David A. Relman*†
Stanford University, Stanford, California, USA; VA Palo Alto Health Care System, Palo Alto, California, USA
|Author:||Relman, David A.|
|Publication:||Emerging Infectious Diseases|
|Date:||Sep 1, 2000|