Data explosion: bringing order to chaos with bioinformatics. (Data Explosion).


Scientists say a clearer understanding of gene-toxicant interactions will provide significant new opportunities for protecting public health. But there's a catch: these toxicogenomics promises lie hidden in mountains of data.

Thanks to technology advances, the nucleotide sequences that make up DNA, in addition to the amino acid sequences that make up proteins, are collected with robotic automation and stored by the millions in vast, expanding databases throughout the world. Microarrays, which provide snapshots of thousands of expressed genes simultaneously, are also data-intensive. Years ago, when sequencing was slow and tedious, scientists could study the output manually--no more. By necessity, they now need computers and sophisticated algorithms to wade through it all.

In recent years, the field of bioinformatics has emerged to meet these challenges. By definition, bioinformatics is the process by which informatics--the science of turning data into information--is applied to biology. A combination of computer science, information technology, and molecular biology, bioinformatics allows researchers to quickly access and interpret a rising tide of genomic information. This is critical for the genomic era: scientists are sequencing the genomes of many species, but they know little about how great regions of these genomes and the proteins they give rise to actually function.

In a basic application, bioinformatics allows researchers to search online databases such as GenBank for a given gene's composition, proteins, mutations, coverage in the scientific literature, and many other relevant parameters that are collectively termed "annotation." With more advanced applications, scientists use bioinformatics techniques to model chemical networks in living cells, including those stressed by disease or toxicity.

No researcher can possibly be familiar with all the known interactions in a cell, says Trey Ideker, a computational biologist with the Whitehead Institute for Biomedical Research in Cambridge, Massachusetts. Bioinformatics allows scientists to access, display, and interpret systems-level information. Fueled by bioinformatics, toxicogenomics is becoming an in silico science, with computerized data mining a key source of new discoveries.

Core Repositories

The rise of modern bioinformatics is rooted in the history of protein and nucleotide sequencing. The timeline arguably dates back to 1955, the year a Nobel Prize-winning British biochemist named Frederick Sanger first sequenced the protein bovine insulin. The first completed genome, sequenced in 1977, was that of a virus called phiX174. In subsequent years, scientists have gone on to sequence the genomes of higher organisms, including the human genome, which was completed in April 2003.

At first, sequencing was a slow and tedious process. The traditional technique--which involved gel electrophoresis and autoradiography--allowed scientists to manually sequence a single DNA fragment of 300-500 base pairs in about a day. This technique has been replaced almost entirely by automated high-throughput technologies that process DNA samples to determine the arrangement of nucleotides. The Applied Biosystems sequencers used in the decoding of the human genome, for example, are roughly 6,000 times faster than earlier approaches.

Today, sequencing is an international phenomenon. Entire consortia are devoted to sequencing the genomes of many species, including the human, the rat, the mouse, and many types of fish, birds, and microbes. Most of these sequences eventually wind up in a few publicly available databases. For nucleotides, the chief database in the United States is GenBank, maintained by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine of the NIH. GenBank was actually started by the late physicist Walter Goad of the Los Alamos National Laboratory, who began compiling sequences there in 1979 while initiating efforts to create a national DNA/RNA database. The NIH created GenBank from Goad's original compilation, and the database was transferred to the NCBI from Los Alamos in 1992.

Today, all of GenBank's content is tightly integrated with two other databases, one (the EMBL Nucleotide Sequence Database) maintained by the European Molecular Biology Laboratory in Heidelberg, Germany, and the other (the DNA Data Bank of Japan) by the Center for Information Biology of the Japanese National Institute of Genetics in Mishima. GenBank's place in the U.S. research community is pivotal; most journals won't publish new sequences that GenBank has yet to accept. "GenBank is designed as a repository for all publicly available nucleotide data," explains NCBI staff scientist David Wheeler. "Anyone can come here [via the Internet] and pick what they need in terms of primary sequences."

Another publicly available source of nucleotide data is at The Institute for Genomic Research (TIGR), a nongovernmental research group based in Rockville, Maryland. Unlike GenBank, the TIGR database is populated with data produced by TIGR researchers in addition to data collected from bacterial sequencing projects going on around the world. Initially a pioneer in the field of bacterial genomics (scientists there sequenced the first bacterial genomes in 1995), TIGR has more recently broadened its scope to include nonbacterial species, including the parasites that cause malaria and sleeping sickness. It was also a major contributor to the sequencing of the human genome. The TIGR database is complementary to GenBank, in that it tracks all ongoing bacterial genome sequencing projects, in addition to those that have already been completed.

For protein sequences, the critical database is Swiss-Prot, which is a collaboration of the Geneva-based Swiss Institute of Bioinformatics and the European Bioinformatics Institute of the European Molecular Biology Laboratory. (Within the next three years, the United Protein Database, or UniProt, will combine Swiss-Prot and two other databases; see "Putting Proteins in One Place," p. A336 this issue.) With respect to microarrays, the database options are quite diverse. Among the public databases are the NCBI's Gene Expression Omnibus and ArrayExpress, which is maintained by the European Bioinformatics Institute. Various research organizations maintain a host of smaller "core" databases, including the holdings of the Microarray Center at the NIEHS National Center for Toxicogenomics (NCT) and the Stanford Microarray Database at Stanford University. And finally, the key public database for single-nucleotide polymorphisms, or SNPs, which are simple gene mutations, is dbSNP, maintained by the NCBI.

Side-by-Side Sequences

Bioinformatics has traditionally been focused on "sequence comparisons" performed with an evolving set of computational algorithms. With this process, scientists compare known and unknown sequences in an attempt to infer the properties of the latter. The underlying assumption is that similar sequences are homologous, meaning they are ancestrally related with similar properties across a variety of species. Screening a newly sequenced protein for homologues in Swiss-Prot, for example, provides predictive information about the protein's function, three-dimensional structure, and organization.

Because these predictions are based on sequence homology, they must be confirmed experimentally. Ideker says the ability to find new opportunities for experimentation is fueling a paradigm shift in biology. Because of bioinformatics, he says, biology is becoming a predictive rather than merely descriptive science.

Like sequencing itself, sequence comparisons have evolved from their tedious origins. The first algorithms, such as the Needleman-Wunsch algorithm introduced in 1970, were designed to allow "global alignment." These algorithms align every amino acid or nucleotide in a sequence of interest to a known counterpart in a search for homologous regions.
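To make the idea of global alignment concrete, the following is a minimal sketch of Needleman-Wunsch-style dynamic programming in Python. The scoring values (match, mismatch, and gap) and the function name are assumptions chosen for illustration; real tools use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    The scoring parameters are illustrative assumptions, not values
    taken from any particular tool.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill the table: each cell is the best of diagonal, up, or left.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Because every position in both sequences must be accounted for, the work grows with the product of the two sequence lengths, which is one reason this approach is slow against large databases.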

Current sequence comparison approaches favor "local alignment" strategies that look for short regions of nearly perfect matches. The most widely used of these is the Basic Local Alignment Search Tool (BLAST) software, available from the NCBI. By running BLAST, researchers quickly scan novel sequences against up-to-date content from GenBank and a host of other relevant databases. "BLAST was just amazing to us when it was released in the early nineteen-nineties," recalls Fran Lewitter, director of biocomputing at the Whitehead Institute. "[Before BLAST,] it could take hours to compare sequences. But with BLAST you could enter a sequence into a computer, hit 'return,' and you'd get your answer immediately."
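The notion of local alignment can be illustrated with a short Python sketch. Note that this is not BLAST: BLAST is a fast heuristic, whereas the sketch below uses the exact Smith-Waterman dynamic programming method as a stand-in, with scoring values that are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, local_a, local_b) for the best local alignment.

    This is exact Smith-Waterman, shown only to illustrate local
    alignment; it is not the heuristic that BLAST uses.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Scores are floored at zero so an alignment can start anywhere.
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until the score drops to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Given two otherwise unrelated sequences that share a short common stretch, the function returns only that shared region rather than forcing the flanking residues into the alignment.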

Global and local alignments are often performed sequentially. Researchers will run a sequence through BLAST to identify short regions of high similarity and then run global alignments to identify a wider range of sequences around those alignments. Thus, it is possible to observe evolutionary changes around the more highly conserved surrounding regions.

BLAST is typically the first step for someone consulting GenBank to evaluate a novel sequence. Upon entry of the sequence, BLAST returns lists of accession numbers for other, similar sequences. Researchers click on these accession numbers and through the GenBank interface--known as Entrez--connect to databases of annotated information for the sequence matches. "Entrez is our general search system," Wheeler explains. "It covers data contained in a variety of databases including GenBank, Swiss-Prot, PubMed, and many others."

Another useful search tool for obtaining sequence annotation is Ensembl, offered by the European Bioinformatics Institute and The Wellcome Trust Sanger Institute, a biomedical research organization near Cambridge, United Kingdom. Like Entrez, Ensembl allows users to run BLAST searches and link results to annotated databases by accession number. And the University of California, Santa Cruz, offers a genome browser that is particularly well suited for novel RNA sequences. This particular browser runs sequence comparisons with a program called BLAST-Like Alignment Tool (BLAT). According to Jim Kent, a research scientist with the university's Genome Bioinformatics Team, BLAT maps RNA sequences to the genome at a speed roughly 50 times faster than BLAST.

George Bell, a bioinformatics scientist at the Whitehead Institute, says users are best served by employing a variety of search tools. "It's like searching for movie reviews," he explains. "You don't want to go to just one site; you want as much information as you can get." There are several good reasons to consult multiple sources for sequence matching and information, Bell says. No one site is definitive--the number of published sequences changes every day, as does the amount and quality of associated annotations. Furthermore, automated algorithms are all prone to error. Comparing the output of several sites provides a maximal amount of information. The question of which output to use, Bell emphasizes, is best answered using the researcher's own scientific judgment.

Mining Microarrays

The bioinformatic techniques used to evaluate microarray data differ entirely from those used to compare nucleotides and proteins. In a toxicogenomics experiment with microarrays, fluorescent dyes are used to differentially label RNA from unexposed versus exposed animals. Results are measured in terms of relative fluorescence intensity, a continuous variable that Mike Waters, the NCT's assistant director for database development, says is best compared using classical statistics for measurable outcomes, such as analysis of variance. These analyses can be run using standard desktop software, says Bruce Weir, director of the Bioinformatics Research Center at North Carolina State University in Raleigh. Such programs allow scientists to approximate which genes have been activated or inactivated by chemical exposure.
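As a rough illustration of the analysis of variance mentioned above, the following Python sketch computes a one-way ANOVA F statistic for a single gene measured under several exposure conditions. The function name and the intensity values are invented for illustration only; a real analysis would also convert F to a p-value and correct for testing thousands of genes at once.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of sample groups.

    Each group is a list of measurements (for example, log fluorescence
    ratios for one gene under one exposure condition).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of individual measurements around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical readings for one gene: control, low dose, high dose.
control = [1.0, 2.0, 3.0]
low_dose = [2.0, 3.0, 4.0]
high_dose = [5.0, 6.0, 7.0]
f_stat = one_way_anova_f([control, low_dose, high_dose])
```

A large F value suggests that the differences between exposure groups are large relative to the scatter within each group, which is the pattern expected for a gene that responds to the chemical.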

Multivariate statistics are then applied to microarray data to identify groups of genes that respond concurrently to chemical exposures. There are many techniques for grouping genes in this way, including gene clustering, a statistical method developed by Michael B. Eisen, a scientist with the Life Sciences Division at the Lawrence Berkeley National Laboratory in Berkeley, California.
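One common way to group genes by expression pattern is agglomerative clustering on correlation. The Python sketch below is a simplified illustration under stated assumptions, not a reproduction of Eisen's published method: it uses Pearson correlation with average linkage and stops merging at an arbitrary threshold (`min_corr`), and the gene names and expression values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles.

    Assumes neither profile is constant (a constant profile would
    cause a division by zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_genes(profiles, min_corr=0.8):
    """Merge the most correlated clusters until none reach min_corr.

    profiles maps a gene name to its expression values across
    conditions. Returns a list of clusters (lists of gene names).
    """
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        # Average linkage: mean correlation over all cross-cluster pairs.
        pairs = [pearson(profiles[a], profiles[b]) for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        corr, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if corr < min_corr:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical expression profiles across four exposure conditions.
example_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.0, 2.5],
}
example_clusters = cluster_genes(example_profiles)
```

In this toy data set, the two genes that rise with exposure end up in one cluster and the two that fall end up in another, which is the kind of co-regulation pattern clustering is meant to surface.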

Identifying chemically induced gene clusters is of high value to toxicogenomics. Modern microarrays show the expression of hundreds to thousands of genes simultaneously. Clustering of highly expressed genes provides structure to these voluminous data. "It allows you to find genes that are regulated in the same way," Kent explains. "You may find these clusters are tissue-specific. Clustering basically allows you to create groups of gene families as we do with sequence homology. Therefore, we can infer something about the gene's function according to the family to which it belongs."

An effort to apply microarrays to toxicogenomics is currently under way at the Microarray Center at the NIEHS. Pierre Bushel, bioinformatics manager at the center, says data generated there are shared with public repositories such as Gene Expression Omnibus and ArrayExpress, in addition to a new "knowledge base" at the NCT called Chemical Effects in Biological Systems. With this knowledge base, the NIEHS aims to provide the ultimate international resource for all toxicogenomics data. Bushel says most of the microarray chips currently used by the NCT are prepared in-house.

A key objective, says Waters, is to ensure that annotation for all of the center's microarrays is current. This is a tall order, he admits. Annotated information in the public domain is continually updated. Ultimately, Waters says, the NCT wants to automate its annotation, using distributed annotation servers that track GenBank, Swiss-Prot, and other major databases, pulling in new information as it becomes available.

Presently, the NCT is working with Agilent Technologies on a mouse microarray for toxicogenomics studies. According to James Selkirk, deputy director of the NCT, this "ToxChip" is being designed in cooperation with the NIEHS-funded Toxicogenomics Research Consortium, a group of five academic research centers plus the Microarray Center. The intention, he says, is to produce a chip containing a large number of genes thought to be relevant to the toxicity of environmental agents. "This should be something that is of wide interest to the microarray profiling public," Selkirk says.

Computing Biology

At a certain point, the knowledge gained from studying sequences and microarrays sets the stage for investigations of cellular networks and pathways. Toxicity is manifested by a stunningly complex array of cellular events. The nature of these complex systems is studied with an extension of bioinformatics called computational biology. Whether the two fields are actually distinct is a matter of debate. One view suggests that bioinformatics deals with the acquisition, storage, and presentation of data, whereas computational biology applies the data to biological models. But in general, both fields cover the spectrum of computer-related activities in biological research.

In some ways, computational biology is more applicable to proteomics--the study of protein function in biological systems--where experts say the biomedical benefits of genomic knowledge will ultimately be found. "The actual network of molecular interactions is elucidated with proteomics," says Ideker. "Researchers in this field are asking two key questions: what are the protein-protein interactions, and what are the protein-DNA interactions? These are the fundamental iterations that we're concerned with."

According to Ideker, a number of experimental methods predominate in this type of research. These methods are currently focused mainly on studies in yeast. For protein-protein interactions, Ideker says, a key method is the 2-hybrid system (also known as the yeast 2-hybrid system). This experimental system allows researchers to screen for interactions in large numbers of yeast proteins simultaneously.

A high-throughput method for assessing protein-DNA interactions has been developed by Richard Young, a biology professor at the Massachusetts Institute of Technology and a member at the Whitehead Institute. Young's method is based on a technique known as immunoprecipitation. In brief, the technique involves tagging proteins, cross-linking them with DNA in a cell, and then purifying the protein-DNA linkages. By uncrossing the linkages, scientists are able to evaluate the nature of the protein-DNA interactions.

These interactions can then be made publicly available via a number of online repositories. According to Ideker, one of the best repositories for protein-protein interactions is the Biomolecular Interaction Network Database, coordinated in part by Genome Canada, a genomics research organization based in Ottawa. This database is specifically designed for studies in computational biology. An important repository for protein-DNA interactions is the Transcription Factor Database, coordinated by Research Group Bioinformatics of Germany.

According to Ideker, computational biologists mine these repositories to model cell networks. It's now possible to construct models of "virtual cells" that are broad although not detailed, he says. "It's also possible to really nail a particular pathway," he adds. Ideker is currently collaborating with Leona Samson, a professor of toxicology at the Massachusetts Institute of Technology, on computational studies investigating pathways of DNA repair following exposure to chemical mutagens.
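A very simple way to picture such a network model is as a graph in which proteins are nodes and reported interactions are edges. The Python sketch below builds that graph from a list of interaction pairs and finds the shortest chain of interactions between two proteins. The protein names (P1, P2, and so on) and function names are hypothetical placeholders, and the sketch does not use the format of any actual interaction database.

```python
from collections import defaultdict, deque


def build_network(interactions):
    """Build an undirected graph from (protein, protein) interaction pairs."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return network


def shortest_path(network, start, goal):
    """Breadth-first search for the shortest chain of interactions.

    Returns the list of proteins from start to goal, or None if the
    two proteins are not connected.
    """
    if start not in network or goal not in network:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(network[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical interaction pairs, not drawn from any real database.
example_network = build_network([
    ("P1", "P2"), ("P2", "P3"), ("P3", "P4"),
    ("P1", "P5"), ("P6", "P7"),
])
```

Real pathway models add far more detail, such as interaction types, directions, and reaction rates, but even this bare graph view supports questions such as which proteins connect a DNA-damage sensor to a repair enzyme.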

Eventually, scientists hope to pull all the available genomic data into complete models that also address the influence of genetic mutations such as SNPs. These models will allow researchers to assess how genomic variations contribute to disease or the response to toxicants. But many difficult challenges remain. For instance, database information must be maintained in compatible formats for global searches. Databases must also be updated with respect to the ever-increasing body of biological knowledge. And of course, scientists still need to extrapolate the results of experiments in lower organisms such as yeast to mammalian systems, humans in particular. "We're dealing with a level of exceeding complexity," Waters says. "These are not advances that are going to come overnight."
COPYRIGHT 2003 National Institute of Environmental Health Sciences
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2003, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Author:Schmidt, Charles W.
Publication:Environmental Health Perspectives
Date:May 15, 2003
Words:2696