
Putting proteins in one place. (Bioinformatics).

The growing wealth of information about the human proteome--the hundreds of thousands of proteins at work in the human body--is useful only if scientists can get their hands on it. To give researchers faster worldwide access to high-quality protein data, the National Human Genome Research Institute (NHGRI) and five other institutes and centers of the NIH have awarded $15 million to create a comprehensive, public data bank of protein sequences.

The United Protein Database, or UniProt, will combine three existing databases--Swiss-Prot, TrEMBL, and the Protein Information Resource (PIR). By the end of the three-year grant, UniProt should contain annotated entries on more than 2 million proteins, including information on protein sequences, functions, modifications, and other characteristics.

UniProt brings together scientists and resources that complement one another, says Peter Good, program director for genome informatics and computational biology at the NHGRI. The database will combine 830,000 entries from TrEMBL, 123,000 entries from Swiss-Prot, and 283,000 entries from PIR.

TrEMBL contains more entries because it is a computational database--computer programs use the protein sequences to make predictions of protein function. Swiss-Prot uses the more time-consuming hand-annotation method, which means that a scientist reads articles that mention a particular protein, extracts the relevant information, then adds it to the database. PIR, operated by Georgetown University Medical Center and the National Biomedical Research Foundation in Washington, D.C., contains both computer-annotated and hand-annotated entries. The PIR will cease to be updated, and its staff will assist with hand-annotating the TrEMBL records.

PIR will also contribute its "protein family" method of classification, which groups proteins by function based on sequence similarity. If two proteins fall into the same family, scientists can infer that the proteins may have similar functions. This method--created by one of the pioneers of protein sequence databases, Margaret Dayhoff--has been developed further by Cathy Wu, director of bioinformatics for PIR and one of the principal investigators of UniProt.
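The idea of grouping proteins by sequence similarity can be illustrated with a minimal Python sketch. This is not PIR's actual classification method: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for a real alignment score, and the 0.8 threshold, the function names, and the toy sequences are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Rough similarity between two sequences, from 0.0 to 1.0.

    Stand-in for a real alignment score; not a biological measure.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def group_into_families(sequences: dict[str, str],
                        threshold: float = 0.8) -> list[set[str]]:
    """Single-linkage grouping: a protein joins a family if it is
    at least `threshold` similar to any existing member."""
    families: list[set[str]] = []
    for name, seq in sequences.items():
        # Families that already contain a sufficiently similar member.
        matches = [fam for fam in families
                   if any(similarity(seq, sequences[m]) >= threshold
                          for m in fam)]
        merged = {name}
        for fam in matches:
            merged |= fam
            families.remove(fam)
        families.append(merged)
    return families


# Hypothetical toy sequences (not real database entries).
toy = {
    "protA": "MKTAYIAKQRQISFVKSHFSRQ",
    "protB": "MKTAYIAKQRQISFVKSHFSRA",
    "protC": "GGSWWPLLNNDDEEHHCCYYTT",
}
families = group_into_families(toy)
```

Here protA and protB differ by a single residue and land in the same family, while protC stands alone. A researcher could then tentatively carry over what is known about one family member to the others, which is the inference the article describes.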

Other UniProt principal investigators are Rolf Apweiler, who is head of the Sequence Database Group at the European Bioinformatics Institute, and Amos Bairoch, who is group leader of the Swiss-Prot Group at the Swiss Institute of Bioinformatics.

Any biomedical scientist interested in protein function will benefit from the new database, Good says. Drug discovery in particular involves pinpointing proteins that are altered in diseased tissue, then exploring these proteins further to determine if they will make good targets for new drugs.

A typical proteomics experiment uses mass spectrometry to identify proteins and parts of their sequences. "The way to add meaning to those sequences is to search databases, which contain links to essentially all human knowledge surrounding those protein targets," says Tim Haystead, an associate professor of pharmacology and cancer biology at Duke University in Durham, North Carolina, and founder of the drug discovery company Serenex. Right now, scientists have to search several different databases to find all existing information on a sequence. "A unified database will make it easier for us," Haystead says.

William Pearson, a professor of biochemistry and molecular genetics at the University of Virginia and a member of PIR's oversight and scientific advisory board, agrees that scientists who work with protein sequence data have been frustrated both by the need to search multiple databases and by the sometimes contradictory information that results from the databases' different methods of annotation. "When these [databases] all get put together, they're going to have much more consistent ways of referencing data and giving names to things," he says. "It will be much more efficient."
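To make the naming problem concrete, here is a small hedged sketch in Python of how records from several sources might be merged by sequence and checked for conflicting names. It does not describe UniProt's actual merge procedure; the record fields (`source`, `name`, `sequence`), the function names, and the sample data are hypothetical.

```python
from collections import defaultdict


def merge_records(records: list[dict]) -> dict[str, dict]:
    """Collect the names and sources recorded for each distinct sequence."""
    by_sequence: dict[str, dict] = defaultdict(
        lambda: {"names": set(), "sources": set()})
    for rec in records:
        entry = by_sequence[rec["sequence"]]
        entry["names"].add(rec["name"])
        entry["sources"].add(rec["source"])
    return dict(by_sequence)


def conflicting(merged: dict[str, dict]) -> dict[str, dict]:
    """Sequences that carry more than one name across sources."""
    return {seq: e for seq, e in merged.items() if len(e["names"]) > 1}


# Hypothetical records; the names and sequences are invented.
records = [
    {"source": "Swiss-Prot", "name": "Kinase X", "sequence": "MKTAYIAK"},
    {"source": "TrEMBL", "name": "Putative kinase", "sequence": "MKTAYIAK"},
    {"source": "PIR", "name": "Transporter Y", "sequence": "GGSWWPLL"},
]
merged = merge_records(records)
conflicts = conflicting(merged)
```

In this toy data the same sequence appears under two names in two sources, so it is flagged for reconciliation, while the third record merges cleanly.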

Funded in October 2002, UniProt is still a work in progress. "Part of the challenge is getting three groups that have different [organizational] cultures to interact," Good says. For continuity, the three groups will maintain their current search interfaces, each of which will eventually access the entire UniProt database. UniProt will also be accessible via a central website. A basic version of that site will be up this year, according to Apweiler.

UniProt will be freely available to all, but not until after the expiration of a license with industry covering access to Swiss-Prot records by commercial users. The license for Swiss-Prot records allowed the European Bioinformatics Institute and the Swiss Institute of Bioinformatics to continue developing Swiss-Prot in the absence of government funding. With support from the NIH, the entire UniProt database will be available free of charge for both academic and commercial users by January 2005.
COPYRIGHT 2003 National Institute of Environmental Health Sciences

Article Details
Author:Spivey, Angela
Publication:Environmental Health Perspectives
Date:May 15, 2003