Putting proteins in one place. (Bioinformatics).
The growing wealth of information about the human proteome--the hundreds of thousands of proteins at work in the human body--is useful only if scientists can get their hands on it. To give researchers faster worldwide access to high-quality protein data, the National Human Genome Research Institute (NHGRI) and five other institutes and centers of the NIH have awarded $15 million to create a comprehensive, public data bank of protein sequences.
The United Protein Database, or UniProt, will combine three existing databases--Swiss-Prot, TrEMBL, and the Protein Information Resource (PIR). By the end of the three-year grant, UniProt should contain annotated entries on more than 2 million proteins, including information on protein sequences, functions, modifications, and other characteristics.
UniProt brings together scientists and resources that complement one another, says Peter Good, program director for genome informatics and computational biology at the NHGRI. The database will combine 830,000 entries from TrEMBL, 123,000 entries from Swiss-Prot, and 283,000 entries from PIR.
TrEMBL contains more entries because it is a computational database--computer programs use the protein sequences to make predictions of protein function. Swiss-Prot uses the more time-consuming hand-annotation method, which means that a scientist reads articles that mention a particular protein, extracts the relevant information, then adds it to the database. PIR, operated by Georgetown University Medical Center and the National Biomedical Research Foundation in Washington, D.C., contains both computer-annotated and hand-annotated entries. The PIR will cease to be updated, and its staff will assist with hand-annotating the TrEMBL records.
PIR will also contribute its "protein family" method of classification, which groups proteins by function based on sequence similarity. If two proteins fall into the same family, scientists can infer that the proteins may have similar functions. This method--created by one of the pioneers of protein sequence databases, Margaret Dayhoff--has been developed further by Cathy Wu, director of bioinformatics for PIR and one of the principal investigators of UniProt.
Other UniProt principal investigators are Rolf Apweiler, who is head of the Sequence Database Group at the European Bioinformatics Institute, and Amos Bairoch, who is group leader of the Swiss-Prot Group at the Swiss Institute of Bioinformatics.
Any biomedical scientist interested in protein function will benefit from the new database, Good says. Drug discovery in particular involves pinpointing proteins that are altered in diseased tissue, then exploring these proteins further to determine if they will make good targets for new drugs.
A typical proteomics experiment uses mass spectrometry to identify proteins and parts of their sequences. "The way to add meaning to those sequences is to search databases, which contain links to essentially all human knowledge surrounding those protein targets," says Tim Haystead, an associate professor of pharmacology and cancer biology at Duke University in Durham, North Carolina, and founder of the drug discovery company Serenex. Right now, scientists have to search several different databases to find all existing information on a sequence. "A unified database will make it easier for us," Haystead says.
William Pearson, a professor of biochemistry and molecular genetics at the University of Virginia and a member of PIR's oversight and scientific advisory board, agrees that scientists who work with protein sequence data have been frustrated both by the need to search multiple databases and by the sometimes contradictory information that results from the databases' different methods of annotation. "When these [databases] all get put together, they're going to have much more consistent ways of referencing data and giving names to things," he says. "It will be much more efficient."
Funded in October 2002, UniProt is still a work in progress. "Part of the challenge is getting three groups that have different [organizational] cultures to interact," Good says. For continuity, the three groups will maintain their current search interfaces, each of which will eventually access the entire UniProt database. UniProt will also be accessible via a central website, http://www.uniprot.org/. A basic version of that site will be up this year, according to Apweiler.
UniProt will be freely available to all, but not until after the expiration of a license with industry covering access to Swiss-Prot records by commercial users. The license for Swiss-Prot records allowed the European Bioinformatics Institute and the Swiss Institute of Bioinformatics to continue developing Swiss-Prot in the absence of government funding. With support from the NIH, the entire UniProt database will be available free of charge for both academic and commercial users by January 2005.