Basically, DNA is a computing problem


The computing resources of the Sanger Institute at Hinxton, near Cambridge, are almost unfathomable. Three rooms are filled with walls of blade servers and drives, and there is a fourth that is kept fallow, and for the moment full of every sort of debris: old Sun workstations, keyboards, cases and cases of backup tapes - even a dishwasher. But the fallow room is an important part of the centre's preparations. Things are changing so fast that they can have no idea what they will be required to do in a year's time.

When Tony Cox, now the institute's head of sequencing informatics, was a post-doctoral researcher, he could sequence 200 bases of DNA in a day (human DNA has about 3bn bases). The machines being installed today can do 1m bases an hour. What will be installed in two years' time is anyone's guess, but the centre is as ready as it can be.
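
To put those rates side by side, here is a rough sketch of the arithmetic. It uses only the three figures quoted above and assumes, purely for illustration, a single pass over the genome on one machine with no repeat reads; the variable and function names are invented for this example.

```python
# Figures quoted in the article.
HUMAN_GENOME_BASES = 3_000_000_000   # "about 3bn bases"
OLD_RATE_PER_DAY = 200               # Cox as a post-doctoral researcher
NEW_RATE_PER_HOUR = 1_000_000        # machines being installed today


def days_to_sequence(bases, bases_per_day):
    """Days needed to read `bases` once at a constant daily rate."""
    return bases / bases_per_day


# Assumption: the new machine runs around the clock.
new_rate_per_day = NEW_RATE_PER_HOUR * 24

old_days = days_to_sequence(HUMAN_GENOME_BASES, OLD_RATE_PER_DAY)
new_days = days_to_sequence(HUMAN_GENOME_BASES, new_rate_per_day)
speedup = new_rate_per_day / OLD_RATE_PER_DAY

print(f"At 200 bases a day: {old_days:,.0f} days (about {old_days / 365:,.0f} years)")
print(f"At 1m bases an hour: {new_days:,.0f} days")
print(f"Speed-up: {speedup:,.0f}x")
```

On these assumptions a single machine goes from tens of thousands of years for one pass to about four months, which is why the centre installs them by the dozen.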

Invisible revolution

Genome sequencing, which is what the centre excels at, has wrought a revolution in biology that many people think they understand. But it has happened alongside a largely invisible revolution, in which molecular biology - which even 20 years ago was done in glassware inside laboratories - is now done in silicon.

A modern sequencer itself is a fairly powerful computer. The new machines being brought online at the Wellcome Trust Sanger Institute are robots from waist-height upwards, where the machinery grows and then treats microscopic specks of DNA in serried ranks so that a laser can illuminate it and a moving camera capture the fluorescing bases every two seconds. The lower half of each cabinet holds the computers needed to coordinate the machinery and do the preliminary processing of the camera pictures. At the heart of the machine is a plate of treated glass about the size of an ordinary microscope slide, which contains around 30m copies of 2,640 tiny fragments of DNA, all arranged in eight lines along the glass, and all with the bases at their tips being directly read off by a laser.

To one side is a screen which displays the results. The sequencing cabinet pumps out 2MB of this image data every second for each two-hour run. With 27 of the new machines running full tilt, each one will produce a terabyte every three days. Cox was astonished when he did the preliminary calculations. "It was quite a simple back-of-the-envelope calculation: right, we've got this many machines, and they're producing this much data, and we need to hold it for this amount of time and we sort of looked at it and thought: oh, shit, that's 320TB!"
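
Cox's envelope can be roughly reconstructed from the figures quoted here. The sketch below is an illustration, not his actual working: the article does not say how long the data has to be held, so the code works backwards from 320TB to see how many days of output that would cover. It assumes decimal units, a constant data rate during a run, and that the whole 320TB holds raw machine output; all names are invented for the example.

```python
# Figures quoted in the article.
MB_PER_SECOND = 2                  # image data per sequencing cabinet
RUN_HOURS = 2                      # length of one run
MACHINES = 27
TB_PER_MACHINE_PER_3_DAYS = 1      # "a terabyte every three days"
STORE_TB = 320

# Data from a single two-hour run, in gigabytes (1 GB = 1000 MB).
run_gb = MB_PER_SECOND * RUN_HOURS * 3600 / 1000

# Output of the whole fleet per day, in terabytes.
fleet_tb_per_day = MACHINES * TB_PER_MACHINE_PER_3_DAYS / 3

# Assumption: the store holds nothing but raw output from these machines.
days_of_output = STORE_TB / fleet_tb_per_day

print(f"One run: {run_gb:.1f} GB")
print(f"Fleet output: {fleet_tb_per_day:.0f} TB a day")
print(f"320TB covers about {days_of_output:.0f} days of output")
```

Read that way, 320TB is a little over a month of output from the 27 machines, before any of the analysis space Cox goes on to describe.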

Think of it as the biggest Linux swap partition in the world, since the whole system is running on Debian Linux. The genome project uses open source software as much as possible, and one of its major databases is run on MySQL, although others rely on Oracle.

"History has shown," says Cox, "that when we have created - it used to be 20TB or 30TB, maybe - of sequencing data, for the longer term storage, then you may need 10 times that in terms of real estate, and computational process, to analyse and compare and all the things that you want to do with it. So having produced something in the order of 100TB to 200TB of sequential data, then the layer beyond that, the scratch space Scratch space is space on the hard disk drive that is dedicated for only temporary storage. It cannot be used to permanently backup files. Scratch disks can be set to erase all data at regular intervals so that the disk space is left free for future use. , and the sequential analysis In statistics, sequential analysis is statistical analysis where the sample size is not fixed in advance. Instead data is evaluated as it is collected, and further sampling is stopped in accordance with a pre-defined stopping rule as soon as significant results are observed. , and so on - to be honest, we are still teasing out what that means, but it's not going to be small."

Down in the rooms where the servers are farmed you must raise your voice to be heard above the fans. A wall of disk drives about 3m long and 2m high holds that 320TB of data. In the next aisle stands a similarly sized wall of blade servers with 640 cores, though no one can remember exactly how many CPUs are involved. "We moved into this building with about 300TB of storage real estate, full stop," says Phil Butcher, the head of IT. "Now we have gone up to about a petabyte and a half, and the last 320 of that was just to put this pipeline together."

This new technology is the basis for a new kind of genomics, with really frightening implications. The ballyhooed first draft of the Human Genome Sequence in 2000 was a hybrid of many people's DNA; like scripture, it is authoritative, but not accurate. Now the Sanger Institute is gearing up for its part in a project to sequence accurately 1,000 individual human genomes, so that all of their differences can be mapped. The idea is to identify every single variation in human DNA that occurs in 0.5% or more of the population sampled. This will require one of the biggest software efforts in the world today.

Although it is only very rare conditions that are caused by single gene defects, almost all common conditions are affected by a complex interplay of factors along the genome, and the Thousand Genome Project is the first attempt to identify the places involved in these weak interactions. This won't be tied to any of the individual donors, who will all be anonymous. But mapping all the places where human genomes differ is the first necessary step towards deciding which differences are significant, and of what.

There are three sorts of differences between your DNA - or mine, or anyone's - and the sequence identified in the human genome project. There are the SNPs, where a single base change can be identified; these are often significant, and are certainly the easiest things to spot. Beyond that are the changes affecting tens of bases at a time: insertions and deletions within genes; finally there are the changes which can affect relatively long strings of DNA, whole genes or stretches between genes, which may be copied or deleted in different numbers. The last of these are going to be extremely hard to spot, since the DNA must be sequenced in fragments that may be shorter than the duplications themselves. "It's a bit like one of those spot the difference things," Cox says. "If you have 1,000 copies, it's very much easier to spot the smallest differences between them."

Genome me?

All of the work of identifying these changes along the 3bn bases of the genome must be done in software and - since the changes involved are so rare - each fragment of every genome must be sequenced between 11 and 30 times to be sure that the differences the software finds are real and not just errors in measurement. But there's no doubt that all this will be accomplished. The project is a milestone towards genome-based medicine, in which individual patients could be sequenced as a matter of course.
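
Why read the same fragment so many times? A toy sketch makes the point. It is not the project's software: it simply takes the bases reported at one position by repeated reads and accepts the commonest one only if enough reads agree, so a lone misread is outvoted. The 0.8 threshold and the function names are assumptions made up for this illustration, and the sketch ignores the fact that a person carries two copies of most of the genome.

```python
from collections import Counter

# Figures quoted in the article.
GENOME_BASES = 3_000_000_000
GENOMES = 1_000


def total_bases_read(coverage):
    """Bases the machines must read if every genome is covered `coverage` times."""
    return GENOME_BASES * GENOMES * coverage


def call_base(observed, min_fraction=0.8):
    """Return the commonest base at one position if at least `min_fraction`
    of the reads agree on it; otherwise return None (no confident call)."""
    if not observed:
        return None
    base, count = Counter(observed).most_common(1)[0]
    return base if count / len(observed) >= min_fraction else None


low = total_bases_read(11)
high = total_bases_read(30)
print(f"Bases to read for 1,000 genomes: {low:,} to {high:,}")

# Eleven reads of one position: ten say A, one says G (a likely misread).
print(call_base("AAAAAAAAAAG"))   # the single G is outvoted
# Two reads that disagree cannot settle anything.
print(call_base("AG"))
```

With one or two reads an error and a real difference look identical; with 11 or more, a stray misread stands out against the rest.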

Once that happens, the immense volumes of data that the Sanger Institute is gearing up to handle will become commonplace. But the project is unique in that it must not just deal with huge volumes of data, but keep all of it easily accessible so different parts can quickly be compared with each other.

At this point, the old sort of science is almost entirely irrelevant. "It now has come out of the labs and into the domain of informatics," Butcher says. The Sanger Institute, he says, is no longer just competing for scientists. It is about to embark on this huge Linux project just at the time that the rest of the world has discovered how reliable and useful it can be, so that they have to compete with banks and other employers for people who can manage huge clusters with large-scale distributed file systems. Perhaps the threatened recession will have one useful side effect, by freeing up programmers to work in science rather than the City.
Copyright 2008 guardian.co.uk
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright (c) Mochila, Inc.

Article Details
Author: guardian.co.uk
Publication: guardian.co.uk
Date: Feb 28, 2008
Words: 1301