Astronomy & big data: how will astronomers cope with the tsunami of raw data soon to pour in from wide-field surveys?
How much is that exactly? Winter made an analogy with Blu-ray discs. A standard, dual-layer Blu-ray holds about 50 gigabytes of data, so in one day SDO generates 60 Blu-rays' worth of data. The mission has been running for six years--131,400 Blu-rays--and the team hopes to observe the Sun's entire 11-year cycle. Imagine trying to watch all those discs, deleted scenes and all.
The same thing is happening all over astronomy. And today's "big data" pales next to what's coming, both with surveys already underway, such as the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) and with those in development, including the Large Synoptic Survey Telescope (LSST) and the Square Kilometer Array. These and other visual and radio telescopes will blow SDO out of the sky in terms of data output.
To explore what big data means in the astronomy context, and to look at the challenges it presents, let's take a close look at the principal U.S. project being developed in ground-based astronomy for the 2020s: the LSST.
A Data Tsunami
The LSST is currently under construction near La Serena, Chile. When completed, it will share a ridge on Cerro Pachón with the Gemini South and Southern Astrophysical Research telescopes. Its chief funders are the U.S. National Science Foundation (for its facility and telescope), the U.S. Department of Energy (its camera), and various philanthropists (its mirror). This year alone LSST will consume about $140 million, with an estimated $1 billion over its lifetime.
The LSST will have three components: an 8-meter wide-field telescope, a 3.2-billion-pixel camera, and an automated data-processing system. The scope's field of view will be much larger than that of any other 8-meter telescope: 3.5°, or about seven times the diameter of the full Moon. Every 40 seconds or so the telescope will point to a new area of the sky, and its camera will take two 15-second exposures (to efficiently reject cosmic rays). The camera will record using six different visual and near-infrared filters, and over its anticipated 10-year run it will capture some 40 billion objects in an unprecedentedly large volume of the universe.
Besides being wide, fast, and deep, LSST will add the time domain. After those 10 years, this will result in a sort of stop-motion movie of much of the celestial sphere. All told, the camera will image about 10,000 square degrees of sky every three clear nights. LSST will not stop for follow-up observations but will simply keep harvesting night after night.
LSST leaders expect astronomers to use the output from this technological juggernaut to explore many key science areas. Four in particular will be counting asteroids and other moving objects in our solar system, mapping the structure and evolution of the Milky Way, exploring transient phenomena such as supernovae and other variable stars, and constraining the nature of dark energy and dark matter. But the science will be up to the scientific community. LSST will simply collect the data--a continuous tsunami of it.
The project expects to accumulate 15 terabytes of raw data each night--five times what SDO garners daily. Fifteen terabytes a night is a lot, but within a few years, the cumulative data will become petabytes in size. (A petabyte is 1,000 terabytes; see the table above right.) "That's a scale we're not typically used to," says Andrew Connolly (University of Washington) of his fellow astronomers.
After its initially budgeted 10-year run, LSST will have amassed 54,750 terabytes of raw data, or 1,095,000 Blu-rays' worth. How high a stack of discs would that be, one lying flat atop the other, with no cases? Just over 4,300 feet (1,300 m), or about 2½ times the height of the 1,776-foot Freedom Tower in New York City (see diagram on page 17).
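The disc-stack comparisons are easy to verify. Below is a minimal Python sketch of the arithmetic; the 50-gigabyte capacity comes from the article, while the 1.2 mm disc thickness is the standard Blu-ray figure, assumed here.

```python
# Back-of-envelope check of the Blu-ray stack heights quoted in the
# article. Capacity is the article's dual-layer figure; thickness is
# the standard 1.2 mm for a bare disc, no case.

BLU_RAY_BYTES = 50e9       # dual-layer Blu-ray capacity, ~50 GB
DISC_THICKNESS_M = 0.0012  # one disc is 1.2 mm thick

def stack_height_m(total_bytes):
    """Height in meters of the discs needed to hold total_bytes."""
    discs = total_bytes / BLU_RAY_BYTES
    return discs * DISC_THICKNESS_M

# LSST raw data after 10 years: 15 TB/night * 365 nights * 10 years
lsst_raw = 15e12 * 365 * 10           # 54,750 terabytes
print(stack_height_m(lsst_raw))       # ~1,314 m, just over 4,300 feet

# Raw plus processed data after 10 years: ~200 petabytes
print(stack_height_m(200e15))         # ~4,800 m, about Mont Blanc's height
```

The same function reproduces the other stack heights quoted in the article when fed the corresponding data volumes.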
But that's just the raw data. When all the processed data and such are included, the total for that 10-year stretch will amount to about 200 petabytes, says LSST director Steven Kahn (Stanford University).
How can astronomers possibly deal with that much input? How will they actually use it? No one really knows. But they're working double-time to find out.
Small Data Gets Big
For centuries after Galileo first aimed a telescope at the heavens, astronomers mostly trained their instruments on individual objects or small samples of cosmic sources. Data sets were totally manageable, even into the modern era. As Winter said in his SDO talk, "When I was a graduate student, the way you picked out interesting things on the Sun was you locked a graduate student in the basement to go through the day's files, and they would tag what was interesting."
But in recent decades, large survey projects increasingly have been displacing single-object studies. One example is the Sloan Digital Sky Survey. Its 2.5-meter telescope in New Mexico conducted a thorough visible-light survey of one-third of the observable sky. All told, SDSS recorded position and brightness for a billion stars, galaxies, and quasars, along with spectra of a million objects. And that was just SDSS's first phase; follow-up survey work has continued ever since.
There had been other wide-field sky surveys--most notably, the photographic Palomar Sky Surveys during the 1950s and 1980s--but SDSS pulled astronomy straight into the big-data era. The project resulted in a spectacular amount of groundbreaking science, much of it unforeseen by the venture's designers. Among other findings, SDSS enabled major discoveries regarding active galactic nuclei, the substructure of the Milky Way, and baryon acoustic oscillations (S&T: Apr. 2016, p. 22). Since routine operations began in 2000, SDSS data have been used in more than 5,800 peer-reviewed publications in astronomy and other sciences, and those papers have been cited nearly 250,000 times.
"Every science agency around the world, every observatory manager, saw this and said, '5,000 papers for only $100 million?'" says astronomer Eric Feigelson (Pennsylvania State University), citing the estimated cost of SDSS over about 10 years. "It was probably the cheapest science-per-dollar ever achieved. And so everyone said, 'Let's do it, too.'"
Today an alphabet soup of wide-field surveys is either already or soon to be in operation (some of the largest are shown in the diagram above right). There are also biggish-data instruments planned for space. Traditionally, NASA hasn't designed its missions to transmit telemetry in megabytes per second, so previously any heavy data processing was done aboard Kepler, Chandra, and other space-based instruments. But besides the SDO, we'll soon have NASA's Wide Field Infrared Survey Telescope (WFIRST) and the European Space Agency's Euclid. Both are wide-field-survey space telescopes designed to help answer questions about dark energy, exoplanets, and other hot topics in astronomy and astrophysics.
A Nightly Ritual
Once the LSST survey starts--currently scheduled for October 2022--the project team will handle initial processing so that astronomers can quickly and easily make use of the observations to do science. This upfront work will comprise basic data analysis, including characterizing sources in terms of their color, shape, motion on the sky, and time variability. The LSST team will also ensure consistent data quality and assemble the object catalog. Much of this standard pipeline processing will be highly automated: the data volumes are so massive that they preclude human examination of all but the tiniest fraction.
The LSST team can't get behind on processing the raw data, because if they do, they'll never catch up. To help protect against this, the project plans to staff two primary data centers, the main one at the National Center for Supercomputing Applications in Urbana, Illinois, and a backup center in Lyon, France. It will also furnish multiple copies of the full data, and each year a new run will reprocess the entire available survey data set.
But LSST leaders are confident that such advance efforts will not present an obstacle, nor will storage. As the project explains on its website, "While LSST is making a novel use of advanced information technology, it is not taking the risk of pushing the expected technology to the limit." LSST will have two redundant, 40-gigabit-per-second optical-fiber links from La Serena to Urbana. Such dedicated long-haul networks mean that even transferring that amount of information is not expected to be difficult, Kahn says. "What is a difficult problem," he adds, "is finding anything in it."
Algorithm as Instrument
For each object that ends up in the LSST catalog, the team will typically measure tens of parameters: its position, shape, brightness, color, and so on. Over its planned 10-year run, LSST will make about 1,000 observations of every object, giving astronomers information about stars, galaxies, and other entities as a function of time --that stop-motion movie. So, in addition to astronomical amounts of data, the phase space that the data occupy will have thousands of dimensions. How are astronomers expected to deal with that? One word: software.
In the old days, astronomers went to the observatory to make discoveries. Now more often they go to the database --in fact, to extremely large databases, or XLDBs. The XLDBs arising from LSST will be trillions of lines long, Kahn says. Searching them in a linear way, row by row, would be too time-intensive even for the world's fastest computers, he says. As such, beyond thorough indexing, inventive new algorithms will be critical to effectively mine the LSST data.
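To illustrate why row-by-row scanning loses out to indexing, here's a toy sketch using SQLite from Python. The `objects` table and its columns are invented for the example and bear no relation to LSST's actual schema; the principle, walking a B-tree index instead of scanning every row, is what scales.

```python
import sqlite3

# Toy version of a survey object catalog. Real LSST tables will run to
# trillions of rows, but the planner's logic is the same.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE objects (id INTEGER, ra REAL, dec REAL, mag REAL)")
con.executemany(
    "INSERT INTO objects VALUES (?, ?, ?, ?)",
    [(i, (i * 0.036) % 360.0, -30.0 + (i % 60), 18.0 + (i % 7) * 0.5)
     for i in range(100_000)],
)

# An index on magnitude lets range queries touch only the matching rows.
con.execute("CREATE INDEX idx_mag ON objects (mag)")

query = "SELECT COUNT(*) FROM objects WHERE mag < 18.5"

# The query plan confirms the index is used instead of a full scan.
plan = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan)  # the plan's detail string mentions idx_mag

bright = con.execute(query).fetchone()[0]
print(bright)  # number of "bright" objects in the toy catalog
```

With the index absent, the same query would visit all 100,000 rows; with it, SQLite descends the index and stops at the range boundary.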
We're entering an era in which the algorithm is the instrument, says astrostatistician Thomas Loredo (Cornell University). It's as the telescope used to be, in the sense of being the intermediary between the sky and discovery. "The knowledge comes not from opening a dome on the sky--that's happening every night for you--but from making the right types of queries of the database and then knowing what to do with the numbers," Loredo says.
Kahn agrees. The greatest innovations in working with reams of wide-field survey data will come from creative querying. In the project's XLDBs, for example, how will astronomers ferret out what they like to call "unknown unknowns"? As Kahn puts it, "What sort of new phenomena are out there that we never knew about before? So of all the kinds of things we already know about, how do we identify those in the data and exclude them so we can find the things that don't look like that?"
The most unexpected findings will likely come from probing regions astronomers haven't been able to probe before. "There are holes in our knowledge--you know, time scales of variability and rareness, like things that only happen once per year per cubic gigaparsec," says Loredo, chuckling. (A gigaparsec equals about 3.26 billion light-years.) "That's a region of phase space that we haven't been able to look in before."
Altogether, astronomers have to change how they think about and work with databases. SDSS was transformative in that it made its results available to anyone. But astronomers did science with SDSS by querying the database for the observations they were interested in, then downloading those to their local machine and manipulating them there. With mega data sets such as LSST's, this model might no longer work. "Can I continue to pull the data down onto my own local disk?" says Connolly. "Or do I now have to start moving my analysis to the database itself? That's what we're trying to learn now."

SLOAN WORKHORSE: The 2.5-meter telescope at Apache Point Observatory in New Mexico has hosted the Sloan Digital Sky Survey and all its follow-up projects.
Discoveries will ride on those sophisticated new algorithms, on clever ways to seek correlations, and on testing predicted statistical relationships across XLDBs. Innovative visualizations are another way. "When you have 1,000 points and you've measured three of their properties, you can put them on a couple of graphs and publish them on a flat piece of paper--that's been done for centuries," Feigelson says. "When you have a billion objects, and you've measured hundreds of things about them, you literally can't even look at [that data set] directly."
Astronomers will need imaginative techniques, such as using color or time in inspired ways. For instance, researchers might project higher-dimensional data sets onto rotatable, 3D frames, then make a movie of those data sets in time and study the movie for anything compelling going on. By interacting with the data while viewing the movie, they have a greater hope of spotting and understanding things using the human eye and mind, Feigelson says, than by relying solely on algorithms.
Alerting the Community
Besides gathering and processing incoming data, the LSST project will also, every night in real time, issue alerts when something has changed at a specific location on the sky--perhaps an object's brightness or position, or the appearance of something new. This will happen automatically, with each incoming image being "subtracted" from a deep template built from prior observations of the same spot. For any given object of interest, an alert will go out within 60 seconds of when the target was observed.
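In schematic form, the subtraction step works like the sketch below, a drastically simplified stand-in for the real difference-imaging pipeline, using synthetic images and an arbitrary detection threshold:

```python
import numpy as np

rng = np.random.default_rng(42)

# Deep template built from many prior visits to the same patch of sky.
template = rng.normal(100.0, 1.0, size=(64, 64))

# Tonight's exposure: the template plus fresh noise, plus one source
# that has brightened dramatically since the template was made.
new_image = template + rng.normal(0.0, 1.0, size=(64, 64))
new_image[20, 30] += 50.0  # the transient

# "Subtract" the template and flag pixels far above the noise level.
diff = new_image - template
threshold = 5.0 * diff.std()
alerts = np.argwhere(np.abs(diff) > threshold)
print(alerts)  # the single changed pixel: [[20 30]]
```

The real pipeline must also align the images, match their blurring, and weed out artifacts before a pixel excursion becomes one of the night's 10 million alerts.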
Not surprisingly given the LSST's stupendous capabilities, there will be a lot of these alerts--about 10 million per night. Individual astronomers who subscribe to the project's feed won't receive 10 million emails overnight; rather, with each visit to the subscription service, they'll receive a small number of alerts, say 20 or so, satisfying criteria they themselves specify. And LSST itself will tag easily identifiable asteroids, variable stars, and other known objects. "So without a huge amount of effort, people will be able to filter out the mundane things from a small subset of the exotic," Kahn says.
But LSST won't go beyond basic processing and alerting. That is, the LSST team will stop short of making scientific decisions; as with SDSS, they will leave that up to the astronomical community.
In light of this, Kahn expects there will be what he calls event brokers. These experts will determine what astronomers are interested in, cross-match those choices to other catalogs, and perform extra filtering accordingly. Which objects highlighted in the alerts are time-critical and should be viewed with other telescopes as soon as possible? Which should astronomers follow up on spectroscopically? Which should they observe with radio, infrared, X-ray, or gamma-ray facilities?
Event brokers will have a job on their hands classifying the alerts so astronomers can jump on the most important ones. "You don't just want to say, 'Something happened over here,"' Loredo says. "You want to try to say, 'Well, based on the previous 10 or 100 measurements we have there, our best guess of what happened there is this.' And then people who control the other observing resources can try to make more informed decisions, either with algorithms or just manually. It's a big issue."
Are Astronomers Ready?
Getting astronomers prepped for the torrent from LSST and other wide-field surveys is a formidable challenge. There are two issues, Kahn says: energizing the astronomical community to prime itself for this flood of data, and securing adequate financial support so it can do so. The latter commonly occurs in fields like particle physics, he says: in advance of big facilities like the Large Hadron Collider turning on, strong support for the community exists to ensure researchers are ready to conduct scientific analyses as soon as the faucet is turned on.
"In astronomy, there's been more of a culture of 'Let's wait till the data come and then we'll figure it out,"' Kahn says. Astronomers can't do that with the LSST or they'll find themselves drinking from a fire hose. "And you know what happens when you drink from a fire hose," Kahn says. "Your head gets blown off!" The task, he says, is to get astronomers to change their culture a bit and acknowledge that while the start of the survey is over six years away, they need to start preparing now.
Even if astronomers grasp the urgency, they're often not adequately trained in techniques they'll require. To work in that many-dimensional phase space, for example, astronomers need to learn what Loredo terms "a little nontrivial math." And data visualization of the type Feigelson described? "Nobody does good data visualization," Feigelson said flatly, referring to astronomers in general.
Traditionally, astronomers haven't been schooled in the necessary statistics and information technology. For instance, the number of classes in statistics required for a Ph.D. in astronomy at a U.S. university is zero, Feigelson says. Until recently there weren't even tenure-track positions available for astrostatisticians. Yet an understanding of statistics will be crucial to sifting treasures from the information overload. As Feigelson and his longtime Penn State collaborator, statistician Jogesh Babu, write in one of their papers, "Scientific insights simply cannot be extracted from massive data sets without statistical analysis."
Astronomers will just have to get comfortable with statistical techniques such as "advanced regression" and "Bayesian inference" and "multivariate classification." The same goes for tools in informatics. All the experts interviewed for this article agreed that collaborations among astronomers, statisticians, and information scientists must be greatly expanded. Some astronomers are on top of this, such as those associated with the Department of Energy's SLAC National Accelerator Laboratory, which is building the LSST's camera. But many others are not.
To that end, Feigelson spends a lot of time training astronomers in astrostatistics. "I've been in an airplane maybe 20 times in the last two or three years," he says, "where they fly me out to give tutorials in astrostatistics." He and colleagues have given instruction to about 10% of the world's astronomers, he estimates. "Which means we've had some inroads, but not enough to really modernize the methodology used by the entire field."
All the issues raised here really come down to one question: how well prepared will astronomers be when the goods begin gushing forth from the LSST, the Square Kilometer Array, and other such projects? "You don't fail by not being ready," Kahn says. "You can still do something. The question is, are you fully exploiting the data? That's really the challenge."
Incidentally, what is 200 petabytes of data--Kahn's estimate for the entire LSST data set, raw and processed, after 10 years--in stacked Blu-rays? It's about 15,750 feet, or roughly the height of Mont Blanc, the highest mountain in the Alps.
Peter Tyson is editor in chief of Sky & Telescope.
SKA TOTAL DATA VOLUME: HIGHER THAN SPACE
Think the LSST's anticipated total data volume (TDV) is large? The Square Kilometer Array's TDV is expected to be about 4,600 petabytes. In stacked Blu-ray discs, that's about 68.6 miles high. Space "officially" begins about 62 miles (100 kilometers) up.
Byting Off More ...
1 Byte = 8 bits
1 Kilobyte (KB) = 1,000 bytes (10³)
1 Megabyte (MB) = 1,000,000 bytes (10⁶)
1 Gigabyte (GB) = 1,000,000,000 bytes (10⁹)
1 Terabyte (TB) = 1,000,000,000,000 bytes (10¹²)
1 Petabyte (PB) = 1,000,000,000,000,000 bytes (10¹⁵)
1 Exabyte (EB) = 1,000,000,000,000,000,000 bytes (10¹⁸)
1 Zettabyte (ZB) = 1,000,000,000,000,000,000,000 bytes (10²¹)
Title Annotation: Cosmic Flood
Publication: Sky & Telescope
Date: Sep 1, 2016