Tackling the challenges of exploratory data analysis on large-scale data: rapid algorithm development in high-level programming languages delivers solutions.
However, big data combined with state-of-the-art numerical and combinatorial algorithms presents significant computational challenges. The research community needs rapid algorithm development in high-level programming languages that effortlessly deal with large data sets to surmount these challenges. According to Professor Nick Trefethen of Oxford University, "In computing with humans, response time is everything. One's likelihood of getting the science right falls quickly as one loses the ability to steer the computation on a human time scale."
Rapid algorithmic development
The resolution and input rates of sensors are growing rapidly, leading to exponentially larger volumes of data being generated, as compared to a few years ago. For instance, confocal microscopes are now available with 4096x4096-pixel resolution, four times the resolution of previous generation microscopes. Each frame from this latest microscope consumes about 24 megabytes, so even a moderate number of frames can consume well over a gigabyte of storage or memory. The same story holds true for input gathered from a variety of sources: gene data from microarrays, image and sensor data from satellites, mass spectrometry data about composition of physical samples, relationships in social networks, and metabolic pathways in biological networks, to name several. Data is big and quickly growing bigger. A crucial threshold for "big" is whether the input fits in the memory of a typical desktop computer (say, four gigabytes). Often, it does not.
In many disciplines, this data glut has transformed computational science from predictions based on first principles to deriving intuition from investigation of the actual phenomena. This aspect of computational science focuses on the analysis and theory formation step of the scientific discovery cycle. One example of this new approach is a group of biological researchers who want to optimize their lab space by quickly identifying cell cultures that have moved into an uninteresting state and replacing those cultures. This is a classification problem. Other groups want to identify important pathways in a metabolic network to optimize biomass from microbes, or to identify actors in social networks matching certain characteristics. These are dimensionality reduction problems. Climate scientists want to identify similar climate periods, and conservation biologists want to identify corridors for optimal animal movement in complex landscapes.
For all these problems, the best answer may exist theoretically but the computation may not be feasible. Computational complexity grows exponentially with the size of the data input and, therefore, approximate or heuristic methods have to be used. However, as the evidence grows, the approximations or heuristics that were workable for a smaller scale may not necessarily apply. Users will need to try a variety of algorithms on full-scale data to choose the best one. Unfortunately, exploring a variety of algorithms interactively on full-scale data runs is hamstrung by today's analysis tools.
Many scientists cope with growing input sizes by using only subsets of the full data. This enables them to tackle the problem using their preferred desktop tools, particularly the popular very high level languages (VHLLs) such as R, Matlab, Mathematica and Python. This approach sacrifices the ability to scale problems to larger computers in order to retain the productivity gains of VHLLs. As a result, the execution time of many researchers' analyses have ballooned from minutes or hours to days and even weeks, denying them the benefit of interacting with their data in real time. Researchers need alternatives to solve bigger problems faster.
Scientists and engineers typically turn to high performance computing systems for faster speed. But this approach requires algorithms originally developed in a VHLL to be rewritten into C, then parallelized using the message passing interface (MPI). While this approach can yield code that can scale to large sizes, the C and MPI interfaces are at such a low conceptual level that a rewrite can often take several months, even for experienced parallel programmers. Common algorithms for data analysis do not decompose naturally to run in parallel on multiple physical processors, making implementation even more cumbersome. This programming cost leads researchers to parallelize only algorithms they are certain they want to use, forgoing the ability to try several different modeling techniques and algorithms on full-scale data. In addition, once parallelized, modifications to the algorithms are equally expensive.
Fortunately, new libraries of commonly used algorithms for data exploration are becoming available from both commercial and open source communities. These include data mining and machine learning algorithms. Combinatorial algorithms to probe relationships between individual elements in a system also are included, which are typically represented as graphs. One such library, called the Graph Analysis and Pattern Discovery Toolbox (GAPDT) is currently implemented in the Matlab language and runs transparently in parallel with Star-E Here are a few examples to highlight the effectiveness of such an infrastructure.
Ecologists at the National Center for Ecological Analysis and Synthesis (NCEAS) model wildlife migration and gene flows across fragmented landscapes using electrical network theory. Corridors are areas that connect important habitats in human-altered landscapes. They provide natural avenues along which animals can travel, plants can propagate, genetic interchange can occur, species can move in response to environmental changes and natural disasters, and threatened populations can be replenished from other areas. A good example is Y2Y, or the Yellowstone to Yukon corridor, where U.S. and Canadian conservation organizations are trying to identify which habitats to conserve to protect species from harmful decline or extinction.
Effective modeling can be instrumental in smart conservation planning, helping organizations decide which lands to preserve or restore--and where to best invest their tight conservation budgets in order to preserve habitat and connectivity for wildlife populations. Circuitscape, a tool written in Matlab, uses several combinatorial algorithms (such as graph construction, connected components, graph contraction) and numerical algorithms (solving linear systems) to identify the corridors.
Initially, the code took three days to process a landscape with a million raster cells on a desktop workstation. Improvements resulting from using the GAPDT, combined with vectorization and parallelization with Star-P, enables problems 10 times bigger to be processed within minutes on a high-performance computer. Sequential performance also is increased dramatically. With this fast response time, ecologists can use these techniques as part of their routine workflow, rather than an esoteric tool only to be used in specialized situations.
Other examples are researchers doing social network analysis, fraud detection and genomic analysis. They often want to understand the key dimensions along which information is structured, without knowing in advance what those dimensions are. Existing methods commonly use the singular value decomposition of a matrix representing the graph of relationships.
In one case, a researcher wanted to use a non-traditional method, nonnegative matrix factorization (NNMF), for such a problem. Researchers have published several variants of NNMF in their papers. The codes are typically 10-20 lines of Matlab code. He tried an implementation on the real-world data set from NetFlix, the online DVD rental company, with ratings of 17,700 movies from about half a million users (a total of approximately 100 million ratings). This implementation completed in about two minutes using 100 processors of a parallel system demonstrating that advanced algorithms running on large datasets can run fast enough to be used by researchers in an interactive, exploratory mode.
Extending desktop VHLLs (Matlab, in this case) with automatic parallelization technology like Star-P allows new algorithms to be prototyped directly on the desktop, even when disconnected from the network (i.e., during an airplane flight). Then, when connected to the network and HPC system, the problem can be scaled up to the full-scale data. This permits scientists and engineers to explore their input at an instant, to the extent that their environment allows.
Algorithm researchers are scrambling to deliver good algorithms for today's data sizes, while bracing for the challenges of imminent larger digital information. Their ability to develop new algorithms depends on their ability to understand quickly how those algorithms do or do not work on fullscale data, and to repeat this process often. Similarly, end-user scientists, engineers and analysts need to have a variety of algorithms to try on fullscale data, to see quickly which ones work. Today's new libraries of algorithms are making real strides in addressing these challenges.
Acronyms GAPDT Graph Analysis and Pattern Discovery Toolbox | MPI Message Passing Interface | NCEAS National Center for Ecological Analysis and Synthesis | NNMF Non negative Matrix Factorization | VHLLs Very High Level Languages | Y2Y Yellowstone to Yukon Corridor
Steve Reinhardt is Vice President of Joint Research and Viral Shah is Senior Member, Technical Staff, at Interactive Supercomputing. They may be reached at editor@ScientificComputing.com.