
New computational techniques speed knowledge discovery: pioneering algorithms sort through massive unstructured datasets using familiar desktop tools.

Despite the astonishingly powerful and inexpensive computation available to scientists, engineers and analysts these days, it's all for naught if the researchers aren't sure what they're looking to discover in the first place. Much of today's most important research seeks the unknown results of sifting through huge proverbial haystacks of data, a.k.a. knowledge discovery (KD). To this end, researchers are pioneering new mathematical techniques for sorting through enormous unstructured datasets to try to find meaningful trends, patterns or clusters of data. The technology is enabling them to figure out which relevant questions to ask, which data would be useful to that end, and what knowledge can be derived from the inquiries.

Some popular knowledge discovery techniques are derived from established algorithms such as k-means clustering, principal component analysis (PCA), non-negative matrix factorization (NMF) and singular-value decomposition (SVD). The problem is that datasets in the multi-terabyte and petabyte range have become commonplace in science, government and industry, while parallel implementations of those algorithms on high-performance computers (HPCs), necessary for both memory capacity and execution speed, have not kept pace.

Key algorithms

Before we can address the computational challenge of executing the code on large datasets, it's important to understand how the knowledge discovery algorithms work. When users believe that data is highly redundant or irrelevant, but in unknown ways, they often use dimensionality reduction. For instance, human faces can be characterized by numerous anatomical characteristics that can be automatically extracted from their images. However, it's difficult to know which of those characteristics vary together or do not help discriminate faces. Dimensionality or rank reduction algorithms reduce data from a large number of dimensions down to a few to several dimensions.
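To make the idea concrete, here is a minimal pure-Python sketch of one classical approach: estimating the leading principal component (the single direction of greatest variance) by power iteration on the covariance matrix. The function name and toy data are illustrative, not from any package mentioned in this article.

```python
import random

def pca_first_component(points, iters=200):
    """Estimate the leading principal component of centered data
    by power iteration on the covariance matrix (pure-Python sketch)."""
    n = len(points)
    d = len(points[0])
    # Center each dimension on its mean.
    means = [sum(p[i] for p in points) / n for i in range(d)]
    centered = [[p[i] - means[i] for i in range(d)] for p in points]
    # Build the d x d covariance matrix.
    cov = [[sum(x[i] * x[j] for x in centered) / n for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication by cov converges to the
    # dominant eigenvector, i.e. the direction of greatest variance.
    v = [random.random() for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

random.seed(0)
# Points scattered along the y = x direction: the leading component
# should be close to (0.71, 0.71) up to sign.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.05)]
comp = pca_first_component(data)
```

Projecting each point onto `comp` reduces the data from two dimensions to one while retaining most of its variance; real tools use full SVD rather than power iteration, but the effect is the same.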

Choosing the right dimensions is based on those dimensions' abilities to maximize the discrimination among the different data points, effectively giving a brief synopsis of the data. The reduced number of dimensions enables clustering algorithms, which are often the next step in a workflow, to work more effectively, and can make the data more understandable to a human. Some dimensionality reduction algorithms preserve a notion of "distance" between data points, while others do not; this can be important when the original data has some notion of physical distance to it. Prominent dimensionality-reduction algorithms include PCA, SVD and NMF. Some researchers favor NMF because the dimensions it identifies are more readily interpretable due to its non-negative values, but, since it is an optimization, it does not define unique factors and is sensitive to initial conditions.
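Because the text singles out NMF's non-negativity and its sensitivity to initial conditions, a tiny sketch of the classic Lee-Seung multiplicative updates may help. The factors stay non-negative by construction, and rerunning from a different random seed generally yields different (equally valid) factors, which is exactly the non-uniqueness noted above. This is an illustrative toy, not code from any product named here.

```python
import random

def matmul(A, B):
    """Plain-Python matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def nmf(V, r, iters=500, eps=1e-9):
    """Factor a non-negative m x n matrix V into W (m x r) and H (r x n)
    with Lee-Seung multiplicative updates.  Updates multiply by ratios of
    non-negative quantities, so W and H never go negative."""
    m, n = len(V), len(V[0])
    W = [[random.random() for _ in range(r)] for _ in range(m)]
    H = [[random.random() for _ in range(n)] for _ in range(r)]
    for _ in range(iters):
        Wt = list(map(list, zip(*W)))
        WtV, WtWH = matmul(Wt, V), matmul(Wt, matmul(W, H))
        H = [[H[i][j] * WtV[i][j] / (WtWH[i][j] + eps) for j in range(n)]
             for i in range(r)]
        Ht = list(map(list, zip(*H)))
        VHt, WHHt = matmul(V, Ht), matmul(matmul(W, H), Ht)
        W = [[W[i][j] * VHt[i][j] / (WHHt[i][j] + eps) for j in range(r)]
             for i in range(m)]
    return W, H

random.seed(1)
# A non-negative matrix of rank 2 (row 3 is twice row 1).
V = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 0.0, 4.0]]
W, H = nmf(V, r=2)
WH = matmul(W, H)
```

The rank-2 factorization reconstructs V closely, and each row of H can be read directly as a non-negative "part" of the data, which is the interpretability advantage researchers cite.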


Clustering, also known as unsupervised classification, is another important algorithm in knowledge discovery. Clustering assigns each data point to a cluster, maximizing the similarity among points in the same cluster and minimizing the similarity among points in different clusters. Clustering on real-world data can be difficult because many popular algorithms require the user to specify how many clusters should be created, which is hard to know in advance, and because the algorithms don't provide insight into why data points were assigned to particular clusters. As examples, clustering is used in climate change research to identify the types of land cover that exist (shrub, grass, urban, agricultural, etcetera), and to identify annual climate patterns that are typical and have good predictive power. K-means clustering works well in this situation, where k centroids define the k clusters in the result, based on a chosen distance metric; often users run the algorithm for a range of values of k to see which value gives the best results. Other clustering algorithms include hierarchical, spectral (which can start to look like dimensionality reduction) and nearest-neighbor.
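The k-means procedure described above can be sketched in a few lines of pure Python: alternate between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The toy data is invented for illustration.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means with squared Euclidean distance (pure-Python sketch)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k random data points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to its cluster's mean.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centroids, clusters

# Two well-separated groups of three points each.
pts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
       (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
cents, groups = kmeans(pts, k=2)
```

As the article notes, k must be chosen in advance; in practice users rerun this with several values of k and compare, for example, the within-cluster scatter of each result.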

The parallel KD challenge

The challenge is that knowledge discovery users seeking to exploit these KD algorithms are typically domain experts--scientists or analysts--not math experts or computer scientists. Their habit is to explore and visualize the data interactively from their desktops using very high level languages (VHLLs) like the M language of MATLAB (1) or Python. Without HPCs and appropriate tools for using them at their disposal, they are forced to limit the size of datasets and the types of problems they wish to explore rather than make a very large and time-consuming investment in parallelizing the KD algorithms. This serial-only approach sacrifices the ability to scale problems to larger computers in order to retain the productivity gains of VHLLs. As a result, the execution time of many researchers' analyses has ballooned from minutes or hours to days and even weeks, denying them the benefit of interacting with their data in real time.

However, the traditional approach to knowledge discovery with HPCs requires the serial algorithms originally developed in a VHLL to be rewritten into C, then parallelized using the message passing interface (MPI). While this approach can yield code that can scale to large sizes, the C and MPI interfaces are at such a low conceptual level that a rewrite can often take several months, even for experienced parallel programmers. In addition, once parallelized, modifications to the algorithms are equally expensive. Further, common KD algorithms often do not decompose naturally to run in parallel on multiple physical processors, making implementation even more cumbersome. This programming cost leads researchers to parallelize only KD algorithms they are certain they want to use, foregoing the ability to try several different KD algorithms on full-scale data.

New techniques

Fortunately, new programming models have emerged that allow knowledge discovery users to explore massive, unstructured datasets using their familiar desktop tools, and then to automatically parallelize the code to run on HPCs. This new problem-solving model combines the best of both worlds: the familiarity and interactivity of popular desktop tools with the computational muscle of powerful multi-core servers, linked by straightforward, interactive parallelization software. It has the added benefit of not requiring research teams to make huge investments in hardware and programmer training.


One example of this approach is a system that researchers at Pacific Northwest National Laboratory (PNNL) are creating to identify faces in streaming video. In another example, ecologists are using an improved modeling approach to look at wildlife migration and gene flows across fragmented landscapes using electrical network theory.

View to the future

Over the next few years, we can expect the availability of packages like the Knowledge Discovery Suite to accelerate the development of effective algorithms for knowledge discovery, providing much improved insight to researchers who are facing an onslaught of data from increasingly finer-grained sensors.

Acronyms
GPU Graphics Processing Unit | HPC High-Performance Computer | ICA Independent Component Analysis | KD Knowledge Discovery | KDS Knowledge Discovery Suite | MPI Message Passing Interface | NMF Non-negative Matrix Factorization | PCA Principal Component Analysis | PNNL Pacific Northwest National Laboratory | SVD Singular-Value Decomposition | VHLL Very High-Level Language | Y2Y Yellowstone to Yukon Corridor

Designing Wildlife Habitats for Wide-ranging Species

New programming approach enables dramatic modeling improvements

Ecologists are modeling wildlife migration and gene flows across fragmented landscapes using electrical network theory. Corridors are areas that connect important habitats in human-altered landscapes. They provide natural avenues along which animals can travel, plants can propagate, genetic interchange can occur, species can move in response to environmental changes and natural disasters, and threatened populations can be replenished from other areas.


A good example is "Y2Y," or the Yellowstone to Yukon corridor, where U.S. and Canadian conservation organizations are trying to identify which habitats to conserve to protect species from harmful decline or extinction. Effective modeling can be instrumental in smart conservation planning, helping organizations decide which lands to preserve or restore and where to best invest their tight conservation budgets in order to preserve habitat and connectivity for wildlife populations. Circuitscape, (1) a tool written in Python and M, uses several knowledge discovery algorithms to map known characteristics of the target species to the topography and thereby model the species' behavior and identify likely corridors.

Traditionally, code might take three days to process a landscape with a million raster cells on a desktop workstation. Improvements resulting from using a suite of algorithms known as the Knowledge Discovery Suite (2) (KDS), which defines algorithms at a higher level with efficient parallel implementations, combined with vectorization and parallelization with Star-P, enable problems 10 times bigger to be processed within minutes on an HPC. Sequential performance also increases dramatically. With this fast response time, ecologists can use these techniques as part of their routine workflow, rather than as an esoteric tool reserved for specialized situations.


(2.) Brad H. McRae, Brett G. Dickson, Timothy H. Keitt, and Viral B. Shah, "Using circuit theory to model connectivity in ecology, evolution, and conservation," Ecology, Vol. 89, No. 10, pp. 2712-2724.

Understanding Associations among Faces in Streaming Video

PNNL researchers achieve greater algorithm and processor flexibility

Researchers at Pacific Northwest National Laboratory (PNNL) are accomplishing high-throughput analysis of unstructured data with parallel KD algorithms using an approach that lets them do their analyses with familiar desktop tools and then automatically parallelize the code to run on HPCs. Dr. Harold Trease and his colleagues are creating a system that identifies faces in streaming video and makes high-level associations about which faces are associated with other faces. (1) Once faces are isolated from background information, the workflow then characterizes each face using principal component analysis on a high-dimensional facial signature. It then uses the low-dimensional transformed data to create an adjacency matrix representing facial associations, a multi-step process involving creation and partitioning of a Voronoi/Delaunay mesh.
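A greatly simplified sketch of the back end of this workflow: given low-dimensional face signatures (as PCA would produce), link any two whose distance in the reduced space falls below a threshold. This thresholded-distance rule is a stand-in for the article's actual Voronoi/Delaunay mesh construction, which is considerably more involved; the data and threshold are invented.

```python
def adjacency_from_features(features, threshold):
    """Build a symmetric 0/1 adjacency matrix linking any two reduced
    signatures closer than `threshold` (Euclidean distance).  A simplified
    stand-in for the mesh-based association step described in the text."""
    n = len(features)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = sum((a - b) ** 2
                       for a, b in zip(features[i], features[j])) ** 0.5
            if dist < threshold:
                adj[i][j] = adj[j][i] = 1
    return adj

# Three 2-D "signatures": the first two are near-duplicates (same face),
# the third is far away (a different face).
sigs = [(0.1, 0.2), (0.12, 0.21), (3.0, 3.0)]
adj = adjacency_from_features(sigs, threshold=0.5)
```

Connected components of the resulting graph then group appearances of the same face, which is the kind of high-level association the PNNL system extracts.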

For the first implementation of this workflow, Trease and his team implemented each step discretely with C, Fortran and MPI, requiring detailed knowledge of each algorithm and considerable effort for the implementation. For a more robust implementation, the developers wanted the flexibility to change algorithms (e.g., substituting independent component analysis for principal component analysis) and the ability to exploit computational accelerators, such as graphics processors (GPUs) from NVIDIA or AMD/ATI. By using Knowledge Discovery Suite (KDS), PNNL researchers are able to achieve this flexibility. For instance, PCA is just one algorithm for reducing high-dimensional data to fewer dimensions that capture the diversity of the data more succinctly. SVD, NMF and independent component analysis (ICA) are other dimensionality-reduction algorithms with somewhat different results based on their different mathematical natures. A typical researcher wanting to reduce dimensionality will not know the details of PCA compared to ICA, nor when one is preferred over the other, especially for the nature of the researcher's specific data. Thus the KDS' menu of algorithms, each implemented for scalable performance, enables the researcher to calibrate the algorithm with the actual data at scale.
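The interchangeability described here is, at bottom, a design choice: express the pipeline against an abstract "reducer" rather than a specific algorithm. A minimal sketch, with two deliberately trivial toy reducers standing in for PCA and ICA (all names are hypothetical, not KDS API):

```python
def truncate_reducer(points, d):
    """Toy reducer: keep only the first d coordinates of each point."""
    return [tuple(p[:d]) for p in points]

def mean_offset_reducer(points, d):
    """Toy reducer: keep the first d coordinates, centered on their means."""
    n = len(points)
    means = [sum(p[i] for p in points) / n for i in range(d)]
    return [tuple(p[i] - means[i] for i in range(d)) for p in points]

def reduce_then_analyze(points, reducer, d):
    """Run any dimensionality reducer, then hand off the reduced data.
    Real code would pass the result to a clustering or meshing step;
    the point is that `reducer` is swappable without touching the rest."""
    return reducer(points, d)

pts = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
a = reduce_then_analyze(pts, truncate_reducer, 2)
b = reduce_then_analyze(pts, mean_offset_reducer, 2)
```

Swapping PCA for ICA then becomes a one-argument change rather than a rewrite, which is the flexibility the PNNL team wanted.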

Another important benefit of expressing algorithms at a higher level is the greater leeway available to the library developer to exploit the power of accelerators like GPUs. Because accelerators tend to have niche algorithms for which they are extremely well suited, the library developer can map a high-level algorithm to the set of things an accelerator does well and reap the potential added performance, without the application developer being intimately aware of the use of the accelerator. This is important for Trease's application, which needs to use hardware accelerators in order to run fast enough.

(1.) H. Trease, T. Carlson, R. Moony, D. Farber, and L. Trease, "Unstructured Data Analysis of Streaming Video Using Parallel, High-Throughput Algorithms," Proceedings of the 9th IASTED International Conference on Signal and Information Processing, Honolulu, USA, 2007.


(1.) MATLAB is a registered trademark of The MathWorks. Other product or brand names are trademarks or registered trademarks of their respective holders. ISC's products are not sponsored or endorsed by The Mathworks, or by any other trademark owner referred to in this document.

Steve Reinhardt is Vice President of Joint Research at Interactive Supercomputing. He may be contacted at
COPYRIGHT 2009 Advantage Business Media

Article Details
Title Annotation:High Performance Computing
Author:Reinhardt, Steve
Publication:Scientific Computing
Date:Jan 1, 2009
