Optimizing workflows in globally distributed, heterogeneous HPC environments is a complex task that requires significant software support.
IBM, for example, offers its Platform LSF suite of tools built on top of the well-known LSF job scheduler that has been a core component in HPC centers for many years. Platform LSF provides the IBM Platform Session Scheduler and IBM Platform Data Manager tools to create 'virtual private clusters' that can asynchronously run jobs on a local cluster, a geographically distant cluster, or inside the cloud. Jobs running within these virtual private clusters need only communicate with the scheduler inside the virtual private cluster. This means users can submit large volumes of tasks within the virtual private cluster that run asynchronously on the remote hardware without waiting for the main scheduler's approval. In this way, IBM's Platform LSF masks communications limitations and speed-of-light latency--even across long distances--to deliver extreme scaling within the job scheduler. Similarly, IBM Platform Data Manager stages data across distributed clusters via localized smart caches to eliminate data access delays as much as possible.
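A back-of-the-envelope calculation shows why keeping the scheduler local to the virtual private cluster matters. Every number below (round-trip time, task count) is an illustrative assumption, not a measurement:

```python
# Cost of per-task scheduler round-trips: remote vs. local dispatch.
# All numbers are illustrative assumptions.

TASKS = 100_000        # short tasks submitted in one campaign
RTT_REMOTE_S = 0.150   # assumed transcontinental scheduler round-trip
RTT_LOCAL_S = 0.001    # assumed round-trip to a scheduler inside the cluster

# If every task dispatch waits on one scheduler round-trip:
remote_overhead_h = TASKS * RTT_REMOTE_S / 3600
local_overhead_h = TASKS * RTT_LOCAL_S / 3600

print(f"remote scheduler overhead: {remote_overhead_h:.1f} hours")
print(f"local scheduler overhead:  {local_overhead_h:.2f} hours")
```

With these assumed numbers, chatting with a distant main scheduler for every task would add hours of pure latency, while a scheduler inside the virtual private cluster adds minutes--the same argument the article makes qualitatively.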
Such software tools help in creating and running tasks that can scale in these asynchronous, geospatially distributed environments--even with the added caveat that the environment can dynamically change through the addition and removal of cloud resources and clusters. These same software tools help users and systems administrators optimize their workflows and job scheduling to efficiently utilize systems that contain massively parallel accelerators and coprocessors, as well as to address more 'mundane' hardware differences, such as variations in memory capacity and CPU type.
In the IBM Platform ecosystem, workflow creation is supported via a graphical user interface (GUI) that lets users draw the data flow and computational interactions. People interact much more naturally with a GUI, as it lets them graphically visualize the overall computational work and data flows. A well-designed GUI (and set of GUI templates) can abstract the workflow sufficiently so that script generators--much like a compiler for a parallel computer--can then create the scripts that contain the complex task and command invocations that implement the user workflow. Further, these scripts can be targeted to run on a specific hardware configuration (again, much like a compiler generating code for multiple CPU architectures), be it a local cluster or an aggregation of multiple clusters and cloud environments containing a number of asynchronous 'virtual private clusters.'
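The 'workflow compiler' idea can be sketched in a few lines: a task graph of the kind a GUI might capture is topologically ordered and emitted as scheduler submissions with dependency flags. The workflow, the task names, and the `submit --after` command below are all hypothetical placeholders standing in for a real scheduler's dependency syntax:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow a GUI might capture: task -> prerequisite tasks.
workflow = {
    "stage_data": set(),
    "simulate":   {"stage_data"},
    "analyze":    {"simulate"},
    "render":     {"analyze", "stage_data"},
}

def generate_script(dag):
    """Emit scheduler commands in dependency order.

    'submit --after' is a placeholder for a real scheduler's
    dependency mechanism, not an actual command."""
    lines = ["#!/bin/sh"]
    for task in TopologicalSorter(dag).static_order():
        deps = ",".join(sorted(dag[task]))
        flag = f" --after {deps}" if deps else ""
        lines.append(f"submit{flag} {task}.sh")
    return "\n".join(lines)

print(generate_script(workflow))
```

A production generator would, as the article notes, also specialize the emitted script for the target hardware configuration, much as a compiler selects a code path per CPU architecture.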
Optimizing resource utilization means the systems team needs to see what is happening inside their globally distributed, asynchronously running multi-cluster HPC environment in real time--a non-trivial data collection and visualization task by itself. Further, both users and the systems management team need to be able to analyze the performance of the HPC center so users can improve the efficiency of their workflows over the short term, and both users and the systems team can collaborate on HPC upgrades and new procurements to improve efficiency over the long term.
Both real-time data acquisition and the analysis of aggregate HPC datacenter information are big-data tasks that might be larger than some of the scientific questions being investigated! Think of the amount of monitoring and profile data that can be generated by many thousands of nodes in real time, or the amount of data that must be gathered and stored for later analysis from those same nodes over the lifetime of the hardware. However, targeted data-driven decision-making is an essential part of data center operations and the procurement process, be it for a new system, a system upgrade, or to quantify cost and runtime machine requirements when contracting with a cloud-based service.
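To get a feel for the scale, consider a rough sizing exercise. Every parameter here is an illustrative assumption, chosen only to show the order of magnitude:

```python
# Rough sizing of monitoring data for a large cluster.
# Every parameter is an illustrative assumption.

nodes = 10_000          # compute nodes
metrics_per_node = 200  # counters and sensors sampled per node
sample_hz = 1           # one sample per second
bytes_per_sample = 16   # timestamp + value

rate_bytes_s = nodes * metrics_per_node * sample_hz * bytes_per_sample
per_day_tb = rate_bytes_s * 86_400 / 1e12
per_5yr_pb = per_day_tb * 365 * 5 / 1000

print(f"ingest rate: {rate_bytes_s / 1e6:.0f} MB/s")
print(f"per day:     {per_day_tb:.1f} TB")
print(f"over 5 yrs:  {per_5yr_pb:.1f} PB")
```

Even these modest assumptions yield tens of megabytes per second of telemetry and petabytes over a machine's lifetime--genuinely big data, as the article argues.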
Balance ratios, as discussed in my 2007 Scientific Computing article, "HPC Balance and Common Sense," are a commonly used set of metrics that can extrapolate, from the hardware characteristics of an existing system, the characteristics of a newer, faster machine that can run a job mix efficiently. The TOP500 site uses balance metrics based on synthetic benchmarks to compare systems. By extension, balance ratios and other metrics based on historical workload data for a site can be--and are--an invaluable tool for workload optimization and procurement planning. In short, balance ratios can distill a tremendous amount of HPC performance 'big data' into a few numbers. They are but a few of the many analytic tools (many of which are not so concise) that can be used to analyze and optimize HPC data center procurements and operations.
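Concretely, a balance ratio is just one machine characteristic divided by another--memory bandwidth per flop/s, or memory capacity per flop/s. The two machine specifications below are hypothetical, invented only to illustrate how the ratios distill the comparison into a few numbers:

```python
# Balance ratios distill machine characteristics into a few numbers.
# Both systems below are hypothetical illustrations.

systems = {
    #  name        (peak Gflop/s, mem bandwidth GB/s, RAM GB)
    "current":   (500.0,  100.0,  64.0),
    "candidate": (4000.0, 400.0, 128.0),
}

for name, (gflops, mem_bw, mem_cap) in systems.items():
    bytes_per_flop = mem_bw / gflops   # memory bandwidth balance
    gb_per_gflops = mem_cap / gflops   # memory capacity balance
    print(f"{name:9s}  bytes/flop = {bytes_per_flop:.3f}  "
          f"GB per Gflop/s = {gb_per_gflops:.3f}")
```

In this made-up comparison, the candidate machine has 8x the peak flop/s but its bytes/flop ratio drops from 0.2 to 0.1--a warning that a memory-bandwidth-bound job mix may see far less than an 8x speedup.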
Packages such as IBM's Platform LSF are attractive in that they provide an integrated experience, from the user all the way to the systems management team. Other robust and respected job scheduling packages, such as SLURM, are also available. The SLURM ecosystem provides a number of similar tools, including the ability to run applications in a distributed environment such as the TeraGrid.
Alternative profiling and analysis packages also exist. One example is the free NWPerf tool set discussed in my February 2015 Scientific Computing article, "Using Profile Information for Optimization, Energy Savings and Procurements" (1). The commercial Allinea MAP profiler also provides the information programmers need to optimize their HPC workflows.
Regardless, people need the ability to find quantifiable, data-driven answers to their questions about application, workload and data center efficiency. The increasing size and dynamic nature of global HPC operations along with the inclusion of heterogeneous hardware just means people need additional help in monitoring and optimizing workflows and data center operations.
(1.) "Using Profile Information for Optimization, Energy Savings and Procurements," Scientific Computing, February 2015. www.scientificcomputing.com/articles/2015/02/using-profile-information-optimization-energy-savings-and-procurements
Rob Farber is an independent HPC expert to startups and Fortune 100 companies, as well as government and academic organizations. He may be reached at editor@ScientificComputing.com.
HIGH PERFORMANCE COMPUTING | July 1, 2015