The care and feeding of your HSM environment.
One of the critical roles of the storage administrator is to understand and manage DFSMShsm, the hierarchical storage management (HSM) environment shipped with IBM z/OS systems. HSM provides backup, recovery, migration, and space management functions and enables storage administrators to manage storage at the data set level and device pool level. To optimize the HSM environment, storage administrators can set and apply rules in HSM.
Because HSM is so powerful and performs so many tasks, problems can arise. The most common trouble spots are high CPU usage caused by unneeded or wasteful actions, failures in space management or the recycle process, internal control data set errors, and problems with aggregate backup and recovery support (ABARS) or auto dump processing. This article provides best practices for optimizing the HSM environment, specifically for lessening high CPU usage and improving space management, along with tips for proactively managing HSM.
Lessening CPU Usage
High CPU usage generally raises red flags in IT shops. Using an SRM tool, you can correct some of the most common causes of high CPU usage associated with storage movements:
Migrating unnecessary data sets: An easy way to reduce CPU usage is to reduce or eliminate unnecessary HSM activity that is caused by ineffective management class policies or application JCL. SRM tools can help reduce or eliminate unnecessary HSM activity by tracking activity and thrashing and by tying the data sets to the DFSMS constructs.
In terms of DFSMS constructs, storage administrators should determine whether data sets are actually going to the proper storage groups to receive the proper HSM management. Review the SMS constructs to determine whether the correct management class is assigned to the data sets. If the data sets have incorrect management classes, the data could be deleted before the end of its useful life, retained for too long, or kept in the wrong location for use (such as sitting on ML2 tape every time the data is needed). All of these situations create unnecessary overhead, headaches for users, and possible legal issues.
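As a quick check, the constructs assigned to a suspect data set can be displayed from TSO. This is a hedged sketch; the data set name is hypothetical:

```
/* Ask HSM what it knows about the data set (migration copy, backups) */
HLIST DSNAME('PROD.PAYROLL.MASTER') BOTH

/* IDCAMS LISTCAT shows the assigned SMS constructs, including the
   management class, for an SMS-managed data set                      */
LISTCAT ENTRIES('PROD.PAYROLL.MASTER') ALL
```

If the management class shown is not the one the data should have, correct the ACS routines or the class attributes rather than fixing data sets one at a time.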
Errors in space management: You can correct errors in space management by changing management class policies, removing data sets with an undefined DSORG, removing uncataloged data sets, and halting any other unnecessary HSM recall/migration/backup activity. For example, dealing with return codes 99, 19, 82, and 37 will buy back quite a bit of productivity. Return code 99 is caused by an undefined DSORG, which can cause errors during backup and migration. Return code 37 is issued when there is not enough contiguous space to migrate the data set. If you receive a return code 37, you can either change the management class to keep the large data set off ML1, or you can increase the amount of space available in the ML1 pool. SRM tools enable you to quickly see which return codes you received and how to correct them.
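For return code 37, one hedged option is a management class that bypasses ML1 for large data sets. The class name and day counts below are illustrative, not a recommendation:

```
Management Class: MCBIGSEQ                 (hypothetical name)
  PRIMARY DAYS NON-USAGE . . . . : 15
  LEVEL 1 DAYS NON-USAGE . . . . : 0      <- 0 sends eligible data sets
                                             directly to ML2 tape,
                                             skipping the ML1 pool
```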
Wasteful tape recycle: Wasteful tape recycle usually results when the percent-valid threshold is set inappropriately. How the recycle is run--and when it is run--can also lead to wasteful recycles. Recycle should be part of an automated solution. It is best to run it during low tape drive usage, such as early in the morning just before normal business hours, or just before the nightly batch cycle runs.
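In an automated scheduler, the recycle step can be issued in TSO batch. A minimal sketch, with an illustrative threshold and limit:

```
/* Preview the candidate ML2 tapes first (no tape mounts are needed) */
HSEND RECYCLE ML2 VERIFY PERCENTVALID(25)

/* Then recycle for real, capping the tapes returned to scratch      */
HSEND RECYCLE ML2 EXECUTE PERCENTVALID(25) LIMIT(50)
```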
Recall thrashing: Migrating and then immediately (or in a short period of time) recalling the same set of data sets wastes valuable HSM resources and CPU cycles and can be quite costly. Understanding what is being migrated and quickly recalled requires a thrashing report. Examining the thrashing report and who or what is recalling the data sets will help you set better management class policies. You can also code batch JCL to be more efficient in how it handles GDG bases or sequential files.
Small data sets are the nemesis of HSM. Even when using small data set packing (SDSP) files, constantly migrating small data sets to ML1 and then to ML2 tape, only to expire them a short time later, is wasted effort. The space recovered by compacting files during migration is minimal compared with the cost of DASD. For SDSP data sets, take a closer look at when the data migrates and when it expires. Changing the management class policies for these data sets to let them "live and die" on primary disk is a good idea: it reduces unnecessary migrations, expirations, and tape recycle activity.
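A hedged sketch of such a "live and die on primary" management class; the name and expiration value are illustrative and should be tuned to the data's actual reference pattern:

```
Management Class: MCSMALL                  (hypothetical name)
  COMMAND OR AUTO MIGRATE  . . . : NONE    <- never migrate these
  EXPIRE AFTER DAYS NON-USAGE  . : 60      <- expire in place on
                                              primary DASD instead
```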
Space Management Cycles
With increasing pressures to reduce costs, it is important to manage space efficiently and migrate unnecessary data off of the primary DASD. SRM solutions gather historical information on pools, volumes, data sets, and VTOCs and display the results. User-defined search engines can rapidly locate information required for daily space management. For example, storage occupancy should be viewed from the perspective of logical groupings, such as departments or applications. SRM solutions can use application definitions for these user-defined groupings. DADSM exits (pre- and post-processing exits as well as user-defined exits) enable SRM solutions to monitor space at allocation and deallocation to provide an accurate evaluation of an application's storage use versus its quota.
As companies strive to better align their IT operations with the needs of the business, application definitions provide the ability to deliver services based on business priorities. For example, a data set can be a member of up to four application definitions, which can be tiered or hierarchical. Each application definition can have different quota controls, including monitoring, warning on, or rejecting an allocation if the quota is exceeded. Important applications are assured of space, while optional ones are restricted. You can remove restrictions when conditions change, such as when the request is made outside the peak shift.
To ensure that space management is running efficiently, determine when the primary and secondary space management cycles start and end. Do these times interfere with production batch cycles, certain online transactions, or database backup activity? Is the space management activity completing on time? Determine how long space management takes to run, and adjust the space management windows to meet all the business needs. The idea is to review all of the resources used, including tape drives.
Determine if the space management cycles are spread over multiple hosts or run on a single primary host. Be aware of activities running concurrently at any given time to ensure that they do not cause contention. Knowing where space management runs will help you determine whether resources overlap. In most cases, a single primary host is sufficient to execute all space management activities, regardless of how many LPARs are running. There is also a limit to the number of tasks that can be executed concurrently. For example, Customer A has only 10 tape drives and 4 LPARs that each run space management. If the HSM PARMLIB on each system allows a maximum of 10 tape tasks, each LPAR behaves as though all 10 drives are available to it, a combined demand of 40 drives against the 10 that actually exist. This situation causes tape drive contention.
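The PARMLIB side of this can be sketched as ARCCMDxx SETSYS statements; the windows and task counts below are illustrative assumptions, not tuned values:

```
/* ARCCMDxx sketch for the designated primary space-management host */
SETSYS PRIMARYSPMGMTSTART(0200 0500)    /* run 02:00-05:00          */
SETSYS SECONDARYSPMGMTSTART(0100 0200)
SETSYS MAXMIGRATIONTASKS(5)             /* concurrent migrations    */
SETSYS TAPEMAXRECALLTASKS(2)            /* keep the four hosts'     */
                                        /* total within 10 drives   */
```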
Additionally, consider what other processes are running while space management is running. For example, if you are running space management between 12:00 and 6:00 a.m., and your batch processing is doing recalls at the same time, the tape drives that are doing regular batch processing and HSM recalls will compete with tape drives used for space management.
Proactively Managing HSM Storage
It is important to understand what HSM is processing on a daily basis; this knowledge helps you determine where potential problems may lie.
You can check the following key items through a combination of queries to HSM using an SRM tool:
* Space migrated
** Volume report
** Data set report
* Error summary report
* Thrashing reports
* Recycle by type
** Zero percent
** 25 or 30 percent
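Several of these items can also be checked directly with TSO queries. This is a hedged sketch; exact operands vary by release:

```
HSEND QUERY ACTIVE                      /* what HSM is doing now    */
HSEND QUERY WAITING                     /* queued requests          */
HSEND REPORT DAILY FUNCTION(MIGRATION)  /* recent migration counts  */
                                        /* and failures             */
```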
Divide the recycle of HSM tapes into two groups. Issue the recycle for zero percent valid, then 30 minutes later, issue the recycle for 25 or 30 percent valid. This returns a larger number of tapes to scratch in a given period of time, because the zero percent tapes do not require a tape mount.
Recycle the tapes based only upon the type of HSM tape. For example, on Monday, issue a recycle only for the Monday backup cycle tapes. Do not include ML2 or any other backup cycle. This helps reduce the number of tape drives used, and it keeps the recycle process on a schedule that is easily tracked.
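The two-stage, by-type recycle described above can be sketched as follows; DAILY(1) stands in for the Monday slot of the backup cycle and is illustrative:

```
/* Stage 1: tapes with zero valid data; no tape mounts are required */
HSEND RECYCLE DAILY(1) EXECUTE PERCENTVALID(0)

/* Stage 2, roughly 30 minutes later: tapes below 25 percent valid  */
HSEND RECYCLE DAILY(1) EXECUTE PERCENTVALID(25)
```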
Software solutions can monitor, analyze, and automate tasks to safeguard the health of storage subsystems and ensure that critical applications complete successfully. Managing HSM is not inherently easy, but SRM tools can make it easier to work with HSM by providing the following features:
* real-time HSM messages
* improved HSM collection database
* views of the active recall queue within HSM
* events, TSO messages, and alerts on threshold conditions, including queue length, batch recalls that have waited too long, and number of recalls from the same user
* the ability to prioritize HSM recall processing across the sysplex and the HSM Common Recall Queue (CRQ)
For example, you can automate the response to HSM and DFSMSdss (DSS) messages by automatically generating control statements in response to key HSM errors. Cryptic error log messages are filtered, reworded, and responded to, based on your criteria. This feature saves time and makes HSM easier to manage.
SRM solutions also collect detailed historical information about storage performance and capacity. Performance statistics for data sets include response time, I/O rates, and cache activity and can be collected and summarized at several levels. For example, you can use monthly summaries to forecast storage usage and use daily and interval information to pinpoint a problem. Using this data, storage administrators can establish performance and capacity exception management through threshold-based facilities. For example, you may want to generate an alarm for net capacity load in IBM RVA disk subsystems or a high cache read miss percentage in EMC Symmetrix devices. An alarm can be sent to the SRM solution to initiate corrective automated actions.
In conclusion, storage administrators should periodically ask themselves:
* Is HSM doing what I want?
* Am I being proactive enough?
* Does HSM have internal problems?
* Is HSM causing outside issues?
When you can answer these questions, you have a handle on your HSM environment and are able to manage the data center data efficiently and effectively.
Mike Spencer is the technical lead for MAINVIEW SRM Level 2 Support at BMC Software (Houston, TX).
Title Annotation: HSM: Special Section; hierarchical storage management
Publication: Computer Technology Review
Date: Mar 1, 2005