Storage Network Management Working Group.
Identify, define, and support open standards needed for a manageability framework for storage networking systems, including Storage Area Networking (SAN), SAN Attached Storage (SAS), and Network Attached Storage (NAS). Our end goal is to lower TCO and increase SAN product manageability. By identify, we mean to reuse current best practices and industry standards from sources such as the NSIC, the Open Group, the DMTF, the NMF, the IETF, and others as the starting point for forming the framework.
By define, we mean to propose new--or adapt existing--standard infrastructures to meet the needs of the SNIA manageability framework. By support, we mean to develop and promote the SNIA manageability framework within our companies and in the industry at large.
The SNMWG will endeavor to cooperate with other industry organizations that are working towards the common interest of SAN and enterprise manageability. The SNMWG will work in a coordinated and cooperative manner with other SNIA working groups.
Managing The Enterprise
The Storage Network Management Working Group (SNMWG) was formed to press the industry to build standard interfaces that supply the information needed for all of the Enterprise Storage Resource Management (ESRM) disciplines. As of June 1998, the group was subdivided into smaller work groups to focus on:
1. High-end, multi-platform, intelligent storage facilities
2. Removable media, tape, virtual tape systems
3. Fibre channel management
4. Small locally attached RAID systems
We expect these groups to come up with common standards that will allow all ESRM software to compete on level ground, and create a world of ESRM-compliant devices. This will allow customers to truly choose ESRM software based on the richness of function, and it will allow them to manage all OEM devices from a single terminal in the enterprise.
We have yet to define what storage resources are and what the ESRM disciplines for managing them are. If we want to truly manage the enterprise, we must understand both hardware and software resources, even though the hardware resources are the more obvious of the two.
Storage Hardware Resources:
* Disk storage
* High-end storage facilities
* Small RAID devices
* SSA (Serial Storage Architecture) devices
* JBODs (Just a Bunch Of Disks)
* Storage area networks
* Network attached storage
* Removable media
* Virtual tape servers
* High-end automated tape libraries
* Small tape libraries
* Optical systems
* Discrete tape drives/subsystems
* Server platforms
* Fibre Channel hubs/switches/routers/bridges
This collection of storage facilities for network storage offers unique management challenges. Let's take DASD subsystems today and compare them to those of the past.
Today, all vendors provide look-alikes of the 3880 and 3990 family of storage control devices, even though the architecture of each vendor's storage facility is totally different under the hood. With mainframe channel commands, you can issue a Read Device Characteristics command and get asset information from any vendor's box. You can issue LISTDATA commands on MVS and get cache performance information that is meaningful across all vendors' boxes. Furthermore, the RMF (Resource Measurement Facility) data holds true for any vendor's device. There is a complete industry for managing these 3880/3990 clones, with contributing corporations including IBM, Computer Associates, BGS/BMC, Boole & Babbage, Sterling Software, Candle Corporation, and many others.
The storage facilities of tomorrow have processors, operating systems, and several layers of cache. In addition, they are connected to several operating-system platforms concurrently, which may make them the biggest management problem of all. If these high-end, multi-platform, intelligent storage facilities become the majority of storage in the field (and chances are very good that they will), then the storage resource management industry has a major problem.
Simple things such as RMF data and LISTDATA become the sound of one hand clapping, since they represent only the S/390 view and not the Unix, AIX, Windows NT, or other platforms to which the facilities are also attached. And if you don't have a channel attachment, how do you get the information in the first place, and how do you drill down to the platform level to determine which application or file is pegging the storage facility?
Some software vendors are trying to support large sets of hardware vendors, but they are failing miserably because every vendor's box has its own unique interfaces. This causes an explosion of software, because custom code must be written for every unique box. No software company can keep up with this environment. Hardware vendors are embedding their own Web server applications with BUI (Browser User Interface) front ends into their storage facilities. The customer must then learn each BUI to micromanage each vendor's device.
Some of the newer virtual tape systems may be the most difficult storage facilities to manage. They contain a processor, an operating system, storage management software, disk storage, cache, and a complete tape library system with controllers, tape drives, and robots. They are also attached to multiple operating platforms. The disk storage inside basically acts as a "cache buffer" to the tape subsystem by emulating virtual tape drives. DASD storage vendors may soon offer a different kind of clone that doesn't even contain a tape subsystem: a front end to any tape system, acting very much like the virtual tape systems on the market today.
Storage Software Resources
The notion of software as a storage resource might seem a little nebulous at first but consider the amount of storage management being done on both mainframe and open systems by software such as DFSMShsm, SAMS Disk, SAMS Vantage, ADSM, ARCserve, and GEMS, just to name a few. It is not unusual to find multiple terabytes of storage being managed by products such as these. They usually are automated, and they normally run in windows throughout the day.
In that light, they are like machines grinding through the data. And like machines, they have processing failures: a file in use during backup, a full log, a workload too large for its window, and so on. When they fail, it is just as severe (if not more severe) than a piece of hardware failing. For example, if DFSMShsm doesn't complete the PSM (Primary Space Management) cycle, then there may not be enough free space to run the daily corporate business workload.
In order to manage these software resources, intelligent agents need to monitor the events, errors, windows, etc. and provide the automation hooks to keep them running without failure. This is no different from monitoring the number of read/write errors on a tape head to allow the automount of a cleaning cartridge.
These products also need to be tuned for performance and capacity planning. Both of these disciplines have a real effect on the amount and type of supporting hardware required. Therefore, both what-is and what-if reports are needed as much as the threshold events for automation.
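To make the monitoring idea concrete, here is a minimal sketch, in Python, of a threshold monitor that fires automation hooks when a software resource's metrics cross their limits. All the names here (`SoftwareResourceMonitor`, the metric names) are hypothetical; no real ESRM product interface is implied.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical sketch: watch metrics reported by a storage-management
# product and run automation hooks when a threshold is crossed, the way
# a cleaning cartridge might be auto-mounted after too many head errors.

@dataclass
class SoftwareResourceMonitor:
    name: str
    thresholds: dict          # metric name -> maximum allowed value
    hooks: List[Callable[[str, float], None]] = field(default_factory=list)

    def report(self, metric: str, value: float) -> List[str]:
        """Record a metric sample; return any alerts raised."""
        alerts = []
        limit = self.thresholds.get(metric)
        if limit is not None and value > limit:
            alerts.append(f"{self.name}: {metric}={value} exceeds {limit}")
            for hook in self.hooks:
                hook(metric, value)   # e.g. mount a cleaning cartridge
        return alerts
```

A caller would register a hook (page someone, start a task) and feed samples in as they arrive; values below the threshold produce no alerts.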
The following management disciplines are all part of Enterprise Storage Resource Management:
* Asset management
* Capacity management
* Configuration management
* Performance management
* Availability management
* Outboard management
* Policy management
The names of these disciplines are somewhat arbitrary, and there is some overlap in the available device information as it applies to each of the disciplines. These names were chosen not to highlight the information but to focus on the customer problem being addressed. With these categories, it becomes much easier to define the discipline as it applies to the particular storage resource being managed.
Some of the storage vendors are providing software to perform parts of the above disciplines on their storage devices only. This is not what customers want. They do not have the people or the time to micro-manage every vendor's device with specialized software. The next set of sections will provide more detail about the storage resource management disciplines listed above.
Asset Management
This discipline addresses the need to discover resources, recognize each resource, and tie it into the rest of the topology. This means that an agent could distinguish between a high-end DASD storage facility, a Fibre Channel switch, a high-end virtual tape server, or another resource. After discovery, it would dynamically load the latest version of an agent and call an API for asset information, which would probably include:
* Vendor/model
* Software/license/patch
* Manufacture and support
* Physical location
* Graphic images
It is also important to discover software resources such as an ADSM Version 3 server or other ESRM agents/managers.
There are many functions that could be put under this discipline such as asset discovery, asset topology, asset lease management, and software and microcode management.
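The discover-classify-dispatch flow above can be sketched as follows. The resource class names, agent names, and the fields of the identify response are all illustrative assumptions; no SNIA-defined interface existed at the time of writing.

```python
# Illustrative discovery-time classification: map a discovered device's
# self-description to the agent class that knows how to manage it, and
# extract the asset-management fields. All names are hypothetical.

RESOURCE_CLASSES = {
    "dasd": "HighEndDasdAgent",
    "fc-switch": "FibreChannelSwitchAgent",
    "vts": "VirtualTapeServerAgent",
}

def classify(identify_response: dict) -> str:
    """Choose the agent class for a discovered resource."""
    kind = identify_response.get("class")
    agent = RESOURCE_CLASSES.get(kind)
    if agent is None:
        raise ValueError(f"unknown resource class: {kind!r}")
    return agent

def asset_record(identify_response: dict) -> dict:
    """Pull the asset fields (vendor/model, etc.) out of a response."""
    return {k: identify_response.get(k)
            for k in ("vendor", "model", "microcode", "location")}
```

The key design point is the dispatch table: one agent implementation per general resource class, rather than one per vendor box.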
Capacity Management
This set of information would vary depending on the resource being managed. For example, in large DASD storage facilities, we would need to understand multiple levels of capacity. Basically, IT departments don't ever want to run out of free space.
To do positive capacity planning, corporate resource managers need to understand the additional capacity available at both the physical and the logical storage levels. This includes information like a box's available free space/slots, unassigned volumes, free/used space within the assigned volumes, plus some file-level detail. They also need to understand the growth capacity based on the model, or how many frames with slots for disk drawers could be added if necessary. For a Fibre Channel switch, the capacity could be expressed as a data transfer rate based on the horsepower of the device or the number of ports.
In software, necessary information might include the number of backups, backup tapes, percent utilization, and percent scratch. IT management needs answers to questions such as "If I back up this application, what will it do to my network, and how much back-end storage will I need to hold it?"
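Two of the what-if calculations above can be reduced to simple arithmetic, sketched here under stated assumptions: growth is modeled as a linear trend, and a switch's capacity is taken as ports times per-port bandwidth. The function names are illustrative.

```python
# Toy what-if capacity math. Assumption: free space shrinks linearly at
# the observed daily growth rate, and switch capacity is simply the
# aggregate bandwidth of its ports.

def days_until_full(free_gb: float, daily_growth_gb: float) -> float:
    """Rough runway estimate from current free space and growth rate."""
    if daily_growth_gb <= 0:
        return float("inf")   # not growing: never fills
    return free_gb / daily_growth_gb

def switch_capacity_mb_s(ports: int, port_rate_mb_s: float) -> float:
    """Express a Fibre Channel switch's capacity as aggregate bandwidth."""
    return ports * port_rate_mb_s
```

A real planner would also fold in free slots, unassigned volumes, and model-based growth limits, as the text describes; this only shows the shape of the calculation.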
In mainframe environments, this technology is a mature science. In a world of open systems connected to a set of high-end, multi-platform storage facilities, it is embryonic. VTOCs (Volume Tables Of Contents) and VVDSs (VSAM Volume Data Sets) provide many of the answers on S/390 platforms. In order to provide this function for the enterprise, ESRM software must understand every flavor of operating platform, all of the platform file systems, the configuration of every vendor device and their associated interfaces, and every flavor of storage management software.
Configuration Management
Initially, it might seem virtually impossible to think about a common API which would allow ESRM software to configure all OEM (Original Equipment Manufacturer) storage facilities. But on further investigation, there are many similarities across all storage facility devices. Today, much of this information is kept at the host level. To allow an ESRM agent to collect this data directly, the storage facility would need to keep track of all performance counters.
Consider the high-end DASD systems being built today. They all have cache, some with multiple layers like drawer cache. Some have NVS (nonvolatile storage) for writes; others emulate this in the read cache. All have host adapters in the upper interfaces of the device. Each host adapter has a certain number of ports of various flavors, which connect to the host. All vendors have lower interfaces to the disks, called disk adapters which have a number of ports that connect to various transport types (SSA, SCSI, etc.). All have DDMs (Disk Device Modules) which usually fit into drawers and have varying amounts of raw storage capacity. There are only so many RAID types supported by vendors that can be addressed by a common API.
Other notions, such as sparing, mirroring, remote copy, and instantaneous copy also have common threads that could be represented in industry-wide models. All storage facilities have the notion of logical configuration and physical configuration data, and the means to switch between them easily.
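The common elements listed above (cache, NVS, host adapters, disk adapters, DDMs, RAID type) suggest a vendor-neutral configuration model. Here is a hedged sketch of what such a model might look like; the class and field names mirror the text's terminology and are not drawn from any real standard.

```python
from dataclasses import dataclass
from typing import List

# Sketch of a vendor-neutral model for a high-end DASD storage facility,
# built only from the common elements the text identifies.

@dataclass
class HostAdapter:
    ports: int
    port_type: str            # e.g. "ESCON", "SCSI", "Fibre Channel"

@dataclass
class DiskAdapter:
    ports: int
    transport: str            # e.g. "SSA", "SCSI"

@dataclass
class DasdFacility:
    cache_gb: float
    nvs_gb: float             # nonvolatile storage for writes
    raid_type: str            # e.g. "RAID-5"
    host_adapters: List[HostAdapter]
    disk_adapters: List[DiskAdapter]
    ddm_count: int            # Disk Device Modules
    ddm_capacity_gb: float

    def raw_capacity_gb(self) -> float:
        """Raw capacity across all DDMs, before RAID overhead."""
        return self.ddm_count * self.ddm_capacity_gb
```

Because every vendor's box can be described by one instance of this model, a common API could read and set configuration without per-vendor code.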
Fibre Channel configurations have topology connectivity management for hosts, devices, and the interconnect fabric of hubs and switches. Host connectivity is through adapters, which have a certain number of ports of a certain type. Hosts feed into switches in the fabric, which can connect directly to storage adapters, to other switches, or to hubs. Hubs can be cascaded to other hubs or to device adapters.
IT departments need to be able to see the current configuration and they need to understand when a physical failure occurs, and what application(s) were affected. They need to be able to set the configuration based on business requirements such as high-availability, and high accessibility. Lastly, they need to be able to do this to any OEM device through the same user interface.
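Answering "which applications lose their path when this component fails?" is a reachability question over the fabric topology. A minimal sketch, with the topology as a plain adjacency map and entirely illustrative node names:

```python
from typing import Dict, List, Set

# Sketch: model the fabric as an adjacency map and compute which nodes
# a host can still reach when some components are down. This is the core
# of the failure-impact question the text raises.

def reachable(topology: Dict[str, List[str]], start: str,
              failed: Set[str]) -> Set[str]:
    """Nodes reachable from `start` when `failed` components are down."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen or node in failed:
            continue
        seen.add(node)
        stack.extend(topology.get(node, []))
    return seen
```

Comparing the reachable set before and after marking a component failed tells the administrator exactly which devices, and hence which applications, were affected.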
Performance Management
In a world of high-end, multi-platform, intelligent subsystems, it becomes critical to do more than the standard, classical performance analysis of problem isolation with the upper or lower interfaces, cache, and NVS overload. IT managers must drill down to the top volumes and determine the platform, the application, and even the file causing the problem. Today this is impossible, because there are no common platform-independent APIs to access standard, reliable performance information from all OEM storage facilities.
For Fibre Channel management, it may involve managing zoning to ensure that the critical business applications get the bulk of the traffic capacity. It may also include recognizing trapped packets, which are stuck in the fabric and eating up latent capacity.
Performance management of virtual tape servers might include things like monitoring the DASD buffer for hit-ratios of virtual tapes, and virtual mounts very similar to cache management in DASD storage facilities.
In software such as DFSMShsm and ADSM, it would include the ability to monitor automatic workloads such as backup or space management compared to the window they are expected to run in so that alerts can be externalized to start additional workload tasks if necessary.
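The drill-down described above can be sketched in two steps: rank the volumes by activity, then map each hot volume to its owning platform and application. The lookup table stands in for the platform-independent APIs the text says are missing; every name here is illustrative.

```python
# Sketch of the performance drill-down: find the busiest volumes, then
# resolve each one to a platform and application. The `owners` mapping
# is a stand-in for the cross-platform API that does not yet exist.

def top_volumes(io_rates: dict, n: int = 3):
    """Return the n busiest volumes as (volume, rate) pairs."""
    return sorted(io_rates.items(), key=lambda kv: kv[1], reverse=True)[:n]

def drill_down(volume: str, owners: dict) -> dict:
    """Map a hot volume to its platform and application, if known."""
    return owners.get(volume,
                      {"platform": "unknown", "application": "unknown"})
```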
Availability Management
Basically, IT departments don't want to recover. They want the ability to recover, but they simply don't want to fail in the first place. Availability management is about preventing failure, correcting problems as they happen, and warning of key events long before the situation becomes critical.
For example, monitoring the number of I/O errors on a tape head in order to automatically mount a cleaning cartridge is a good example of availability management. Another example would be a high-availability function that, upon the failure of a DASD mirrored pair, would search for a spare, break the mirrored pair, re-mirror the good drive with the spare, and page the customer engineer to repair the bad drive so that the system does not go down.

Indeed, one common thread in all data centers today is the fact that there are fewer people to manage the ever-growing farm of enterprise storage. Reports, graphs, and real-time monitoring are useful, but only to a point. There are no people to sit in front of "GUI glow meters" to monitor the system. ESRM software must provide easy automation trigger events tied in with policies and thresholds to allow the monitoring function to operate without people. There is an infinite set of automation and policy management functions that could be provided under ESRM software.
For example, if DFSMShsm is halfway through the Primary Space Management window but only one-quarter of the way through the volumes, then there is a good chance that it won't complete. If PSM does not complete, then the company won't have the free space available to do its business every day. Real-time monitoring would let the storage administrator see this as it was happening--if he/she were sitting in front of the screen at the time. A report or graph will let the storage administrator know this after the fact. Why not externalize a trigger event that allows the storage administrator to plug in an automation script to automatically start additional Primary Space Management tasks under DFSMShsm so that it will complete?
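The DFSMShsm example above is just a linear projection: halfway through the window but only a quarter through the volumes predicts a miss, so the trigger should fire. A sketch of that arithmetic follows; the assumption that throughput scales linearly with task count is ours, not the article's.

```python
import math

# Window projection for an automated workload such as PSM. Fractions are
# in [0, 1]: elapsed = fraction of the window used, done = fraction of
# the volumes processed. Assumes work scales linearly with task count.

def predicted_to_miss(elapsed: float, done: float) -> bool:
    """At the current rate, will the run overrun its window?"""
    if elapsed <= 0:
        return False          # window hasn't started yet
    return done < elapsed     # behind the linear pace

def extra_tasks_needed(elapsed: float, done: float,
                       current_tasks: int) -> int:
    """How many tasks to add so the projected finish lands in the window
    (requires done > 0 and elapsed < 1 for a meaningful projection)."""
    if not predicted_to_miss(elapsed, done) or done <= 0 or elapsed >= 1:
        return 0
    needed_total = math.ceil((1 - done) * elapsed * current_tasks
                             / (done * (1 - elapsed)))
    return max(needed_total - current_tasks, 0)
```

With the article's numbers (elapsed 0.5, done 0.25, two tasks running), the projection says four more tasks are needed, which is exactly the automation script the trigger event would start.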
Outboard Management
This discipline addresses the management of hardware that contains built-in data movement and high-data-availability functions. There are a lot of useful, time- and people-saving functions that ESRM software could provide.
Today, many data movement functions are being provided by various storage vendors, especially in the high-end DASD storage facilities. The data mining industry and the Y2K problem have created a huge market for data replication products such as DataReach, HDME, TimeFinder, ESP, InfoMover, InfoSpeed, FileSpeed, and SnapShot. The business continuance industry and disaster/recovery requirements have forced outboard storage technologies for remote data copy, with functions such as concurrent copy, PPRC, XRC, and SRDF.
Although these functions are powerful, they do require some user management, not only for data identification but also for scheduling, starting, stopping, and error handling. In addition, the user is expected to understand the nuances of every vendor's twist on the particular data/device movement function.
Policy Management
Policy management is probably the most nebulous ESRM discipline, because the scope of possible policies is so large. For example, imagine a simple policy which states that if any port of a Fibre Channel switch goes down, then the appropriate person should be paged. This is fairly straightforward. As we move up the food chain in this discipline, we see more complex possibilities. How about a policy that states that you never want to run out of free space? Or how about specifying an average of 6 milliseconds or less on every I/O against a file that has PROD as the second-level qualifier?
As we wander through the policy ecosystem from an IT perspective, the policy levels can get incredibly complex. Banks in the United States must have all transaction summaries complete in order to determine the M1 and M2 money supply, and other important statistics. If the daily deadline is missed, banks are forced to pay huge fines. How about a policy which states that a bank never wants to miss this deadline?
No single policy will cover all of the ESRM disciplines. It is clear that users want the system to manage itself as much as possible so that they can concentrate on doing whatever is necessary to have a successful business. At a minimum, ESRM software will have to provide primitives to allow automation of basically anything that could (and should) be automated. Combinations of the primitive policies may form the actual business policies. ESRM should provide the framework for establishing those policies, for setting the controls/thresholds/auto-scripts, and for managing the storage resources based on those policy definitions, thresholds, and controls.
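The idea of primitives composed into business policies can be sketched as predicates over observed metrics, with combinators to build larger policies. This is purely illustrative; no real policy language is implied, and the metric names are invented.

```python
# Sketch: policy primitives as predicates over a metrics snapshot, with
# a combinator to assemble business policies. All names are hypothetical.

def threshold(metric: str, limit: float):
    """Primitive: fire when a metric exceeds its limit."""
    return lambda state: state.get(metric, 0) > limit

def any_of(*policies):
    """Business policy: fire if any constituent primitive fires."""
    return lambda state: any(p(state) for p in policies)

# "Never run out of free space", approximated as a combination of two
# primitives over an illustrative disk pool and scratch-tape pool.
never_out_of_space = any_of(
    threshold("pool_used_pct", 90),
    threshold("scratch_tapes_used_pct", 95),
)
```

The business policy is just a composition; the framework's job is to evaluate these predicates against live data and route firings to auto-scripts.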
If Enterprise Storage Resource Management (ESRM) software is to discover and manage storage resources so that a user can manage them from anywhere in the enterprise, then some key architectural decisions need to be made.
First of all, managing all resources from a client or terminal with a common look and feel implies a BUI (Browser User Interface) running from any platform under the user's preferred Web browser, given today's technology choices.
Secondly, if ESRM is to discover and manage resources from a single point, this implies an ESRM manager and a set of ESRM agents for the individual storage resources. The ESRM manager must contain services to handle user login, administration, and security authorization; database functions; logging; reporting; graphics; auto-scheduling for both data collection and reporting; and automation events.
The whole topic of ESRM discovery agents has implications for the architecture as well. For example, how does the storage resource get found in the first place? Today, we are probably implying TCP/IP, which means that the storage resources need to have an IP server to be auto-discovered. When a storage resource is "discovered," it is important to know what type of storage resource it is so that the correct interface mechanism can be used for that class of resource. For example, the ESRM manager would want to know whether it was a high-end, multi-platform, intelligent storage facility, a Fibre Channel switch, or a small tape library. This architecture would support a unique agent for each general storage resource class. Once the resource type was known, the ESRM manager could ensure that the proper version of the particular agent type was installed on the resource in order to obtain the asset, configuration, capacity, and performance information.
If the storage resource has some data that needs to be monitored for automation triggers, then it needs to have an SNMP server for SNMP traps for automation enablement.
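Once traps arrive, something has to route each one to the right automation script. A toy router is sketched below; the trap identifiers and handler wiring are assumptions for illustration, and a real deployment would sit behind an actual SNMP stack rather than this plain dictionary.

```python
# Toy trap router: maps incoming trap identifiers to automation
# handlers. Illustrative only; it models the dispatch step that would
# follow an SNMP trap's arrival, not SNMP itself.

class TrapRouter:
    def __init__(self):
        self.handlers = {}

    def on(self, trap_id: str, handler):
        """Register an automation handler for a trap identifier."""
        self.handlers.setdefault(trap_id, []).append(handler)

    def receive(self, trap_id: str, payload: dict) -> int:
        """Dispatch a trap to its handlers; return how many ran."""
        matched = self.handlers.get(trap_id, [])
        for handler in matched:
            handler(payload)
        return len(matched)
```

This is the automation-enablement layer: the storage administrator registers scripts against events, and unhandled traps (return value 0) can be escalated.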
Not all agents will be that simple. An intelligent agent might use a long history of thousands of values before it sets off an event trigger, which means the agent must first collect the germane information. How does the agent code get to the storage resource in this case? What operating platforms are supported? The only reasonable answer is to have ESRM agents become Java applications that require only a JVM (Java Virtual Machine) on the storage resource. This doesn't eat up a lot of storage or computing resources.
This brings up another point. Many of the storage resources will not want these ESRM agents to be eating up their CPU cycles, or other resources. It's doubtful that any DASD storage vendor would want to dedicate 75% of the processing power to ESRM agents. They want nothing to inhibit the throughput of their I/O subsystems. This would be true of the server platforms as well. Imagine a bank dedicating 75% of the processing power to ESRM agents on the server that runs their ATM system!
Architecturally, this translates to the concept of thin ESRM agents that get in, get the needed data to the ESRM manager, and then get out. A mobile Java agent (or aglet, as it is called) running under a JVM fits the bill.
Lastly, where is the data being stored, and what does the database look like? The obvious answer is that the ESRM manager platform keeps an inventory of this information in an industry-accepted relational database.
All in all, this is a pretty aggressive set of functions for any software vendor to take on. What is the tie-in to existing systems-management platforms?
Systems Management Tie-in
There is a set of functions described in this article that relate very closely with existing systems management software (e.g., Tivoli, CA-Unicenter, and HP OpenView). Why not just use one of them?
There are a number of interesting points here that ESRM development corporations will have to address. First of all, there are users with multiple terabytes of storage who have no systems management software. This, of course, could change in the future.
Secondly, if a user chooses one of the existing systems management software vendors, this is not a light decision at all in terms of cost, time, and people. They probably won't switch in the near future.
A third point of consideration is the amount of traffic that would be sent up to a systems management console. Imagine every message from software like DFSMShsm and ADSM, plus SNMP/MIB (Management Information Base) traps from every storage device and every server platform, coming to a single console. It could get overwhelming.

Lastly, a pragmatic consideration is the administrator of the storage resource. Usually, there is a storage administrator for products like DFSMShsm and ADSM. People are sometimes given responsibility for the tape libraries or the DASD subsystems as well. These people know and understand these resources. They don't usually understand how to set up systems management automation. Also, that process sometimes involves change control that must be scheduled at some time in the future. Storage administrators need to be empowered so that instantaneous decisions can be made.
If DFSMShsm is not going to complete its backup window, then ESRM should have the function to allow the administrator to start another backup task. But every backup task eats up another tape drive. So why not empower the administrator to start up to 10 backup tasks? After that point, allow an alert to be sent to a higher-level systems management facility to make the determination if adding that 11th tape drive will impact production. This is layered systems management. It allows the user to choose local automation, systems automation, or a layered combination.
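The layered-management rule above reduces to a small decision function: act locally up to a task ceiling, then escalate. The ceiling of 10 comes straight from the text; the function and action names are illustrative.

```python
# Sketch of layered systems management: local automation starts backup
# tasks up to a ceiling, after which an alert escalates the decision to
# a higher-level systems management facility.

def next_action(running_tasks: int, behind_schedule: bool,
                local_limit: int = 10) -> str:
    """Decide the next step for a backup window that may be at risk."""
    if not behind_schedule:
        return "none"             # window projection is fine
    if running_tasks < local_limit:
        return "start-task"       # storage admin's local automation
    return "escalate"             # e.g. ask whether an 11th drive is OK
```

This lets the user choose local automation, systems automation, or a layered combination, exactly as the article argues.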
Given that we want to perform all of the disciplines above on all storage resources, are there any standard platform-independent interfaces that work on all OEM storage facilities? Other than the mainframe channel interfaces such as Read Device Characteristics, and LISTDATA, the answer is no.
The information that one gets today through channel interfaces from those same high-end, multi-platform, intelligent storage facilities is wonderful. The engineers of all storage manufacturers really bent over backward to preserve the 3880/3990 performance model, allowing incredibly accurate classical performance analysis of cache, NVS, upper interfaces, lower interfaces, and volumes, when connected to a mainframe-only environment. We need to extend this wonderful information to the open systems world through an IP connection. We need to allow easier management of storage resources by not only providing the API but also specifying a recommended underlying architecture and standards. Supporting large SAN configurations with all the high availability, reliability, and security expected by today's enterprise IT centers will require considerable cooperation and coordination among vendors in the storage industry.
Publication: Computer Technology Review
Date: Sep 1, 1999