Mirror, Mirror In The Data Center.
Remote data copy for disaster recovery
The twin nightmares of lost data and systems outages have pushed disaster recovery and remote backup to the top of IT agendas in the age of e-commerce. For System/390 server environments, continuous system and data availability can be best achieved through redundant design with clustering and failover features. All critical components are duplicated to maintain system and application availability in case of a single component failure.
In a disaster, the most common and trusted facility for business resumption is remote copy (also referred to as remote data mirroring). The remote copy solution produces an exact copy of production data in real time at a remote site. The remote site can function as the production site in a seamless failover, should the main production site go down or encounter serious problems.
Several approaches and technologies are available for remote copy for the S/390 platform. The main options are host-based and controller-based remote copy and each has different strengths and limitations. Also, there are many other data copying products in the market, which can provide point-in-time copies of production data, on demand, for testing or concurrent backup.
Host-Based Remote Copy
A host-based, software-assisted data mirroring facility such as IBM's eXtended Remote Copy (XRC) utilizes a program supplied with DFSMS/MVS that monitors all updates to the local primary volumes and asynchronously applies the same updates to the remote secondary volumes. This enables the host application to process the next transaction or I/O immediately after the write to the local primary volume is complete, without having to wait for the write to the remote secondary volume to finish. Thus, performance is usually minimally impacted and the distance between the primary and the secondary sites can be almost unlimited.
However, XRC requires host resources such as CPU, main and expanded storage, and DASD for control and journal data sets, as well as controller resources (such as cache for the sidefile to temporarily store the updates). The secondary volume may be out of sync and behind in currency with the primary volume and, therefore, requires special attention in recovery situations.
Fig 1 shows the components of XRC and the asynchronous operation of the remote update process. The System Data Mover (SDM) software is a component of DFSMS/MVS, which can reside and run in the primary site application host, in the remote recovery host, or in a separate host in a third site. The SDM host is connected to both the primary and remote secondary controllers via ESCON links or channel extenders through common carrier facilities. A common system timer is used to time stamp all updates to XRC volumes in order to ensure that updates to primary and secondary volumes are in the same time sequence. When the primary controller receives a write update from the primary host, an I/O complete response is immediately returned. The data is put in a sidefile in the controller cache.
The SDM software collects updates from all primary controllers periodically, or on requests from controllers that have reached a threshold. The software uses the time stamps on the update records and a special algorithm to form a group of records called a "Consistency Group," which guarantees data and sequential update integrity. This Consistency Group of updates is then applied to the secondary volumes to ensure that the secondary volumes are updated in exactly the same sequence as the primary volumes. Thus, sequential integrity or time consistency is maintained. Due to the asynchronous remote update implementation, applications do not have to wait for the remote update to be completed. The performance impact due to remote copy is therefore normally minimal, and extended distances can be supported.
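The grouping idea described above can be sketched in a few lines. This is an illustrative model, not IBM's actual SDM algorithm: all names are hypothetical, and it assumes each controller delivers its updates in timestamp order. An update is safe to include only when every controller has reported past its timestamp, since otherwise an older, not-yet-delivered update could break the time sequence.

```python
# Illustrative sketch of consistency-group formation from timestamped
# updates queued by several primary controllers (not vendor code).

def form_consistency_group(controller_queues):
    """controller_queues: dict mapping controller id -> list of
    (timestamp, volume, data) tuples, each list sorted by timestamp."""
    # A controller with no pending updates gives no lower bound on what it
    # may still deliver, so no group can safely be formed yet.
    if any(not q for q in controller_queues.values()):
        return [], controller_queues

    # The consistency boundary is the smallest of the newest timestamps.
    boundary = min(q[-1][0] for q in controller_queues.values())

    group, remaining = [], {}
    for cid, q in controller_queues.items():
        group.extend(u for u in q if u[0] <= boundary)
        remaining[cid] = [u for u in q if u[0] > boundary]

    # Apply the group to secondary volumes in timestamp order.
    group.sort(key=lambda u: u[0])
    return group, remaining

queues = {
    "ctl-A": [(1, "VOL001", "w1"), (4, "VOL002", "w4")],
    "ctl-B": [(2, "VOL003", "w2"), (3, "VOL001", "w3")],
}
group, rest = form_consistency_group(queues)
# Updates at timestamps 1, 2, 3 form the group; timestamp 4 must wait.
```

The point of the boundary is exactly the integrity guarantee the article describes: the secondary volumes only ever see a prefix of the primary update stream, in the same order.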
A drawback of this implementation is that data in transit (in the primary controllers or in the SDM not yet formed into a "Consistency Group") will be lost in case of a disaster in the production site. Data recovery efforts are, therefore, required to recover the lost transactions and data. The time and efforts required depend on the customer's environment (i.e., application and configuration complexity, update intensity, and XRC set up). However, applications can be restarted at the remote site quickly after a disaster since XRC can maintain a set of remote volumes that can guarantee data integrity and time consistency.
Controller-Based Remote Copy
There are three different general implementations of controller-based remote copy: synchronous, semi-synchronous, and asynchronous operation.
Synchronous Operation
The benefits of controller-based remote copy via synchronous operation include the ability to synchronize remote volumes with production volumes, to update secondary volumes with the exact same sequence as the primary volumes to guarantee data and sequential update integrity, and to eliminate data loss in case of a disaster in the primary site. Controller-based remote copy enables current, reliable data backup and the ability to restart applications remotely with minimum delay.
The drawback with controller-based remote copy via synchronous operation is that it is necessary to wait for the remote write to complete. System and application performance can, therefore, be impacted. The degree of performance degradation depends on the update rate, record block sizes, and distance between the primary and the secondary sites (assuming that there are no resource or other bottlenecks). The supported distance is generally restricted to the ESCON distance of 43km maximum.
IBM's Peer-to-Peer Remote Copy (PPRC) and products from Amdahl, Hitachi, EMC, and others provide a synchronous data mirroring capability between the primary and the secondary storage subsystems, implemented by the controller firmware without host involvement. Host software such as TSO commands is used to start, monitor, and control the remote copy operations.
Fig 2 shows an example of a PPRC configuration and operation. The local and remote controllers are connected by ESCON links. Primary volume updates are not posted complete to the application host until the data is safely written to the cache of the secondary controller and the primary controller receives confirmation.
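The synchronous write path in Fig 2 can be summarized as: local write, then remote write, then and only then an I/O complete to the host. The sketch below is a minimal model under that assumption; the classes and return values are invented for illustration and do not correspond to real controller firmware.

```python
# Minimal sketch of a synchronous (PPRC-style) write path: the primary
# does not post "I/O complete" to the host until the secondary controller
# has acknowledged the update. All names here are hypothetical.

class Controller:
    def __init__(self):
        self.volumes = {}

    def write(self, volume, data):
        self.volumes[volume] = data
        return "ack"

class SynchronousPrimary(Controller):
    def __init__(self, secondary):
        super().__init__()
        self.secondary = secondary

    def host_write(self, volume, data):
        self.volumes[volume] = data               # local write to cache
        ack = self.secondary.write(volume, data)  # wait on the remote write
        if ack != "ack":
            raise IOError("remote write failed; error surfaced to host")
        return "io-complete"                      # only now is the host posted

secondary = Controller()
primary = SynchronousPrimary(secondary)
primary.host_write("VOL001", "payroll-record-17")
# Both copies are identical the moment the host sees "io-complete".
```

The cost the article describes falls out of the structure: every host write pays the round trip to the secondary, which is why update rate, block size, and distance govern the performance impact.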
PPRC provides users two options to ensure remote data integrity in case of remote copy error conditions. The first option is to present a permanent I/O error to the host, if the write cannot be completed successfully to both the primary and remote volumes. This usually brings down the application, resulting in all primary and secondary volumes fully duplexed with no data loss. This option provides the highest level of data currency and integrity, but at the cost of system and application availability.
The second option for ensuring remote data integrity involves exploitation of an MVS facility called the MVS DASD Error Recovery Procedures (ERP), which creates a timing window allowing the user to react to the error conditions. After the timing window expires, I/O will resume on the primary volume with the corresponding secondary volume suspended. This can avoid an application outage, if the primary volume is not the source of errors. However, if the secondary volume is out of sync, data loss will result should a subsequent disaster follow. The second option is, therefore, much more complex to implement. Users usually integrate this facility with other software, such as automation and operations management packages, in an optimum disaster recovery solution, maximizing primary site system and application availability while minimizing recovery time in case of primary site disasters.
The IBM Geographically Dispersed Parallel Sysplex (GDPS) is an excellent example of such an integrated disaster recovery solution, and it can significantly improve availability by reducing the impact of both planned and unplanned outages. GDPS is a multi-site management facility that combines system code and automation with the capabilities of IBM S/390 Parallel Sysplex clustering technology, storage subsystem mirroring, and databases to manage storage, processors, and network resources to minimize the impact of system outages. By spreading a Parallel Sysplex cluster over two sites and duplexing all data, workloads and applications can be manually or automatically switched between sites to avoid planned outages and minimize unplanned outage disruptions.
GDPS uses PPRC synchronous remote copy to minimize or eliminate data loss, and uses Parallel Sysplex cluster functions along with system automation to minimize the duration of the recovery window. GDPS detects the first indication of a potential disaster and uses automation, the MVS Error Recovery Procedures (ERP), and PPRC facilities to create a set of secondary volumes that guarantee sequential consistency and data integrity. Should a disaster occur, applications can be restarted in the remote site using automation and Sysplex facilities, and all data accesses can be automatically switched to the secondary volumes using the PPRC facility. All major S/390 storage vendors, such as Amdahl, Hitachi Data Systems, and EMC, have announced their support for GDPS.
Semi-synchronous remote copy operation is an attempt to reduce the performance penalty of synchronous operation and to extend the distance that can be supported. Semi-synchronous operation should result in a significant performance improvement over synchronous operation if there are relatively few writes and if they are evenly distributed across different volumes. Normally, however, I/Os are "peaky" and updates come in bunches, resulting in inevitable waits for remote writes to complete.
Also, semi-synchronous operation may create a serious data integrity exposure in a multi-controller environment, since there is no coordination between controllers to ensure that remote updates are applied to the secondary volumes in the exact same time sequence as the primary volumes. If a disaster should occur, a full recovery process must be implemented to recover the records in flight, even if there may be only one record per volume. Users lose the data currency and integrity assurance that comes with synchronous remote copy, and may experience little performance benefit in return. This makes semi-synchronous operation more suitable for data migrations than for disaster recovery solutions. IBM's PPRC does not support this mode of operation, but products from some vendors provide it as an option.
In semi-synchronous implementation, an update from the host will receive an "I/O Complete" response immediately from the primary controller after it puts the I/0 in the primary controller queue to be transferred to the remote controller later. If there already is a record in the queue for the volume to be updated, the I/O will be rejected until the previous update has been written to the secondary volume. Thus, the remote volume is never more than one record out of sync with the primary volume (Fig 3).
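The "never more than one record out of sync per volume" rule can be sketched directly. This is an illustrative model of the behavior described above, not any vendor's firmware; class and method names are invented.

```python
# Sketch of semi-synchronous behavior: the primary acknowledges a write
# immediately and queues it for the remote side, but rejects a new write
# to a volume that already has one update in flight, so the secondary is
# never more than one record behind per volume. (Illustrative only.)

class SemiSyncPrimary:
    def __init__(self):
        self.in_flight = {}   # volume -> the one queued update

    def host_write(self, volume, data):
        if volume in self.in_flight:
            return "rejected: previous update not yet on secondary"
        self.in_flight[volume] = data
        return "io-complete"  # posted before the remote write happens

    def remote_write_done(self, volume):
        self.in_flight.pop(volume, None)

p = SemiSyncPrimary()
p.host_write("VOL001", "rec-1")   # accepted, queued for the remote side
p.host_write("VOL001", "rec-2")   # rejected: one record already in flight
p.remote_write_done("VOL001")
p.host_write("VOL001", "rec-2")   # accepted now that the queue drained
```

The model also shows why bursty workloads defeat the scheme: a second write to a busy volume stalls just as it would under synchronous operation.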
Due to the shortcomings of semi-synchronous operation, some vendors instead offer an asynchronous mode of operation, such as Adaptive Copy.
Fig 4 shows an example of Adaptive Copy operation. Updates from the host will receive an "I/O Complete" response immediately after they are received by the primary controller. The updates are, then, queued in the primary controller to be transferred to the secondary controller later. The user can control the amount of data allowed to be out of sync (i.e., the maximum number of updates to be queued).
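The bounded out-of-sync behavior described for Adaptive Copy can be modeled with a capped queue. The sketch below is an assumption-laden illustration, not EMC's implementation; the throttling response at the limit is one plausible choice.

```python
# Sketch of Adaptive Copy-style asynchronous operation: writes complete
# immediately and queue for later transfer, and the user caps how far out
# of sync the secondary may drift. Names and behavior at the limit are
# illustrative assumptions.

from collections import deque

class AdaptiveCopyPrimary:
    def __init__(self, max_out_of_sync):
        self.queue = deque()
        self.max_out_of_sync = max_out_of_sync

    def host_write(self, volume, data):
        if len(self.queue) >= self.max_out_of_sync:
            # Limit reached: hold the write until the backlog drains.
            return "throttled"
        self.queue.append((volume, data))
        return "io-complete"  # host never waits for the remote write

    def drain_one(self, secondary_volumes):
        # Background transfer of the oldest queued update to the remote site.
        if self.queue:
            volume, data = self.queue.popleft()
            secondary_volumes[volume] = data

p = AdaptiveCopyPrimary(max_out_of_sync=2)
p.host_write("VOL001", "a")  # completes immediately
p.host_write("VOL002", "b")  # completes immediately
p.host_write("VOL003", "c")  # throttled: queue is at its limit
```

Everything still in the queue at the moment of a disaster is lost, which is the data-currency trade this mode makes for performance and distance.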
Hitachi Data Systems recently announced a controller-based asynchronous remote copy implementation, which provides enhanced data consistency and integrity using techniques similar to XRC. It is, therefore, subject to similar considerations as XRC, such as data loss in case of disasters at the production site, and it is not compatible with the GDPS implementation.
Data integrity is fundamental to an effective disaster recovery solution. Also of major importance is the ability to handle a rolling disaster, which is a disaster occurring over a period of time (i.e., milliseconds, seconds, or minutes) instead of instantly and, thus, resulting in extensive damage and data loss. To handle rolling disasters effectively, the disaster recovery solution must be able to detect the disaster early and the remote site must be aware of the error conditions and any data inconsistency or out-of-sync condition. This is normally handled effectively with a host-based remote copy implementation such as XRC, since it has connectivity to both the primary and secondary volumes and has overall knowledge of primary and secondary site status.
Data currency, performance, and distance must be carefully weighed in deciding between synchronous and asynchronous remote copy implementations. Synchronous remote copy can provide the highest degree of data integrity and currency, which can reduce recovery time and efforts. Therefore, most customers will choose a synchronous remote copy configuration to implement their disaster recovery solution. Remote sites should, therefore, be chosen within the distance restrictions. If that is not possible, then asynchronous remote copy solutions such as XRC, which can guarantee data integrity, provide the next best choice.
With the maturing of parallel sysplex clustering and remote data mirroring technologies, fast recovery from total site failures and disasters is now a reality. To implement an effective disaster recovery solution, the data mirroring technology must guarantee data integrity and be able to handle rolling disasters effectively. Together with synchronized remote copy solutions, GDPS is the ultimate continuous availability standard, which has evolved to fulfill the promise of total protection against component and site failures with automatic workload and application switching. With the trend towards greater automation and standardization, these technologies help to make disaster recovery a "no brainer" with failover features and less human intervention.
Michael Wong
The opinions contained in this article are those of the author and not of Amdahl Corporation.