High Speed Transaction Recovery.
But the need for availability is increasing. The days of the long batch window where databases can be offline for extended periods to perform nightly processing are diminishing. Exacerbating this trend is the drive toward e-business. Coupling businesses to the Internet is dramatically altering the way we do business. It has created expectations for businesses to be more connected, more flexible, and importantly, more available.
When you integrate the Web with database management, heightened expectations are placed on DBAs to keep databases up and running more smoothly and for longer periods of time. When your business is on the World Wide Web, it never closes. People expect full functionality on Web sites they visit regardless of the time of day. Remember that the Web is worldwide. It may be three o'clock in the morning in New York, but it is always prime time somewhere in the world. So an e-business must be available and operational 24 hours a day, 7 days a week, 365 days a year (366 for leap years).
Of course, e-business is not the only driver for increased availability. Other factors include:
* The "fast food" mentality of customers who demand excellent service and demand it "now!"
* The desire to gain a competitive advantage in the marketplace by offering superior services at a time of the customer's choosing.
* The need to react to competitors who offer better service to customers because of higher data availability.
With demands for higher and higher availability, especially for e-commerce, traditional forms of database recovery are becoming more and more inadequate. Today, every DBA must become well versed in Transaction Recovery techniques in order to be ready to develop the optimal approach for each recovery situation. But what is "Transaction Recovery?" This article discusses Transaction Recovery from a DB2 for OS/390 point-of-view. Herein we will learn how Transaction Recovery provides the speed and ease of a point-in-time recovery with the selective capability provided by custom programming. Our discussion of Transaction Recovery will cover three areas:
1. The types of Transaction Recovery available and the general characteristics of the problems addressed.
2. The steps required to perform each type of Transaction Recovery, and the conditions and potential problems associated with Transaction Recovery are explained.
3. The features you will need for a high speed, accurate, and robust Transaction Recovery solution.
Types Of Recovery
There are many different types of recovery that can be performed. The first type of recovery that usually comes to mind is a recovery-to-current to handle some sort of disaster. This disaster could be anything from a simple media failure to a natural disaster destroying your data center. Applications are completely unavailable until the recovery is complete.
Another type of traditional recovery is a Point-in-Time (PIT) recovery. PIT recovery usually is performed to deal with an application level problem. Conventional techniques to perform a point in time recovery will remove the effects of all transactions since that specified point in time. This sometimes can cause problems if there were some valid transactions during that timeframe that still need to be applied.
Transaction Recovery is a third type of recovery that addresses the shortcomings of the traditional types of recovery: downtime and loss of good data. Thus Transaction Recovery is an application recovery whereby the effects of specific transactions during a specified timeframe are removed from the database.
Bad Transactions Happen To Good Databases
Historically, recovery was performed mostly because of disasters and hardware failures. However, this is simply not the case anymore. In fact, application level recovery needs to be performed the majority of the time, not hardware recovery. Industry analysts estimate that as much as 80% of application errors are due to application software failures and human error. Although hardware failures and operating system panics were common several years ago, today's operating systems are quite reliable, with a high mean time between failure.
In reality, except for disaster recovery tests, very few DBAs ever need to perform true disaster recovery. While media does fail, it's actually quite rare in this day and age. So, user errors and application failures are the most common causes of problems that require recovery. Therefore, these types of errors also are the primary cause for system unavailability. As databases grow in size and complexity, so, too, do the chances that bad transactions will corrupt the data on which your business depends.
Transaction Recovery Detined
Simply stated, Transaction Recovery is the process of removing the undesired effects of specific transactions from the database. This statement, while simple on the surface, hides a bevy of complicated details. Let's examine the details behind the concept of Transaction Recovery.
Traditional recovery is at the database object level: for example, at the table space or index level. When performing a traditional recovery, a specific database object is chosen. Then, a backup copy of that object is applied, followed by re-applying log entries for changes that occurred after the image copy was taken. This approach is used to recover the database object to a specific, desired point-in-time. If multiple objects must be recovered, this approach is repeated for each database object impacted.
In contrast with the granularity provided by traditional recovery, Transaction Recovery allows a user to recover a specific portion of the table space based on user-defined criteria. So only a portion of the data is affected. Any associated indexes are automatically recovered as the transaction is recovered.
The transaction may impact data in multiple table spaces, too. Thus, a transaction is a set of related operations that, when grouped together, define a logical unit of work within an application. In reality, transactions must be defined by the user's view of the process. This might be the set of panels that comprise a new hire operation. Or perhaps the set of jobs that post to the General Ledger.
Examples of user-level transaction definitions might be:
1. All UPDATEs issued by userid DSGRNTLD since last Wednesday at 11:50 a.m.
2. All DELETEs made by the application program PAYROLL since 8:00 p.m. yesterday.
Why is Transaction Recovery a much-needed tool in the arsenal of e-DBAs? Well, applications are prone to all types of problems, bugs, and errors. Using Transaction Recovery, the DBA can quickly react to application-level problems and maintain a higher degree of data availability. The database does not always need to be taken offline while Transaction Recovery occurs (it depends on the type of Transaction Recovery being performed).
Any number of problems can occur at the application level. Consider:
* Edit checks aren't always 100% reliable. Bad data is insidious. If there is any way garbage can creep in, it will. If you don't believe me, do a spot check on some of your production databases.
* Somebody messed with job scheduling or didn't check completion codes and processes were run out of sequence.
* No matter how much testing is done there is always the possibility that errors will occur when the code hits production.
* Something could be wrong with the system software.
* There are times, in particular in test environments, where you may want to run a test and then just roll the results back and try it again (and again and again....)
So, how exactly is Transaction Recovery performed? Given that we have identified the transaction to recover, there are three types of Transaction Recovery that can be performed:
* Point-in-time--remove all the transactions since a given point in time and then manually rerun or reenter the work that was valid.
* UNDO--back out the bad transactions only.
* REDO--remove all the transactions after a given point in time. Then, redo the good transactions only.
Let's examine each of these possibilities in a little more detail.
Point-in-Time [PIT] Recovery
Point-in-time recovery is the most basic strategy. It is also the only one actually supported by native database utilities. Refer to the diagram in Figure 1. The middle box indicates the bad transactions that need to be removed from the system. The desired result is to maintain "Good Transaction 1" and "Good Transaction 2," while eliminating the "Bad Transactions."
You must be able to determine a common recovery point for a set of table spaces. A DB2 QUIESCE certainly works fine, but if that is not available, you will have to determine a point of consistency to be used for recovery
After the point-in-time recovery, there is still a problem of missing transactions. If the information is still available, the user could rerun or re-enter "Good Transaction 2". Regardless of the type of recovery to be performed, if the error that caused the recovery to be performed is caught too late, subsequent processing could have occurred using the "bad data." How to deal with these types of problems depends on the nature of the data and the type of updates applied, and needs to be handled on an application by application basis.
UNDO Transaction Recovery
The second possibility is to deploy UNDO Transaction Recovery (Figure 2). This is the simplest type of SQL-based Transaction Recovery. It involves generating UNDO SQL statements to reverse the effect of the transactions in error. To generate UNDO SQL, the database log is read to find the data modifications that were applied during a given timeframe and:
* INSERTs are turned into DELETEs.
* DELETEs are turned into INSERTs.
* UPDATEs are turned around to UPDATE to the old value.
To accomplish this transformation you will need a solution that understands the database log format and can create the SQL needed to undo the data modifications.
Note that in the case of UNDO Transaction Recovery, the portion of the database that does not need to be recovered remains undisturbed. When undoing erroneous transactions, recovery can be done online without suffering an outage of the application or the database. But, the potential for anomalies causing failures in the UNDO is certainly a consideration. We will discuss this further below.
REDO Transaction Recovery
The REDO Transaction Recovery strategy is a combination of the first two recovery techniques we have discussed--but with a twist (Figure 3).
Instead of generating SQL for the bad transaction that we want to eliminate, we generate the SQL for the transactions we want to save. Then we do a standard point in time recovery eliminating all the transactions since the recovery point. Finally we reapply the good transactions captured in the first step.
Unlike the UNDO process which creates SQL statements that are designed to back out all of the problem transactions, the REDO process creates SQL statements that are designed to reapply only the valid transactions from a consistent point of recovery to the current time. Since the REDO process does not generate SQL for the problem transactions, performing a recovery and then executing the REDO SQL can restore the table space to a current state that does not include the problem transactions.
To generate the redo SQL statements, you will once again need a solution that understands the DB2 log format and can create the SQL needed to redo the data modifications.
When redoing transactions in an environment where availability is crucial, a PIT recovery can be done and then the application and database can be brought online. The subsequent redoing of the valid transactions to complete the recovery then could be done with the data online, thereby reducing application downtime.
Choosing The Optimum Recovery Strategy
So, what is the best recovery strategy? Of course, the answer is: it depends. While Transaction Recovery may seem like the answer to all your application recovery problems, there are a number of cases where it may neither be possible nor advisable. In determining when to choose Transaction Recovery, you must consider several questions:
1. Transaction Identification. Can all the problem transactions be identified? You must be able to actually identify the transactions that will be removed from the database. Can all the work that was originally done be located and redone?
2. Data Integrity. Has anyone else updated the rows since the problem occurred? If they have, can you still proceed? Is all the data required still available? Recovering after a REORG, LOAD, or mass DELETE may require the use of image copies or the table space may need to be undamaged. Will any other data be lost? If so, can the data lost be identified in some fashion?
3. Availability. How fast can the application become available again? Can you afford to go offline?
These questions actually boil down to a matter of cost. What is the cost of rework and is it actually possible to determine what would need to be redone (what jobs to run, what documents to reenter). This cost needs to be balanced against the cost of long scans of log data sets to isolate data to redo or undo, and the cost of applying that data using SQL. The ultimate Transaction Recovery solution should analyze your environment and the transaction(s) needing to be recovered, and recommend which type of Transaction Recovery to perform.
Transaction Recovery Planning
When planning for Transaction Recovery, you should review your image copy frequency. This frequency will help determine how far back you have to go to get a consistent state for a point-in-time recovery. Also, review your policy on Data Change Capture (DCC). With DCC toggled on, the generation of UNDO and REDO SQL statements will be faster because expensive row completion operations will not be needed during SQL generation. However, your Transaction Recovery solution should be able to function with or without DCC enabled. DCC is an option of the CREATE TABLE DDL statement in DB2. With it, data changes to DB2 tables are written to the log in an expanded format. To implement DCC, you need to alter your tables to include the clause DATA CAPTURE CHANGES.
Transection Recovery Steps
Transaction Recovery can be done by performing the following steps:
1. Identify the problem.
2. Identify the transactions causing the problem.
3. Identify available recovery points.
4. Identify the relative cost of UNDO vs. PIT + REDO (Strategy Selection).
5. Identify the cost of performing additional operations to resolve any integrity issues caused by UNDO.
6. Choose the optimal recovery strategy: PIT, PIT+REDO, or UNDO.
7. Generate the recovery job(s).
8. Execute the recovery job(s).
Transaction Recovery Solutions
In order to avoid the high cost in manpower, time, and machine resources of a manual transaction recovery process, a robust Transaction Recovery solution is needed. The first requirement is the ability to read and analyze log data to:
* provide detailed diagnostic information about the selected transactions
* identify where recovery should begin.
* find all bad transactions.
* store information for analysis.
* generate UNDO and/or REDO SQL for the different types of transaction recoveries.
The Transaction Recovery solution should address all of the problems associated with quickly analyzing problem transactions, determining the scope of the recovery, getting information to determine the best recovery method, and managing recovery resources that result in optimal recovery performance. Such a solution should provide:
* A single point-of-control over each stage of the recovery process. One interface for the entire Transaction Recovery solution is desirable because it minimizes the difficulty of analyzing, generating, and executing the recovery scenario. If you must switch from one interface to another to accomplish the recovery, needless complexity is introduced to an already volatile situation.
* Powerful log analysis capabilities with highly specific search capabilities to allow rapid problem diagnosis.
* Assistance in choosing the optimal recovery method. The solution must understand the tradeoffs in terms of the scope of the transaction, the potential impact on the application, and the time required for each type of Transaction Recovery to determine which type of recovery to perform: PIT, UNDO, or REDO.
* Automated job generation to eliminate programming errors.
* Ability to restart the process should the recovery fail anywhere along the way.
* A high performance 'capture agent' to quickly gather the required elements of the Transaction Recovery.
* A high speed capability for applying the SQL statements during the UNDO and/or REDO processing.
A key aspect of the analysis feature is the ability to specify sophisticated filters to identify the database objects that are affected by the transaction. The power of the filtering will determine how fast problem identification can be completed. It should be possible to identify the transaction by many different criteria, for example, by correlation ID (batch job), correlation type (batch, CICS, IMS), plan name, authorization ID, table space name, table name, object identifier (OBID), column name, update type, or other such qualifying criteria. Often a combination of object fields and unit of recovery (UR) header fields can be used to identify the transactions of interest.
Statistics about the number and types of SQL transactions should also be gathered.
Your solution should provide reports on image copy frequency and Data Change Capture impact to help assist your planning and readiness for Transaction Recovery.
During an actual Transaction Recovery, your solution must be able to analyze the log for a quiet point. Quiet points are often used as a beginning point of recovery. You still must determine if the quiet point found is the correct starting point for the recovery that you need to perform.
In choosing a Transaction Recovery solution, one of the most important features to look for is to provide assistance in choosing between the various recovery methods. Once the information is collected as to what needs to be recovered and to what point in time recovery is possible, a relative work estimate must be prepared for each type of transaction recovery.
Maintaining backout integrity is critical. It must be possible to determine the feasibility and impact of performing an UNDO or REDO process. Data may have been changed after the offending application process has been executed. These types of data changes can have an impact on the integrity of the data after a proposed Transaction Recovery.
Analysis of the subsequent activity on the database is critical to determining the feasibility of the recovery. Generally the activity will conform to known application patterns. It is the responsibility of the DBA and the application group together to determine the impact of the recovery on the subsequent work (and vice versa). It may be that through the judicious selection of the rows to be processed as well as the columns to be included in the generated SQL only the data in error can be processed.
For example, if the process in error just updated one or more columns that have no subsequent activity even if other columns on the same rows have been updated it should be possible to limit the generated SQL to just the "bad" columns leaving the rest of the changes alone.
As a rule of thumb, anomaly analysis will yield one of two outcomes; either there is little subsequent activity and a transaction recovery is feasible or there is a significant amount of activity and transaction recovery may be difficult or impossible.
Let's take a look at several examples of anomalies. Consider the scenario outlined in the following case:
* 08:00 Monday--A row is added for a new customer.
* 12:00 Monday--The new customer enters a large order for widgets.
* 18:00 Monday--The new customer provides a new shipping address.
Now, think about the havoc that would be caused if the same transaction that updates the shipping address at 18:00 also accidentally clears the total orders column for the customer. Simply backing out the transaction would remove the incorrect data (the total orders column) but it would also cause you to lose valid data (the new shipping address).
The key to this issue is an understanding of the error. Once it is understood that the only column in error is the total orders field, the Transaction Recovery can be made sensitive to that column only.
Let's look at a more complicated case. Consider the scenario outlined in the following case:
* 08:15 Tuesday--A row is added for a new customer.
* 09:30 Tuesday--The new customer fails a credit check and is deleted.
* 09:00 Wednesday--The new customer reapplies with updated information and passes the credit check.
In this scenario, what would happen if the job that added the new customer at 8:15 on Tuesday was run out of sequence causing numerous other problems. In this scenario, knowledge of the error alone is not sufficient to understand the implications of backing it out.
The desired result actually is to leave the new customer alone since the final record in the database is actually correct. If you just rollback the bad job, the insert of the new customer's first record will be turned into a delete that will remove the good record. The only alternative is to scan the log for any other activity on the new customer's record and report it. Once reported, the DBA can analyze the anomalies and hopefully make some reasonable choices on how to proceed (ignore changed where the key = 'New Customer'). The bottom line, though, is that the Transaction Recovery process needs to take this type of scenario into account, or good data may be lost.
Let's look at another case. Consider the following scenario:
* 13:00 Monday--An employee receives a major pay increase.
* 23:00 Tuesday--The employee's increase is updated into the PAYROLL table.
* 12:00 Thursday--The employee gets the promotion associated with the pay increase and it is updated into the EMPLOYEE table.
In this case, the payroll clerk had a grievance and mis-entered some data on Tuesday night when the employee's increase was entered into the PAYROLL table. The error is not discovered until Thursday afternoon.
This case illustrates a basic point about the timeliness of problem recognition. The sooner you recognize a problem, the more feasible Transaction Recovery becomes. Had the problem with the clerk been recognized on Wednesday prior to the next payroll run Thursday night, the problem would have been manageable (at least from the employee's point of view). By waiting until Thursday, the possibility of accurately backing out the bad clerk's data has decreased significantly. Even if it turns Out to be possible, a great deal of manual effort most likely will be required to select the set of data to rollback.
If you simply rollback the aggrieved clerk's work, both the employee's pay and promotion will be lost. You may have planned on re-entering the transactions from Tuesday at 23:00, but did you plan on re-entering all the other subsequent transactions (such as those from Thursday at 12:00)?
Let's look at a fourth and final case:
* 14:00 Tuesday--A customer submits a loan application.
* 23:00 Wednesday--The customer's loan application is entered into the system by the clerk.
* 09:00 Thursday--A loan officer uses the information to grant the customer the loan.
Consider the impact of an error being made when the loan application is entered into the system on Wednesday. If the information is mis-entered, users who subsequently rely upon the data could make potentially bad decisions. For example, approving a loan for someone who only qualifies because of data that was entered incorrectly.
This is my favorite example. It illustrates that no matter how much information can be obtained automatically, there is always the possibility of human decisions being made on invalid data. This does not invalidate the usage of Transaction Recovery to correct the problem, because indeed, any type of simple data recovery can not resolve a problem with a human. That is, if the customer already has spent the loan money, no recovery can get the money back!
Transaction Recovery Guidelines
Your recovery solution should produce a work estimate analysis to provide the information needed to choose the optimal recovery solution. This analysis should be based on the relative cost of each option: PIT, UNDO, or REDO.
Once this recovery work estimate analysis is completed, you can choose the most valid and effective method for recovery. Based on the recovery option you choose, Transaction Recovery should be able to automatically generate the job for executing that option.
Once the recovery JCL is generated, your Transaction Recovery solution should provide high speed apply processing for the UNDO or REDO SQL. The Transaction Recovery should use a highly multi-tasked apply process with advanced management capability and the ability to restart the process should it fail. A critical part of the multi-tasking feature is the ability to efficiently distribute work among multiple streams for parallel processing. Workload balancing by table and partition with respect for referential integrity constraints and table group should be supported. The high performance apply should give you the ability to convert from dynamic SQL to static SQL 'on the fly' to enhance performance. Other required features include the ability to process very large volumes of transactions and the option to apply changes either online or in batch.
* Retry on deadlock situations.
The apply process should have the capability to react to anomalies found during the processing of the SQL. The user may have accepted some anomalous situations during th e analysis process. Or it is possible that there has been even more activity on the database after the analysis and new anomalies have appeared. A minimum set of capabilities would include:
* The ability to log and defer SQL statements with problems (e.g. inserts where the key is already in the table, updates where the column values do not match, etc.).
* The ability to ignore any errors and just continue processing.
Although several products could be used to provide a Transaction Recovery solution, an integrated solution is key to providing a tool that makes Transaction Recovery an option that can be chosen with confidence.
In summary, Transaction Recovery has become a critical need in any complete recovery toolbox. The Transaction Recovery solution chosen must provide:
* Powerful diagnostic features for problem identification.
* Automated assistance in choosing optimal recovery method
* Features that provide speed, manageability, and accuracy.
And with a Transaction Recovery solution in your arsenal, you just might be able to deliver the availability required to help transform your company into an e-business.
Craig S. Mullins is director of DB2 Technology Planning for BMC Software (Houston, TX).
|Printer friendly Cite/link Email Feedback|
|Title Annotation:||Technology Information|
|Author:||MULLINS, CRAIG S.|
|Publication:||Computer Technology Review|
|Date:||Apr 1, 2001|
|Previous Article:||Tape and Backup Issues In Storage Area Networks.|
|Next Article:||"Storage Will Be 75% Of All IT Spending Over The Next Couple of Years".|