Protecting million-dollar memories.
AMD was founded in 1969 and today has more than $2 billion in annual sales, approximately 12,700 employees worldwide, manufacturing facilities in the United States and Asia, and sales offices throughout the world. It makes programmable products and applications solutions executed in submicron complementary metal-oxide semiconductor silicon. These components are used by the manufacturers of computers and communications equipment.
AMD's success is based on its proprietary submicron process technology and its other unique manufacturing processes. The company conducts product design, process-technology development, and wafer-fabrication activities at its Submicron Development Center in Sunnyvale. AMD also performs high-volume wafer fabrication in Austin, Texas, and at its joint-venture fabrication facility, Fujitsu AMD Semiconductor Ltd., in Aizu-Wakamatsu, Japan. It is building an additional manufacturing plant in Austin and has plans for a new facility and microprocessor design center in Germany.
AMD began by implementing its disaster recovery plan at Sunnyvale, the newer, state-of-the-art facility, and later expanding it to the Austin facility. That effort was spearheaded by Dan Perry, manager of technology services for AMD's computer-integrated manufacturing at Sunnyvale. The following refers generally to how the plan was put in place at the Sunnyvale site.
THE SITE. AMD's plant operates twenty-four hours a day, seven days a week. A centralized VAX computer system supports approximately 800 to 900 concurrent users (some "users" are employees and some are automated steps in the manufacturing process). The manufacturing environment includes floor control systems to ensure that the right components get to the right place at each step in the process, referred to as the shop floor control package.
AMD's work-in-progress inventory control enables the company to know exactly where a specific batch of chips is, how long it has been in the manufacturing process, its next location, and so forth. It also creates the records that management needs to calculate chip value.
The chip manufacturing process generally consists of more than one hundred steps, during which hundreds of copies of an integrated circuit are formed on a single wafer. The process generally involves the creation of eight to twenty patterned layers on and into the substrate - the foundation of a circuit - ultimately forming the complete integrated circuit. This layering process creates electrically active regions in and on the wafer surface.
The core of chip manufacturing is the fabrication area, where the integrated circuit is formed in and on the wafer. The fabrication process involves a series of functions called oxidation, masking, etching, doping, dielectric deposition, metallization, and passivation. Because of the number and complexity of steps in the fabrication process, more time and labor are invested here than in any other part of chip manufacturing. It usually takes ten to thirty days to complete the fabrication process. If the computer were to go down at any time during this continuous process, chips still in development could be wiped out (because they are not stable until completed), creating a substantial financial loss.
Although AMD had (and still has) a disaster recovery plan in place for its corporate systems and an outside disaster recovery expert to help execute the plan in an emergency, the company had not implemented a disaster recovery plan at its fabrication facilities for the same reasons many other computer-controlled manufacturing companies have not: it is a complex and expensive process; it is also intimidating to plan for a "doomsday" event. And the company had never actually had an emergency that caused its computer systems to go down long enough to wipe out chips in progress.
AMD had taken precautions to ensure that manufacturing data, such as information about inventory on hand and where materials are in the process, was protected. It should be noted here that, although much of the manufacturing process itself is automated, the data recording the completion of each step is entered manually by operators and is, therefore, labor intensive. The chip manufacturing process includes many highly computerized but standalone, "dumb" units that have no way to signal back to the master computer that a process has been completed successfully or what was completed. An operator keys in the required information during the fabrication process. Because that process involves thousands of transactions, re-creating all of those transactions manually after an emergency shutdown would take days.
To protect that data, the facility backed up (as it still does) 160 gigabytes of disk storage on a weekly schedule that ran from Friday to Sunday. On Tuesday the tapes were moved to an off-site vault, where they were retained in case lost data had to be recovered.
Copies of these tapes were also retained on site in the computer center. Incremental backups were taken daily from Sunday to Thursday, but those tapes remained in-house and would therefore be vulnerable if the local tape repository were damaged.
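The exposure in that schedule can be reduced to a small calculation. The following is a minimal sketch, not AMD's actual procedure. It assumes that the weekly full backup reflects the data as of Sunday, that those tapes have reached the vault by the time of any incident on Tuesday or later, and that the incident destroys everything kept on site, including the incremental tapes. The function name and the day constants are illustrative.

```python
# Days of the week, indexed from the day the weekly full backup completes.
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

FULL_BACKUP_DAY = "Sun"   # assumption: full backup reflects data as of Sunday
VAULT_ARRIVAL_DAY = "Tue"  # tapes are relocated off site on Tuesday


def offsite_exposure_days(incident_day: str) -> int:
    """Days of data that exist only on site if an incident hits on incident_day.

    Incremental tapes never leave the building, so the newest usable
    recovery point is the most recent full backup already in the vault.
    """
    incident = DAYS.index(incident_day)
    backup = DAYS.index(FULL_BACKUP_DAY)
    vault = DAYS.index(VAULT_ARRIVAL_DAY)
    if incident >= vault:
        # This week's full backup is already in the vault.
        return incident - backup
    # This week's tapes are still on site; fall back to last week's set.
    return incident - backup + 7


if __name__ == "__main__":
    for day in DAYS:
        print(day, offsite_exposure_days(day))
```

Under these assumptions the exposure is largest just before the tapes leave the building, when the newest set in the vault is more than a week old.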
In reevaluating its exposure, the company decided that this level of security was no longer adequate. It would, as explained, be costly and time consuming to re-create all of those transactions manually after an emergency shutdown. In addition, the company is in an industry that has no excess capacity anywhere, and therefore, in an emergency, AMD would not have the option of moving its fabrication center to another site.
STRATEGY SELECTION. After AMD management reconsidered Perry's presentation of the risks that the company faced without a disaster recovery plan for its manufacturing operations, Perry was given the go-ahead to get the process rolling. Realizing that he lacked disaster recovery expertise and the time needed to put together a program, Perry decided to hire an outside consulting firm to help formulate a plan.
After a three-month search, AMD selected SunGard Planning Solutions, the consulting unit of SunGard Recovery Services Inc. The consultants assigned to AMD were the author and a colleague, Greg Valentine, who acted as SunGard project manager and lead consultant. Perry retained the lead role as project manager for the disaster recovery plans for both plants, setting the direction for what needed to be accomplished, garnering management support, and connecting the consultants with the appropriate people within the manufacturing, facilities, security, and data processing departments.
Business impact analysis. Typically, consultants advising a company in the implementation of a disaster recovery plan will suggest a range of strategies determined by the company's risk tolerance and budgetary considerations. To determine which strategies would fit AMD's financial and risk profile, the consultants conducted a business impact analysis (BIA) of the company's most critical functions and the applications that support them. SunGard then identified the impacts and risks of losing essential functions and evaluated the company's recovery capability needs to safeguard critical operations.
An important outcome of the BIA is to establish the "maximum acceptable outage duration" that the company can withstand before it will experience an unacceptable long-term effect on its operations.
The consultants also reviewed security practices and procedures to identify where the company's operations were vulnerable and analyzed its computer center to identify the most cost-effective methods for correcting those vulnerabilities.
As a result of the BIA, SunGard and AMD determined that the maximum acceptable outage duration for AMD's manufacturing operations was four hours. If the computer systems driving AMD's production of chips went down for longer than that, the result would be a possible loss of product, a significant loss of revenue, and ultimately, a negative effect on customer satisfaction.
In addition, the BIA process for AMD established a list of critical applications that had to be available on backup computers. The application prioritization list contains the applications that must be available to users in the fabrication center to keep it running while the main computer systems are being recovered.
The BIA identified and documented the interdependencies between systems, applications, and procedures to ensure that the recovery needs of users could be met and that no omissions would occur.
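Documented interdependencies of this kind translate directly into a recovery sequence: nothing is restored before the systems it relies on. The sketch below illustrates the idea with Python's standard graphlib module (Python 3.9 or later). The dependency map is hypothetical; the shop floor control and work-in-progress inventory systems are named in this article, but the "network" and "database" entries and all of the relationships between them are assumptions made for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists what must be running first.
DEPENDS_ON = {
    "network": set(),
    "database": {"network"},
    "shop_floor_control": {"database", "network"},
    "wip_inventory": {"database"},
}


def recovery_order(depends_on):
    """Return a restore order in which every system follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    print(recovery_order(DEPENDS_ON))
```

A circular dependency in the map raises graphlib.CycleError, which is itself a useful finding for a planning team, since it means the documented interdependencies cannot be satisfied by any recovery sequence.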
Acceptable impact level. Once management had endorsed the information obtained during the BIA, the next step was to develop strategies for providing an acceptable "impact" level or level of exposure. The strategies would take into account the information obtained from the work sessions, as well as three basic costs: basic implementation costs for buying and installing new equipment; ongoing costs, such as service fees from the telephone company for new lines that needed to be installed; and operational costs for procedures that had to be put in place to implement a disaster recovery plan.
AMD's maximum outage level of four hours and its other requirements meant a limited number of options for disaster recovery. SunGard presented AMD with two possible strategies.
Strategy A stated that AMD should establish an internal recovery capability. Specifically, AMD should procure backup computer systems and connect them via redundant pathing into the existing cluster and network infrastructure.
A backup system would allow the primary computer system to mirror or shadow its critical data across to the backup system, which would prevent the loss of any critical data during an incident. This option guarantees availability of the backup computer systems for disaster recovery purposes. But this strategy assumes that no physical damage to the fabrication or subfabrication sites has occurred. If physical damage has occurred, a longer-term recovery not covered by this disaster recovery program would be necessary.
Strategy B, favored by the consultants, was less costly. Under this option, AMD would subscribe to a commercial hot site that would provide the computer systems, resources, and telecommunication network connections required to meet AMD's maximum acceptable outage duration of four hours.
In this scenario, AMD could maintain copies of critical data at the hot site, allowing the hot site to take the place of the main computer center or network without losing an unacceptable amount of data whenever an incident occurred. This approach would be less expensive because it did not require the company to set up its own mirror system.
Because AMD wanted a tighter disaster recovery capability to reduce its risk, management chose Strategy A. With that decision made, it was time to implement the strategy by developing AMD's disaster recovery plan.
IMPLEMENTATION. Working on the theory that, if the fabrication facility is unharmed in an incident, the disaster recovery computer is probably unharmed as well, AMD wanted an internal reciprocal arrangement for disaster recovery in which alternative sites could be set up within AMD to provide recovery. Management determined that the backup computer environment could be placed in a subarea immediately beneath the manufacturing floor. This approach was feasible because AMD operates in a single cluster environment (centralized computer) rather than a multiple cluster environment (series of local or wide area networks).
This internal reciprocal configuration offers several advantages: it gives the company total control of the environment; it reduces the lag time for switching over to the backup system; and it can lower network costs.
The major downside to an internal reciprocal approach is that the number of concurrent users is limited in an emergency. However, this problem is strictly a capacity issue and can be resolved simply by adding computer horsepower.
In the event of a serious disruption, AMD's disaster recovery plan now uses a switch-over process that involves redirecting all production activity from the primary systems to the backup disaster recovery system located in the subfab center.
Only a small percentage of the staff work on computers in the data center, however. Others work in a more typical office environment in other parts of the main AMD facility or nearby buildings. In an emergency, it would not make sense for those workers to go to the subarea where the backup computers are. Instead, they would be relocated elsewhere. AMD's plan uses excess space within its own campus as an emergency operations area for those employees.
Another part of plan implementation concerns information protection analysis. The initial BIA looked at information on a more cursory basis. Once the plan is being implemented, the analysis must be more detailed to determine, for example, how and how often information should be backed up; where, how, and when it should be stored in an off-site location; and how many versions should be maintained off-site.
The findings are examined in the context of the recovery time frame. The question for AMD was whether backup procedures could be carried out in the desired four hours.
The disaster recovery program includes overall incident management plans, work group recovery plans and teams, and alternative facility options as well as recovery testing, staff training, and maintenance programs. It covers the entire "who, what, where, when, how, and why" of recovery and provides the company with the equipment, highly detailed procedures, communications, records, supplies, workspace, training (and cross-training), and personnel it needs in the event of an emergency.
Procedures are broken down to address the incident management team, applications team, operations team, systems team, and network team. The incident management team, for example, provides for event recognition within thirty minutes to one hour of occurrence. Event recognition procedures cover both traditional and nontraditional causes of disasters and disruption.
The incident commander, who is responsible for overall management of recovery activities along with the team leaders of each group, has the authority to declare a disaster and mobilize the recovery teams into immediate action. That step initiates the recovery procedures section of the team plans.
The operations team is responsible for assisting in event recognition, problem resolution, and resource mobilization, as well as for activating the crisis management center, recording and monitoring recovery activities, and implementing operational recovery activities such as system switch-over.
The network and systems teams also participate in event recognition, problem resolution, and resource mobilization duties, with their other recovery functions closely matching their ongoing responsibilities. For example, the person who monitors connectivity of the communications network day to day would be responsible for detecting a disruption and signaling that it was a potential incident.
READINESS. Once a recovery program is developed, tested, and approved, management must keep in mind that any change to the company's procedures, technology, information processes, or key personnel could affect the program's future effectiveness. On the other hand, the worst thing that can happen to a plan is that it requires a great deal of change and maintenance, causing personnel to ignore the plan and rendering it practically useless should a disaster occur. To prevent that kind of situation, part of the planning process involves identifying how often the plan needs to be reviewed and exercised.
SunGard recommends reviewing the plan quarterly, or at least semiannually, and exercising it semiannually, or at least annually, with each exercise including a mock disaster scenario that reinforces training. If a change occurs in the production environment, the plan should be revised and tested immediately.
To keep the AMD disaster recovery plan current, each team plan contains a section on preparedness and maintenance. AMD has held two semiannual tests to date. During those tests, problem recognition, disaster recovery plan activation, and a complete switch-over were accomplished successfully within the maximum acceptable outage duration. In fact, Perry's team significantly beat the four-hour time frame during testing, coming in at under an hour.
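Scoring an exercise of this kind comes down to adding up the phases and comparing the total with the limit set by the BIA. The sketch below shows one way to do that. Only the four-hour limit comes from the plan described here; the phase timings are placeholder values chosen for illustration, not measurements from AMD's tests, and the function names are hypothetical.

```python
from datetime import timedelta

# Four-hour maximum acceptable outage duration established by the BIA.
MAX_ACCEPTABLE_OUTAGE = timedelta(hours=4)


def total_outage(phases):
    """Add up the elapsed time of each recovery phase."""
    return sum(phases.values(), timedelta())


def within_limit(phases, limit=MAX_ACCEPTABLE_OUTAGE):
    """True if the whole recovery fits inside the acceptable outage window."""
    return total_outage(phases) <= limit


# Placeholder timings for illustration only (not AMD's measured results).
exercise = {
    "event recognition": timedelta(minutes=30),
    "plan activation": timedelta(minutes=10),
    "switch-over": timedelta(minutes=15),
}

if __name__ == "__main__":
    print(total_outage(exercise), within_limit(exercise))
```

Recording the phases separately, rather than a single total, shows a planning team where the time actually goes when a later exercise runs closer to the limit.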
A computer disaster recovery plan protects a company from the possibility of staggering losses. No matter how a company chooses to go through the process, a disaster recovery plan is a vital component of the security management program.
James A. MacMicking is the operations director for the western region of SunGard Planning Solutions Inc., the consulting and software unit of SunGard Recovery Services Inc. He is in Austin, Texas.
Title Annotation: Advanced Micro Devices' disaster recovery plan for its manufacturing processes
Author: MacMicking, James A.
Date: Jun 1, 1996