Finding fault: the formidable task of eradicating software bugs.


Finding Fault

Sitting 70 kilometers east of Toronto on the shore of Lake Ontario, the Darlington Nuclear Generating Station looks much like any other large nuclear power plant of the Canadian variety. But behind its ordinary exterior lies an unusual design feature.

Darlington is the first Canadian nuclear station to use computers to operate the two emergency shutdown systems that safeguard each of its four reactors. In both shutdown systems, a computer program replaces an array of electrically operated mechanical devices -- switches and relays -- designed to respond to sensors monitoring conditions critical to a reactor's safe operation, such as water levels in boilers.

When completed in 1992, Darlington's four reactors will supply enough electricity to serve a city of 2 million people. Its Toronto-based builder, Ontario Hydro, opted for sophisticated software rather than old-fashioned hardware in the belief that a computer-operated shutdown system would be more economical, flexible, reliable and safe than one under mechanical control.

But that approach carried unanticipated costs. To satisfy regulators that the shutdown software would function as advertised, Ontario Hydro engineers had to go through a frustrating but essential checking process that required nearly three years of extra effort.

"There are lots of examples where software has gone wrong with serious consequences," says engineer Glenn H. Archinoff of Ontario Hydro. "If you want a shutdown system to work when you need it, you have to have a high level of assurance."

The Darlington experience demonstrates the tremendous effort involved in establishing the correctness of even relatively short and straightforward computer programs. The 10,000 "lines" of instructions, or code, required for each shutdown system pale in comparison with the 100,000 lines that constitute a typical word-processing program or the millions of lines needed to operate a long-distance telephone network or a space shuttle.

Computer programs rank among the most complex products ever devised by humankind, says computer scientist David L. Parnas of Queen's University in Kingston, Ontario. "They are also among the least trustworthy," he contends.

"These two facts are clearly related," says Parnas. "Errors in software are not caused by a fundamental lack of knowledge on our part. In principle, we know all there is to know about the effect of each instruction that is executed. Software errors are blunders caused by our inability to fully understand the intricacies of these complex products."

Practically no one expects a computer system to work the way it should the first time out. "A new chair collapses, and we're surprised," Parnas says. In contrast, "we accept as normal that when a computer system is first installed, it will fail frequently and will only become reliable after a long sequence of revisions."

But there are many situations where that kind of performance is unacceptable. Computers that fly military or civilian aircraft, operate medical devices, manage transportation systems and perform crucial safety functions such as air-traffic control must work without fail.

As computer-controlled systems increase in complexity and become ever more deeply embedded in the fabric of society, the potential for costly failures rises. Indeed, some computer experts fear that we are courting disaster by placing too much trust in computers to handle complexities that no one fully understands.

Last November, the Association for Computing Machinery sponsored a meeting in Arlington, Va., on the issue of managing complexity -- finding ways to build computer systems that are both large and trustworthy. "There are tons of issues out there," says Harold S. Stone of the IBM Thomas J. Watson Research Center in Yorktown Heights, N.Y. "This is one we can't get our finger on. We don't have the answer."

Companies throughout the computer and communications industries, including giants such as IBM and AT&T, are having great difficulties developing the next generation of computer products, Stone adds. "We need to go to the next [higher] plateau in automation, and we can barely deal with the plateau that we're on now."

Two case histories -- testing the safety software for Ontario Hydro's Darlington plant and a software error that nearly crippled AT&T's long-distance network -- nicely illustrate this point.

The software glitch that disrupted AT&T's long-distance telephone service for nine hours in January 1990 dramatically demonstrates what can go wrong even in the most reliable and scrupulously tested systems. Of the roughly 100 million telephone calls placed with AT&T during that period, only about half got through. The breakdown cost the company more than $60 million in lost revenues and caused considerable inconvenience and irritation for telephone-dependent customers.

The trouble began at a "switch" -- one of 114 interconnected, computer-operated electronic switching systems scattered across the United States. These sophisticated systems, each a maze of electronic equipment housed in a large room, form the backbone of the AT&T long-distance telephone network.

When a local exchange delivers a telephone call to the network, it arrives at one of these switching centers, which can handle up to 700,000 calls an hour. The switch immediately springs into action. It scans a list of 14 different routes it can use to complete the call, and at the same time hands off the telephone number to a parallel signaling network, invisible to any caller. This private data network allows computers to scout the possible routes and to determine whether the switch at the other end can deliver the call to the local company it serves.

If the answer is no, the call is stopped at the original switch to keep it from tying up a line, and the caller gets a busy signal. If the answer is yes, a signaling-network computer makes a reservation at the destination switch and orders the original switch to pass along the waiting call -- after the switch makes a final check to ensure that the chosen line is functioning properly. The whole process of passing a call down the network takes 4 to 6 seconds. Because the switches must keep in constant touch with the signaling network and its computers, each switch has a computer program that handles all the necessary communications between the switch and the signaling network.

AT&T's first indication that something might be amiss appeared on a giant video display at the company's network control center in Bedminster, N.J. At 2:25 p.m. on Monday, Jan. 15, 1990, network managers saw an alarming increase in the number of red warning signals appearing on many of the 75 video screens showing the status of various parts of AT&T's worldwide network. The warnings signaled a serious collapse in the network's ability to complete calls within the United States.

To bring the network back up to speed, AT&T engineers first tried a number of standard procedures that had worked in the past. This time, the methods failed. The engineers realized they had a problem never seen before. Nonetheless, within a few hours, they managed to stabilize the network by temporarily cutting back on the number of messages moving through the signaling network. They cleared the last defective link at 11:30 that night.

Meanwhile, a team of more than 100 telephone technicians tried frantically to track down the fault. By monitoring patterns in the constant stream of messages reaching the control center from the switches and the signaling network, they searched for clues to the cause of the network's surprising behavior. Because the problem involved the signaling network and seemed to bounce from one switch to another, they zeroed in on the software that permitted each switch to communicate with the signaling-network computers.

The day after the slowdown, AT&T personnel removed the apparently faulty software from each switch, temporarily replacing it with an earlier version of the communications program. A close examination of the flawed software turned up a single error in one line of the program. Just one month earlier, network technicians had changed the software to speed the processing of certain messages, and the change had inadvertently introduced a flaw into the system.

From that finding, AT&T could reconstruct what had happened.

The incident started, the company discovered, when a switching center in New York City, in the course of checking itself, found it was nearing its limits and needed to reset itself -- a routine maintenance operation that takes only 4 to 6 seconds. The New York switch sent a message via the signaling network, notifying the other 113 switches that it was temporarily dropping out of the telephone network and would take no more telephone calls until further notice. When it was ready again, the New York switch signaled to all the other switches that it was open for business by starting to distribute the calls that had piled up during the brief interval when it was out of service.

One switch in another part of the country received its first message that a call from New York was on its way, and started to update its information on the status of the New York switch. But in the midst of that operation, it received a second message from the New York switch, which arrived less than a hundredth of a second after the first.

Here's where the fatal software flaw surfaced. Because the receiving switch's communication software was not yet finished with the information from the first call, it had to shunt the second message aside. Because of the programming error, the switch's processor mistakenly dumped the data from the second message into a section of its memory already storing information crucial for the functioning of the communications link. The switch detected the damage and promptly activated a backup link, allowing time for the original communication link to reset itself.
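
The timing dependence described above can be illustrated with a toy model. The sketch below is not AT&T's code, which this article does not reproduce; the class name `ToySwitch`, the 10-millisecond processing window and the `buggy` flag are all assumptions made for illustration. It shows only why a fault that needs two messages to arrive within a narrow window can stay hidden when test messages are spaced further apart.

```python
from dataclasses import dataclass, field

PROCESSING_MS = 10  # assumed time needed to finish one status update


@dataclass
class ToySwitch:
    """Hypothetical receiving switch; not based on any real switch software."""
    buggy: bool
    busy_until_ms: float = float("-inf")
    pending: list = field(default_factory=list)  # messages set aside safely
    link_ok: bool = True                         # state of the communication link

    def receive(self, message: str, now_ms: float) -> None:
        if now_ms < self.busy_until_ms:
            # A second message arrived before the first was fully processed.
            if self.buggy:
                # Stand-in for the flaw: the message lands on link-critical data.
                self.link_ok = False
            else:
                # Intended behavior: set the message aside until the switch is free.
                self.pending.append(message)
            return
        # Idle switch: start processing and stay busy for PROCESSING_MS.
        self.busy_until_ms = now_ms + PROCESSING_MS


def run(buggy: bool, arrival_times_ms: list) -> bool:
    """Feed messages at the given times; return True if the link survived."""
    switch = ToySwitch(buggy=buggy)
    for i, t in enumerate(arrival_times_ms):
        switch.receive(f"msg-{i}", t)
    return switch.link_ok
```

In this model, `run(True, [0, 50, 100])` leaves the link intact even though the flaw is present, while `run(True, [0, 5])` damages it. A test suite that never sends two messages less than 10 milliseconds apart would pass the flawed version.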

Unfortunately, another pair of closely spaced calls put the backup processor out of commission, and the entire switch shut down temporarily. These delays caused further telephone-call backups, and because all the switches had the same software containing the same error, the effect cascaded throughout the system. The instability in the network persisted because of the random nature of the failures and the constant pressure of the traffic load within the network.

Although the software changes introduced the month before had been rigorously tested in the laboratory, no one anticipated the precise combination and pace of events that would lead to the network's near-collapse.

In their public report, members of the team from AT&T Bell Laboratories who investigated the incident state: "We believe the software design, development and testing processes we used are based on solid, quality foundations. All future releases of software will continue to be rigorously tested. We will use the experience we've gained through this problem to further improve our procedures."

In spite of such optimism, however, "there is still a long way to go in attaining dependable distributed control," warns Peter G. Neumann, a computer scientist with SRI International in Menlo Park, Calif. "Similar problems can be expected to recur, even when the greatest pains are taken to avoid them."

Even a relatively short, simple computer program can prove difficult to check out, as illustrated by the tremendous effort required to ensure the correctness of the software for the Darlington power station.

Darlington's two shutdown systems operate independently, each using different sensors, different shutdown mechanisms and different computers controlled by software written by separate teams. Their sole purpose is to shut the plant down if the values of certain variables exceed preset limits.

Although shutdown systems have a simple task, the computer-based version designed by Ontario Hydro engineers turned out to be significantly more complex than the straightforward, easily inspected mechanical controls it replaced. Complicated pathways and shared data took the place of individual, obviously connected devices.

Officials at the Atomic Energy Control Board (AECB) in Ottawa, Ontario, which regulates and licenses Canadian nuclear power plants, decided they needed outside help in evaluating the software instructions. "There's only so much you can do by reading it line by line -- the usual approach," says AECB's G.J.K. Asmis.

To dig deeper into the reactor's software, AECB turned to Parnas. Known as an outspoken critic of the Strategic Defense Initiative because of its unprecedented reliance on software, Parnas has long argued that computer programmers must take a more disciplined approach to writing software in order to improve its quality and avoid serious flaws. During the 1980s, he and his associates had developed a bank of mathematical techniques for evaluating computer programs.

"When I looked at the code [lines of instructions], it became clear that I couldn't say if it was okay or not," Parnas recalls. "All I could say was that the documentation [explaining the function of each part of the program] was too vague."

For example, consider the specification: "Shut off the pumps if the water level remains above 100 meters for more than 4 seconds." The sentence appears clear -- but what if the water level varied during the 4-second period?

Parnas came up with three different interpretations of this statement, based on different ways of finding the average water level. A programmer could choose only one of the three. Which was correct?

When Parnas checked with the engineers at Ontario Hydro, he discovered that their interpretation, based on long experience with the design of shutdown systems, differed from the three choices he had suggested. This example told Parnas that the specifications for the shutdown software had to be expressed much more precisely.
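
The ambiguity is easy to reproduce. The article does not say which three interpretations Parnas proposed or which one the engineers used, so the three readings below are purely hypothetical stand-ins, and the function names and sample values are invented for illustration. The sketch shows only that plausible readings of the same sentence can disagree on the same sensor data.

```python
from statistics import mean

LIMIT_M = 100.0  # threshold from the example specification


def all_above(samples):
    """Reading 1: every sample in the window exceeds the limit."""
    return all(level > LIMIT_M for level in samples)


def mean_above(samples):
    """Reading 2: the average level over the window exceeds the limit."""
    return mean(samples) > LIMIT_M


def endpoints_above(samples):
    """Reading 3: the level exceeds the limit at the start and end of the window."""
    return samples[0] > LIMIT_M and samples[-1] > LIMIT_M


def verdicts(samples):
    """Return the shutdown decision under each reading for one window of samples."""
    return {
        "all_above": all_above(samples),
        "mean_above": mean_above(samples),
        "endpoints_above": endpoints_above(samples),
    }


# One reading per second over a window; the level dips briefly in the middle.
brief_dip = [101.0, 102.0, 99.0, 103.0, 104.0]
# The level is high only at the two ends of the window.
long_dip = [101.0, 90.0, 90.0, 90.0, 101.0]
```

For `brief_dip`, the first reading says no shutdown while the other two say shutdown; for `long_dip`, only the third reading calls for a shutdown. A programmer who picks one reading silently fixes a behavior that the written sentence never settled.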

The engineers proved reluctant to spend additional time writing more detailed specifications. "We argued that even if the specification wasn't written down in a mathematically precise way, an experienced designer would know what it means," Archinoff says.

However, AECB officials were sufficiently concerned about potential problems that they insisted on a thorough review incorporating a variety of software-checking techniques.

The Ontario Hydro team had already systematically subjected their software to a large number of carefully constructed tests designed to ensure that it functioned properly under a variety of circumstances. But planned tests such as these cover only a fraction of the possible paths through the software, and they often miss subtle cases.

Parnas recommended that Ontario Hydro also try random testing -- for example, by furnishing to the shutdown systems randomly generated sensor data to see how they responded. "That's often more effective than carefully controlled testing," he says.
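
A minimal sketch of random testing follows. It assumes a made-up requirement (trip when the level exceeds 100 or the pressure exceeds 50) and a made-up implementation with a planted flaw; neither comes from the Darlington software, and the variable names, units and ranges are hypothetical. The idea is simply to feed randomly generated readings to the code under test and compare each answer with an independent statement of the requirement.

```python
import random


def spec_should_trip(level_m: int, pressure_kpa: int) -> bool:
    """Reference oracle (hypothetical requirement): trip on either limit."""
    return level_m > 100 or pressure_kpa > 50


def impl_should_trip(level_m: int, pressure_kpa: int) -> bool:
    """Implementation under test, with a deliberately planted flaw:
    the pressure trip is wrongly ignored when the level is 10 or below."""
    return level_m > 100 or (pressure_kpa > 50 and level_m > 10)


def random_test(trials: int, seed: int = 0):
    """Feed randomly generated sensor readings to both functions and
    collect every input on which they disagree."""
    rng = random.Random(seed)  # seeded so a failing run can be reproduced
    failures = []
    for _ in range(trials):
        level = rng.randint(0, 150)
        pressure = rng.randint(0, 80)
        if impl_should_trip(level, pressure) != spec_should_trip(level, pressure):
            failures.append((level, pressure))
    return failures


failures = random_test(trials=1000)
```

Every entry in `failures` is a low-level, high-pressure reading, a corner that a tester who "knows" the level is normally high might never think to try by hand.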

Furthermore, because system designers usually can't guarantee that their specifications cover every possible way in which a system will be used, many now perform a hazard analysis. The idea is to consider all the ways in which a system can fail, and then to work backwards through the hardware and software components to determine what factors could cause such failures. This enables designers to incorporate safeguards that specifically prevent these problems from occurring.
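
One common way to work backwards from a failure is a fault tree, sketched below. The tree, its event names and its structure are invented for illustration and do not describe any real plant. Starting from an unwanted top event, the code finds the smallest combinations of basic failures (the minimal cut sets) that would cause it.

```python
from itertools import combinations

# A hypothetical fault tree: ("AND" | "OR", children...) or a basic-event name.
TREE = ("OR",
        ("AND", "sensor_a_fails", "sensor_b_fails"),
        "trip_logic_error",
        ("AND", "primary_actuator_fails", "backup_actuator_fails"))


def occurs(node, failed):
    """Return True if the event at node occurs given the set of failed basics."""
    if isinstance(node, str):
        return node in failed
    gate, *children = node
    results = [occurs(child, failed) for child in children]
    return all(results) if gate == "AND" else any(results)


def basic_events(node):
    """Collect the names of all basic events under node."""
    if isinstance(node, str):
        return {node}
    return set().union(*(basic_events(child) for child in node[1:]))


def minimal_cut_sets(tree):
    """Find the smallest combinations of basic failures that cause the top event."""
    events = sorted(basic_events(tree))
    cuts = []
    for size in range(1, len(events) + 1):
        for combo in combinations(events, size):
            chosen = set(combo)
            # Skip supersets of a cut set already found; they are not minimal.
            if any(cut <= chosen for cut in cuts):
                continue
            if occurs(tree, chosen):
                cuts.append(chosen)
    return cuts


cuts = minimal_cut_sets(TREE)
```

In this made-up tree the hardware failures only matter in pairs, but the single event `trip_logic_error` is a cut set by itself. That is the kind of finding that tells a designer where an extra safeguard is needed.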

"You have to build safety into software," says Nancy G. Leveson of the University of California, Irvine, who pioneered hazard analysis for software. "Just trying to get it correct isn't enough."

The final stage, involving techniques developed by Parnas and his colleagues to prove mathematically that the software does what the requirements ask, proved both exhausting and exhaustive.

Three separate teams went to work. One examined just the computer program and painstakingly determined what each section of the program actually did. A second team converted the systems' original specifications into precise mathematical statements, written out in the form of tables. Finally, a third team tried to find any mismatch between the mathematically expressed specifications and the program functions determined by the first group, and listed all discrepancies.
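
The third team's job, comparing two tabular descriptions and listing where they differ, can be sketched in a few lines. The tables below are invented and far simpler than anything used at Darlington; the ranges, the action names and the single-variable layout are assumptions made for illustration.

```python
# Each table maps half-open ranges [low, high) of a sensor value to an action.
SPEC_TABLE = [
    (0, 90, "run"),
    (90, 100, "alarm"),
    (100, 151, "trip"),
]

# What a (hypothetical) code review found the program actually does.
PROGRAM_TABLE = [
    (0, 90, "run"),
    (90, 101, "alarm"),   # boundary differs from the specification
    (101, 151, "trip"),
]


def lookup(table, value):
    """Return the action for value, or None if no row covers it."""
    for low, high, action in table:
        if low <= value < high:
            return action
    return None


def discrepancies(spec, program, domain):
    """List every input on which the two tables prescribe different actions."""
    return [
        (value, lookup(spec, value), lookup(program, value))
        for value in domain
        if lookup(spec, value) != lookup(program, value)
    ]


found = discrepancies(SPEC_TABLE, PROGRAM_TABLE, range(0, 151))
```

Here `found` contains one entry: at a value of exactly 100 the specification calls for a trip while the program only raises an alarm. Each row comparison is trivial by itself, which matches the point that the method replaces one very hard review with many simple ones.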

"None of the jobs was fun, but they were doable," Parnas says. "The effect ... was to reduce the extremely complex task of reviewing the system to a large number of relatively simple tasks. The tasks were often dull and tiresome, but the systematic procedure ... made it possible to take breaks and to rotate personnel to prevent burnout."

"They ended up with hundreds of discrepancies, but most were benign," Asmis says. In many cases, reviewers found that the programmers had inserted extra instructions, such as additional safety checks, which were not called for in the specifications. The teams also uncovered a handful of errors. None of the errors proved serious enough to delay or prevent an emergency reactor shutdown.

In the end, despite many minor changes, the two computer programs remained essentially the same as before. "The engineers had put a lot of effort into trying to get it right, and basically they had succeeded," Asmis says.

The entire checkout took about three years. "The checking process had value," Archinoff says. "The problem is that it was extremely costly -- very labor-intensive and time-consuming. In fact, if we had to do it again, using the same methods, we wouldn't use software. We'd go back to using hardware."

Many of the frustrations in the checking process could have been avoided if the software designers had written the programs with review in mind. Ontario Hydro engineers and experts from Atomic Energy of Canada, Ltd. -- designers of the type of nuclear reactor used in Canada -- are now working with AECB to establish standards for future software projects. Then Ontario Hydro personnel will rewrite the Darlington shutdown software to reflect the new requirements.

"You want a program that not only works but also can be understood by more than one or two people," Asmis says.

The Darlington experience with safety-critical software is not yet common in the nuclear industry. In the United States, most existing nuclear power plants use computers only for functions unrelated to plant safety. However, officials at the Nuclear Regulatory Commission in Washington, D.C., believe that software control will inevitably creep into nuclear plant designs, and they are starting to prepare for the task of software evaluation.

Most computer programs don't go through the kind of careful programming and intense scrutiny applied to the Darlington shutdown system or the AT&T telephone network. The process is both costly and time-consuming, and many programmers lack the expertise to use the sophisticated methods necessary for ensuring software reliability.

"Education is important," Stone says. "There are a lot of techniques [for developing reliable software] on the table that are proven and work well, but they still aren't universally practiced."

Moreover, anyone can call themselves a computer programmer and market a software product. "There's no other technology that we depend on to the extent that we depend on software technology that is so unregulated," says software developer John Shore, president of Entropic Research Laboratory, Inc., in Washington, D.C.

It's not surprising, then, that computer programs contain errors and computer systems unexpectedly fail. Often, developers of commercial software work under so much pressure to deliver a product that new programs go out riddled with flaws. "Whether you're a small company struggling to survive or a big company with a big budget, the pressures become enormous, and you end up feeling that you've got to get something out the door to keep the customers satisfied or just to survive," Shore says. "One of the things that saves us is that a lot of customers have come to expect this. They understand how complicated software is."

Indeed, commercial software producers sometimes appear to rely on their customers to do a significant part of the software testing for them. Any user of such software must watch closely for problems and anticipate the possibility of sudden, inexplicable failures.

As computer programs grow larger and more complex, and computer systems keep taking on greater responsibilities, managing the software monster becomes increasingly difficult.

"The problem is intrinsically unsolvable, but you can always do better," Neumann says. "It's a question of system design, of experience, of good software engineering techniques, of recognizing risks, and of continually adapting to a changing environment. There are no easy answers."
COPYRIGHT 1991 Science Service, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 1991, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.


Article Details
Author:Peterson, Ivars
Publication:Science News
Date:Feb 16, 1991
Words:3271