Transients' effect on the reliability of programmable electronics.
It has been said that we are living in the age of microelectronics and computers. They are present in almost every electronic product and system and are also used heavily in products which are not normally classified as electronics. These items range from washing machines to automobiles. All these products and systems have one thing in common. Their electronics are mostly based on microelectronics hardware and their operations are programmed by software. In other words, they are programmable electronics. Programmable electronics' reliability depends not only on the reliability of the constituent hardware and software but also on the ambient physical environment.
Electronic hardware is inherently more reliable than most mechanical equipment because it is free of wear and tear. From the advent of the transistor in the late 1940s to the latest million-transistor microprocessor chips, the reliability of microelectronics has improved steadily. The device failure rate follows a Weibull distribution in early life, followed by a very long useful life with a constant failure rate, typically ranging from a few to a few hundred ppb (failures per billion device-hours). Thus, electronic hardware is rarely responsible for failures, even in very complex computer systems.
As the processing power of microprocessors (measured in millions of instructions per second, or MIPS) increases, software complexity also increases to harness that power for better performance and more functions. While the control program for a washing machine may be just a few thousand lines of instructions, it is not unusual nowadays to find software with a million lines, even in personal computers. Software of such complexity also controls modern telephone exchanges, aeroplanes and non-stop computers for banking and finance. The proliferation of programmable electronics gives rise to concern over the risks of software[4]. To contain those risks, structured programming, software quality assurance and fault-tolerance techniques are increasingly being used[1,3]. In a survey of well-debugged programs, MTTFs ranging from 1.6 years to 5,000 years were reported.
It is well known that temperature and humidity affect the reliability of electronics. The methods to reduce their detrimental effects are also well known. One aspect of the physical environment, however, is not widely known, although it is gaining recognition as one of the most serious elements that affects electronics in general and programmable electronics in particular. This is the susceptibility of electronics to electromagnetic interference (EMI), which is also known as radio frequency interference (RFI). In short, EMI affects programmable electronics' reliability through interaction with the hardware and software.
The transient is one particular form of EMI and a major cause of failures in programmable electronics. It is a short burst of electromagnetic energy that enters victim equipment by conduction on cables and other conductors, or by electromagnetic radiation. Strong transients can cause permanent physical damage, while weak transients cause only transient faults that involve no physical damage. Nevertheless, transient faults can still cause havoc in the operation of programmable electronics. Since there is no evidence of physical damage, failures due to transient faults are often confused with software faults and lead failure analysis in the wrong direction. For these reasons, transients and their effects need to be better understood by those responsible for product quality and reliability.
Susceptibility of programmable electronics
EMI is part of the physical environment and is either natural or man-made. Its many sources include lightning, radio transmitters, motors, electrical circuit-breakers, electrostatic discharge and personal computers. EMI can be transient in duration, as produced by lightning, or continuous, as produced by broadcast radio. It is either conducted along cables, such as power and data interface cables, or radiated through the atmosphere. If suitable countermeasures are not taken, sensitive electronic equipment will suffer interference, the result of which may be temporary or permanent loss of performance. Owing to the proliferation of electrical and electronic apparatus, particularly computing devices, man-made EMI has been on the increase. The situation became so serious that in the early 1980s regulations were imposed internationally to limit the EMI that computing devices, or information technology equipment, may emit.
Limiting the EMI emission of computing devices, however, does not eliminate EMI completely. There remain the natural and other man-made sources, such as lightning and electrical circuit-breakers. They produce interference that lasts only a short time, ranging from nanoseconds to milliseconds. It enters electronic equipment via power and interface cables, or couples into the equipment as transient electromagnetic radiation. The results are transient voltages and currents, transients for short, in the electronic hardware.
While programmable electronics are not the only potential victims of transients, they are especially susceptible to this form of EMI and the possible failures could be more serious. Take, for example, a radio receiver with no programmable electronics; the effect of a lightning strike nearby may be just a clicking noise added to the received signal. On the other hand, transients caused by lightning may lead a traffic light controller, controlled by programmable electronics, into an unsafe state, such as turning all the green lights on. Similarly, a financial transaction computer may enter a piece of wrong data with serious financial consequences.
In view of the ever-increasing use of electronics, especially programmable electronics, and the concern about their safety and reliability, international standards on immunity against EMI have been adopted and will be imposed, from 1996 onward, on products sold in the European Community. European Norm 50082-1 will cover residential, commercial and light-industrial environments, while 50082-2 will cover industrial environments. In other words, mass-produced apparatus as well as industrial, scientific and medical equipment will be affected.
Transients and failures
Transients can cause faults, which in turn can cause errors and errors can cause failures. Following the computer community, the definitions of these terms are given below[2,7]:
* A fault is an incorrect state of hardware or software resulting from failures of components, physical interference from the environment, operator error, or incorrect design.
* An error is the manifestation of a fault within a program or data structure. It is a deviation from accuracy or correctness.
* A failure is the non-performance of some action that is due or expected.
Faults can be classified into three types: permanent, intermittent and transient (often intermittent and transient are not differentiated in usage). A permanent fault exists indefinitely until it is corrected by repair to the hardware. An intermittent fault appears, disappears and reappears repeatedly. It is due to impaired physical conditions of the hardware and can be repaired by part replacement or correction. A transient fault appears and disappears within a very short period of time and involves no damage to the hardware.
Transients caused by EMI are a major, but not the only, cause of transient faults. Electrostatic discharge (ESD) can also cause transient faults through its concomitant electromagnetic radiation, and defective software has been identified as another major source of transient faults in software-intensive systems. According to case studies of mature, well-debugged systems, transient faults account for more than 80 per cent of all failures observed[2,8,9].
The mechanisms of how transients produce failures are very complicated. Simply stated, they depend first on the physical interaction between the hardware and the sources of transients and, second, on the states of the software at the times that the transient faults occur. Three more factors complicate the situation further and make transient faults and their associated failures so hard to deal with. These factors are now discussed.
Probability of failure
A transient does not always cause a failure. Transients are inherently random in nature: their frequency of occurrence, waveforms and strength are random variables. Thus, a transient may or may not produce a fault. When it does produce a fault, it could be a transient, an intermittent or a permanent fault, depending on its strength and waveform (Figure 1).
Even when a transient fault occurs, an error does not always result. For instance, if a fault forces a data bit to 1 when it should be 0, an error occurs; if the bit is already a 1, the same fault causes no error. Similarly, an error does not always end in a failure: a transient may corrupt data, but if the data are never read and used, or are overwritten by correct data, no failure can result. Failure due to transients is therefore a highly complicated and random process.
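The fault-error-failure distinction can be made concrete with a toy memory model. The sketch below is entirely hypothetical and for illustration only; it models a transient that forces one memory bit to 1, and shows that no error results when the bit already holds that value, and no failure results when a corrupted word is rewritten before it is read.

```python
def transient_sets_bit(word, bit):
    """Model a transient fault that forces one memory bit to 1."""
    return word | (1 << bit)

stored = 0b1010

# The transient hits bit 1, which is already 1: the stored value is
# unchanged, so the fault produces no error.
assert transient_sets_bit(stored, 1) == stored

# The same transient hitting bit 0 (currently 0) does produce an
# error: the stored value is now wrong.
corrupted = transient_sets_bit(stored, 0)
assert corrupted != stored

# But if the corrupted word is overwritten with correct data before
# any read, the error never propagates to a failure.
corrupted = stored  # correct data written back
assert corrupted == stored
```

This mirrors the argument in the text: the same physical fault may or may not become an error, and the same error may or may not become a failure, depending on the state and subsequent use of the data.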
In the case of a very strong transient with very fast rise and fall times the induced transient voltages and currents in the hardware will also be high and distributed extensively throughout the hardware. This leads to a very high probability for transient-induced failure. This could also happen with less severe transients if the hardware design is very poor and, therefore, highly susceptible to transients. Otherwise, the probability of failure due to transients will be low and could be modelled by a Poisson random process of rare events as below.
The software executed by the hardware is characterized by time intervals during which it is susceptible to transient faults. These intervals may be called susceptible windows: for example, the intervals during which crucial data are being transferred between the processing unit and a memory or input/output device. Typically, these windows represent a small fraction of the total observation time, so random transients striking within susceptible windows can be treated as rare events, and the probability of developing a transient failure can be calculated easily.
Let the observation time be T, within which there are m identical susceptible windows, each of duration t. The probability, f, of developing at least one failure, when the equipment in question is subjected to n random transient faults, occurring one at a time and uniformly distributed throughout T, is given by:
f ≈ 1 − exp(−nmt/T),

provided that the following condition is met:

mt ≪ T.
The implications of this expression are obvious: the more frequent the transient faults, or the occurrences of susceptible windows within a fixed period of time, the higher the probability of developing a failure; and the fewer the susceptible windows, the longer it will take to develop a transient-induced failure.
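As a numerical illustration, the rare-event expression above can be checked against a direct Monte Carlo simulation. The parameter values below are hypothetical, chosen only so that the condition mt ≪ T holds.

```python
import math
import random

def failure_probability(n, m, t, T):
    """Rare-event approximation from the text:
    f ~= 1 - exp(-n*m*t/T), valid when m*t << T."""
    return 1.0 - math.exp(-n * m * t / T)

def simulate(n, m, t, T, trials=100_000, seed=1):
    """Monte Carlo estimate: n transients land uniformly in [0, T);
    a run fails if any transient falls inside one of the m
    susceptible windows (total susceptible time m*t)."""
    rng = random.Random(seed)
    window_fraction = m * t / T  # chance a single hit lands in a window
    failures = 0
    for _ in range(trials):
        if any(rng.random() < window_fraction for _ in range(n)):
            failures += 1
    return failures / trials

# Hypothetical values: T = 1,000 h of observation, m = 50 windows of
# t = 0.01 h each, and n = 20 random transient faults, so nmt/T = 0.01.
f_analytic = failure_probability(n=20, m=50, t=0.01, T=1000.0)
f_mc = simulate(n=20, m=50, t=0.01, T=1000.0)
```

With these values the analytic approximation gives roughly one per cent, and the simulation agrees closely, since the total susceptible time (0.5 h) is indeed a tiny fraction of T.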
An error does not always cause a failure immediately; in some cases it may take a long time to do so. The time between the occurrence of an error and its associated failure is called the error latency. Take, for example, a piece of data corrupted as it is written into a memory device: a failure will not occur until the data are retrieved and used by the processor. Until then the error is dormant and undetected. Such an error is called a latent error and can be likened to a computer virus.
As pointed out earlier, transients are random in nature. When a failure occurs and is detected, the source of the transients could have disappeared or become quiescent for a long time. This makes troubleshooting and tracing the origin of the failure extremely difficult. Transient faults could produce many different failures and some of them seldom repeat. During the product-development stage, transient faults could often be masked by the more dominant hardware and software faults. All these factors could lead the engineers to wrong conclusions when diagnosing failures.
Design against transients
For the above reasons, a defensive design strategy is needed to combat transient faults and so achieve reliability. Such a strategy can be implemented at several levels. The first and most fundamental level is the hardware: shielding, proper grounding of cables, filtering, good circuit-board layout and the installation of transient absorbers are essential techniques for fault avoidance.
The next level is the software and data structure, where the design objective is fault tolerance. The purpose of fault tolerance is to prevent faults from leading to errors and errors from leading to failures. Many techniques are used to achieve it, e.g. error-correction coding, redundancy and performance monitoring[2,7]. Some techniques employ only software, while others use software together with additional hardware.
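As one deliberately simplified sketch of the redundancy technique mentioned above, triple modular redundancy (TMR) performs a computation three times and takes a majority vote, so that a single transient-corrupted result is outvoted. The function names here are illustrative, not drawn from any particular standard or library.

```python
from collections import Counter

def tmr(compute, inputs):
    """Triple modular redundancy: run the computation three times and
    return the majority result. A transient fault corrupting one of
    the three results is masked by the vote of the other two."""
    results = [compute(inputs) for _ in range(3)]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: more than one result corrupted")
    return value

# Illustrative use: the third run suffers a simulated transient
# bit-flip, but the majority vote still yields the correct answer.
runs = iter([42, 42, 42 ^ 0x04])  # XOR flips one bit in the last run
result = tmr(lambda _: next(runs), None)
# result == 42
```

The sketch also illustrates the caveat that follows: TMR masks only a single faulty result, and the voter itself is extra machinery that can in turn be disturbed by transients.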
However, it is important to realize that no fault tolerance technique gives 100 per cent fault coverage. Some errors may not even be detectable, so failure could still occur despite fault tolerance techniques. The additional hardware and software to implement fault tolerance could also fail due to transient faults. Moreover, fault tolerance techniques often mean an additional workload which slows down system performance.
To test the adequacy of a design, a transient simulator should be used in prototype testing. The objective is to force the equipment under test into failure so that its weak points can be discovered and remedied. The IEC standard 801-4 recommends test severity levels, procedures and equipment for susceptibility tests[11]. In adopting this standard it should be recognized that no simulation test can reproduce perfectly the conditions that will be encountered in the field. Nevertheless, performing susceptibility tests in the laboratory during the development stage of a product is still of great value, as errors and deficiencies can be detected earlier and corrected at less cost.
Finally, it should be noted that transient faults cannot be prevented by better manufacturing practices. Unlike electrostatic discharge, transient faults have nothing to do with manufacturing, process control, or material handling, so reliability of a piece of equipment in a harsh electromagnetic environment must be achieved in the design.
Risks and costs
Here we consider the potential risks and costs of neglecting transients' effects on reliability. Figure 2 shows a typical design-manufacture-operate process flow. A supplier should be alert to the potential problem and consider it at the initial specification phase, where the immunity level to transients should be defined clearly. In a contract-manufacture situation, the supplier should bring the potential problem to the customer's attention and require that the intended operational electromagnetic conditions and immunity level be stated explicitly in the initial specifications. Failing to take these precautions may result in the following consequences (Figure 2):
* During the prototyping and testing phase, the engineers may by chance discover the transient problems and would redesign the prototype, incorporating fault avoidance or fault tolerance. At this phase the redesign would cost extra engineering time.
* The transient problems may go undiscovered during the prototyping and testing phase. This can easily happen because transient faults are rare in laboratories, whose electromagnetic ambience is usually benign. Even when one is discovered, there is often a tendency to ignore it, since it is a transient and often non-repeatable phenomenon. The next likely opportunity to discover the problem is after the design is frozen and released for manufacturing. The electromagnetic ambience in a manufacturing plant tends to be harsher than in laboratories, so transient faults are more likely to manifest themselves. At this stage, however, the redesign cost will have escalated significantly: besides extra engineering time there will be material scrap and extensive revision of documents. Redesign will require some or all of the following measures: re-layout of printed circuit boards, re-routeing or changing the types of cables and wires used, addition of components for EMI suppression and, sometimes, modification of the software for fault tolerance.
* The transient problems may remain undiscovered until field testing. At this stage the customers become involved. Faced with failures that are extremely hard to diagnose, for the reasons given earlier, the supplier-customer relationship will be strained. The elusiveness of the sources of transients means many trips to the field by the engineers. Material costs and man-hour overruns escalate further compared with earlier discovery. The flexibility for a redesign is considerably reduced because the time and finance involved have not been budgeted.
* It is possible that even field testing does not expose an inherent susceptibility to transients. One possible reason is insufficient test duration: as explained above, a transient failure depends on the rare concurrence of a transient and a susceptible window, so over a short period failures associated with transients may not develop (another possible reason is long error latency). The error thus remains dormant during the entire field test. After the field test and acceptance, perhaps after a long time, the error becomes active and a failure develops. For a safety-critical system, or a system handling large sums of money, the consequences could be serious and result in societal loss.
Given the serious consequences that transients can have for programmable electronics, management must be alert to the potential problem and take the necessary steps to contain transients' effects on reliability. Following the ISO 9001 standard on quality systems[12], management's responsibility should include at least the following:
* Define all the personnel at various levels and functions who will be responsible for ensuring that the specifications, design, testing and installation do take transients into account.
* Review contracts or product specifications to ensure that the intended operational electromagnetic environment is well defined. If the latter is not defined by the customers then relevant standards should be followed.
* Help to set, as a design objective, the immunity levels of the equipment in question against defined transients.
* Ensure that all test plans include transient susceptibility tests with defined procedures. Susceptibility tests must be performed in the development stage as well as final and field testing.
* Review the design and test records and check if the immunity design objective is achieved.
* Ensure that service records reflect any incidence of failures due to transient faults.
* Establish a document that records the objective, plans and results pertaining to the above points.
Although implementing these points entails additional cost to the supplier, that cost should be weighed against the potential loss due to negligence which, as explained earlier, could be extremely high for the supplier and possibly for society.
The nature of transients and the associated failure mechanisms in programmable electronics have been discussed, and the importance of design and of management's role with regard to transients has been stressed. In view of the pending European regulations on immunity against EMI, and the possibly serious consequences of ignoring the issue, management must not neglect transients' effects on product reliability. It must take the lead and ensure the reliability of products in their intended operational environments.
1. Irland, E.A., "Assuring quality and reliability of complex electronic systems: hardware and software", Proceedings of the IEEE, Vol. 76 No. 1, January 1988, pp. 5-18.
2. Siewiorek, D.P. and Swarz, R.S., Reliable Computer Systems Design and Evaluation, 2nd ed., Digital Press, Geneva, 1992.
3. Avizienis, A. and Laprie, J., "Dependable computing: from concepts to design diversity", Proceedings of the IEEE, Vol. 74 No. 5, May 1986, pp. 629-38.
4. Littlewood, B. and Strigini, L., "The risks of software", Scientific American, November 1992, pp. 38-43.
5. Ott, H.W., Noise Reduction Techniques in Electronic Systems, Wiley, New York, NY, 1988.
6. Davies, J., "The European (CENELEC) generic immunity standards", EMC Test and Design, November-December 1992, pp. 49-50.
7. Johnson, B., Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, Reading, MA, 1988.
8. Iyer, R.K. and Rossetti, D.J., "A measurement-based model for workload dependence of CPU errors", IEEE Transactions on Computers, Vol. C-35 No. 6, June 1986, pp. 511-19.
9. Duba, P. and Iyer, R.K., "Transient fault behavior in a microprocessor, a case study", Proceedings of IEEE International Conference on Computer Design, 1988, pp. 272-6.
10. Papoulis, A., Probability and Statistics, Prentice-Hall, Englewood Cliffs, NJ, 1990, Chapter 3.
11. International Electrotechnical Commission (IEC), IEC 801-4 Electromagnetic Compatibility for Industrial-process Measurement and Control Equipment, Part 4: Electrical Fast Transient/Burst Requirements, IEC, 1988.
12. International Organization for Standardization (ISO), ISO 9001 Quality Systems - Model for Quality Assurance in Design/Development, Production, Installation and Servicing, ISO, Geneva, 1987.
Tang, H.K. and Er, M.H., "EMI-induced failure in microprocessor-based counting", Microprocessors and Microsystems, Vol. 17 No. 4, 1993, pp. 248-52.
Authors: Tang, H.K. and Lee, Brian
Publication: International Journal of Quality & Reliability Management, 1 February 1996