
What are all those nines? A failure probability primer. Just the facts, Jack.


The IT infrastructure of an organization is a complex system of electronic and mechanical components. Users and managers of this infrastructure are concerned with reliability and service. From the customer side, reliability is the answer to the question: "Is this component going to break and, if so, how often?" If perfect reliability cannot be guaranteed, the issue of service arises. Here, a customer will want to know how long it will take and how much it's going to cost to recover from a failure.

Life would be considerably simplified if reliability and service could be purchased in guaranteed, discrete units. In this ideal world, product A might have known reliability such that it fails precisely once every two years while product B fails every three years. Recovering from failure might take four hours for every single failure of product C vs. two hours for each breakdown in product D. Regrettably, outside of this ideal world both reliability and service are hard to quantify.

Reliability Measures

Reliability can be specified either in terms of a failure rate per increment of operation or a number of operations or hours between failures. Annual Failure Rate (AFR) and Mean Time Between Failures (MTBF) are common examples. This specification is based on a model and a set of assumptions about system failure. The validity of this number depends upon these assumptions conforming reasonably closely to reality.

Reliability of What?

The reliability of a system is a function of its parts. These components have sub-assemblies, cables, chips, etc. A single measure can only apply to one combination of components. Some equipment can have highly variable configurations, e.g. PC servers, disk arrays, tape libraries and so forth. Reliability specifications for the base system need to be adjusted for any additional elements beyond those in the configuration that was tested to derive the reliability parameter.

Standard Duty Cycle

Reliability numbers are predicated on an expected duty cycle and mode of operation. For example, a disk drive specification might be the mean hours of power-on operation between failures. The tests that generate the data underlying this number will involve variables such as the number of reads and writes and the data transferred per hour. Actual usage won't match this exact pattern, so field reliability may differ from the specification.

Constant Failure Rate

Failures are assumed to be distributed randomly across time at a constant rate. The chance of failure in the first hour of operation is exactly the same as in the thousandth, the ten thousandth or any other hour over the service life of the equipment. This corresponds to the flat region at the bottom of the famous "bathtub" curve: over the unit's service life, the number of failures per unit of time is constant.

The reliability parameter only applies to the flat bottom of the bathtub. The assumption is that the early failures associated with burn-in, defective components and assembly errors are all caught before the unit ships to the customer. Failures start to rise at the end of the product's life as it starts to wear out, rust or otherwise degrade. The reliability measure does not indicate when this degradation will occur or how quickly the curve rises.

Mean Time Between Failures is often confused with service life. The two numbers are completely unrelated. An MTBF greater than the service life does imply that the expected number of failures is less than one per unit over its lifetime, but it says absolutely nothing about how long this life is.
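To make the distinction concrete, here is a minimal sketch in Python. The MTBF and service life figures are hypothetical, chosen only for illustration, and the function name is not taken from any library.

```python
HOURS_PER_YEAR = 8760

def expected_failures_over_life(service_life_hours, mtbf_hours):
    """Expected number of failures per unit over its service life,
    assuming a constant failure rate (the flat part of the bathtub)."""
    return service_life_hours / mtbf_hours

# Hypothetical unit: MTBF of 500,000 hours, service life of 5 years.
mtbf_hours = 500_000
service_life_hours = 5 * HOURS_PER_YEAR

mtbf_in_years = mtbf_hours / HOURS_PER_YEAR
expected = expected_failures_over_life(service_life_hours, mtbf_hours)

print(f"MTBF expressed in years: {mtbf_in_years:.1f}")
print(f"Expected failures per unit over the service life: {expected:.4f}")
```

The MTBF works out to roughly 57 years, yet nobody would expect this unit to run that long. The figure only says that, across many such units, fewer than one in ten would be expected to fail during a five-year service life.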

Obtaining the Reliability Parameter

Manufacturers estimate the true reliability for the whole population by looking at a sample. The corresponding parameter for the whole population can be inferred statistically. The quality of the inference depends on all the assumptions discussed above and on two other conditions. The sample has to be representative of the total population and it has to be large enough to give a statistically valid result.

Usefulness of the Parameter

Unless a user has a really large installed base and the resources to conduct its own tests and reporting, the manufacturer's reliability figure is the only one available. Despite all the limitations, it provides useful information. Most manufacturers do not issue intentionally misleading specifications. They are prevented from doing so by their own sense of ethics as well as the presence of large actual and potential OEM customers who will be certain to verify the numbers.

Estimating Probability of Failure in a Given Time Period

From the reliability data one can calculate the probability that a given unit will operate without failure for a period of time or a number of cycles.

$$P(\bar{F}) = e^{-t/\theta}$$

$$P(F) = 1 - P(\bar{F})$$

where $P(\bar{F})$ is the probability that the unit will not fail in t increments of operation (time, cycles), $P(F)$ is the probability that it fails at least once in that interval, and $\theta$ is the average life expressed in units of operation per failure, e.g. MTBF.

[FIGURE 1 OMITTED]

In Excel, the EXP function will do this calculation.
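The same calculation can be sketched in a few lines of Python. The function names and the drive figures below are hypothetical and serve only to illustrate the formula; the constant failure rate assumption discussed earlier still applies.

```python
import math

def survival_probability(t, mtbf):
    """Probability of operating for t units without failure,
    assuming a constant failure rate: exp(-t / MTBF)."""
    return math.exp(-t / mtbf)

def failure_probability(t, mtbf):
    """Probability of at least one failure within t units of operation."""
    return 1.0 - survival_probability(t, mtbf)

# Hypothetical drive: MTBF of 500,000 power-on hours, run for one year.
p_fail_one_year = failure_probability(8760, 500_000)
print(f"Probability of failure in one year: {p_fail_one_year:.4f}")
```

For these assumed figures the probability of at least one failure in a year is a little over 1.7 percent. Note that t and the MTBF must be expressed in the same units of operation.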

Note that these probabilities are for one or more failures in the interval. The expected number of failures is higher because a unit might fail more than once. The expected number of failures is $t/\theta$ or $t\lambda$, where $\lambda$ is the failure rate in failures per unit of operation.

Figure 1 shows the difference between the probability of failure and the expected total number of failures for various fractions of MTBF.
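Since the figure is not reproduced here, the comparison can be tabulated with a short Python sketch. The chosen fractions of MTBF are arbitrary and the function name is hypothetical.

```python
import math

def compare_at_fraction(fraction_of_mtbf):
    """Return (probability of at least one failure, expected number of
    failures) after operating for the given fraction of the MTBF."""
    p_fail = 1.0 - math.exp(-fraction_of_mtbf)
    expected_failures = fraction_of_mtbf  # t / theta
    return p_fail, expected_failures

for fraction in (0.1, 0.5, 1.0, 2.0):
    p_fail, expected = compare_at_fraction(fraction)
    print(f"t = {fraction:>4} x MTBF   P(F) = {p_fail:.3f}   "
          f"expected failures = {expected:.1f}")
```

For short intervals the two numbers are nearly identical. After one full MTBF the expected number of failures is exactly one, but the probability of at least one failure is only about 63 percent, and the gap keeps widening from there.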

Combined Probability in a System

A system made up of multiple components will fail if any element fails. A desktop PC is only usable if the CPU, the keyboard and the monitor are all working. If reliability figures are available for each component then the probability of each component failing in a given period can be computed as above. If the probabilities of not failing for each component are $P(\bar{F}_1)$, $P(\bar{F}_2)$, $P(\bar{F}_3)$, etc., then the combined probability of the system not failing is

$$P(\bar{F}_S) = P(\bar{F}_1) \times P(\bar{F}_2) \times P(\bar{F}_3) \times \cdots$$

or, for n different non-redundant components,

$$P(\bar{F}_S) = \prod_{i=1}^{n} P(\bar{F}_i)$$

and

$$P(F_S) = 1 - P(\bar{F}_S)$$
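As a minimal sketch of the rule for non-redundant components, the Python below multiplies the survival probabilities of the desktop PC's parts. The component probabilities are invented for illustration, and the multiplication assumes that components fail independently of one another.

```python
import math

def series_survival(survival_probabilities):
    """Probability that a non-redundant system does not fail:
    every component must survive, so the probabilities multiply."""
    return math.prod(survival_probabilities)

# Hypothetical one-year survival probabilities: CPU, keyboard, monitor.
component_survival = [0.98, 0.99, 0.97]

p_system_survives = series_survival(component_survival)
p_system_fails = 1.0 - p_system_survives

print(f"P(system does not fail) = {p_system_survives:.4f}")
print(f"P(system fails)         = {p_system_fails:.4f}")
```

The system comes out less reliable than its weakest component, which is the general pattern whenever every element is required for operation.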

Probability of Failure in a Redundant System

If a component is duplicated in a system then the system will operate unless all redundant components fail.

$$P(F_S) = P(F_1) \times P(F_2) \times P(F_3) \times \cdots$$

or, for n redundant components,

$$P(F_S) = \prod_{i=1}^{n} P(F_i)$$

and

$$P(\bar{F}_S) = 1 - P(F_S)$$
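The redundant case can be sketched the same way in Python, this time multiplying failure probabilities. The mirrored-drive figures are hypothetical, and the calculation assumes the redundant units fail independently, which a shared power supply or a common environmental fault would violate.

```python
import math

def redundant_failure(failure_probabilities):
    """Probability that a redundant group fails:
    every redundant component must fail, so the probabilities multiply."""
    return math.prod(failure_probabilities)

# Hypothetical pair of mirrored drives, each with a 5% chance of
# failing during the period of interest.
p_pair_fails = redundant_failure([0.05, 0.05])
p_pair_survives = 1.0 - p_pair_fails

print(f"P(both drives fail)          = {p_pair_fails:.4f}")
print(f"P(at least one drive works)  = {p_pair_survives:.4f}")
```

Duplicating a component with a 5 percent failure probability brings the group's failure probability down to a quarter of one percent under these assumptions.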

Note that in the previous case we calculated the probability that the system would not fail over a given interval, but for the redundant system we work from the probability that components will fail to get the probability of system failure. The computation is set up in both cases so that we multiply the probabilities of events, all of which must occur to produce the indicated result.

Probability of Network Failure

Many complex systems consist of networks of components. Some of these components are redundant while others are critical paths so that any failure will bring the whole system down. The probability of failure for the entire network can be computed by combining the two rules given above. From the network diagram, look at each redundant pathway and compute the probability of failure for this pathway. Redraw the network diagram, replacing the redundant pathway with a single pathway having this combined failure probability. Then compute the probability that the critical path will not fail.
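The reduction procedure can be sketched in Python for a small example. The network below (a server, a pair of redundant switches and a disk array) and all of its failure probabilities are hypothetical and are not taken from the article's figure; independent failures are assumed throughout.

```python
import math

def parallel_failure(failure_probabilities):
    """Redundant pathway: fails only if every member fails."""
    return math.prod(failure_probabilities)

def series_failure(failure_probabilities):
    """Critical path: fails if any member fails."""
    p_survive = math.prod(1.0 - p for p in failure_probabilities)
    return 1.0 - p_survive

# Hypothetical network: server -> (switch A or switch B) -> disk array.
p_server = 0.02
p_array = 0.03

# Step 1: collapse the redundant pathway into a single equivalent element.
p_switch_pair = parallel_failure([0.05, 0.05])

# Step 2: treat the redrawn network as one critical path.
p_network_fails = series_failure([p_server, p_switch_pair, p_array])

print(f"P(switch pair fails) = {p_switch_pair:.4f}")
print(f"P(network fails)     = {p_network_fails:.4f}")
```

Larger networks are handled by repeating the first step for each redundant pathway, working from the innermost groups outward, before applying the second step once to the fully reduced diagram.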

[FIGURE 2 OMITTED]

For a real network there will usually be components for which no reliability data is available, like cables. Quite often these components are simply ignored when assessing reliability. This may be acceptable if experience indicates that the failure rate is negligible. A more rigorous approach is to build them into the model but assume perfect reliability, i.e. P(F)=0. After the model is built one can test the sensitivity of the network reliability to these components by changing this P(F) to a very small but positive number.
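A sensitivity test of this kind might look like the following Python sketch. The critical path (server, cable, disk array) and every probability in it are hypothetical, and the trial values for the cable are arbitrary.

```python
def network_failure(p_cable, p_server=0.02, p_array=0.03):
    """Failure probability of a hypothetical critical path made of a
    server, a cable and a disk array, assuming independent failures."""
    p_survive = (1.0 - p_server) * (1.0 - p_cable) * (1.0 - p_array)
    return 1.0 - p_survive

# Baseline: the cable is modeled as perfectly reliable, P(F) = 0.
baseline = network_failure(0.0)

# Replace the zero with small positive values and watch the result move.
for p_cable in (0.0, 0.0001, 0.001, 0.01):
    p_fail = network_failure(p_cable)
    print(f"P(F) cable = {p_cable:<7}  P(F) network = {p_fail:.5f}  "
          f"change = {p_fail - baseline:+.5f}")
```

If the network figure barely moves across plausible values, ignoring the component is defensible. If it moves noticeably, the missing reliability data is worth chasing down.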

Of All the Gin Joints in All the World, What's the Chance that the Network Will Fail in Mine?

The science of reliability measurement depends on reducing a flesh-and-blood device to a model that can be treated mathematically. The model allows failure to be treated as a fraction of a population. The whole population consists of all the operating hours or cycles in all the units in the whole world. The model provides useful information that can answer real-world problems, but it is only the shadow on the wall of the cave. The limitations of the model and its implications need to be considered just as carefully as the number.

Nick Harper, CFA, is vice president, Business Development at Spectra Logic Corporation (Boulder, CO)

www.spectralogic.com
COPYRIGHT 2003 West World Productions, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2003, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Title Annotation:Backup/Restore
Author:Harper, Nick
Publication:Computer Technology Review
Geographic Code:1USA
Date:Dec 1, 2003
Words:1494