MTBF (Mean Time Between Failures)


Thread Starter

ECS MOHAMMAD IRFAN

I have obtained the failure records for a unit from the manufacturer. The records include:

1. No. of units produced in a period of years
2. Quantity of units along with dates when the units had problems/failures.

Is there any known formula/method to calculate the MTBF (Mean Time Between Failures) value for this unit from the above information?

Best Regards

Mohammad Irfan
Siemens Pakistan
 

Robert R. Stephens Pennzoil Products

One formula in the Reliability handbook is MTBF = (Total No. of Units x Period of Test) / No. of Failures. The most accurate way to get the true MTBF is to plot the failures with Weibull software and read the characteristic life (MTBF) by inspection from the graph. Searching for "Weibull Curve Fitter" will turn up suitable software.
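For illustration only (not from the original post), here is a minimal Python sketch of both approaches; all quantities and the times-to-failure are made-up numbers, and SciPy's weibull_min is used for the curve fit:

# Hypothetical illustration of the two approaches above (all numbers are made up).
import numpy as np
from scipy.stats import weibull_min

# Handbook formula: MTBF = total units * period of test / number of failures
total_units = 500        # units produced and in service (assumed)
period_years = 3.0       # observation period per unit (assumed)
failures = 12            # confirmed failures in that period (assumed)
mtbf_simple = total_units * period_years / failures
print(f"Handbook MTBF estimate: {mtbf_simple:.0f} unit-years per failure")

# Weibull alternative: fit the times-to-failure and read the characteristic
# life (eta). For shape beta near 1, eta and the MTBF nearly coincide.
times_to_failure = np.array([0.4, 0.9, 1.1, 1.6, 2.0, 2.3, 2.7, 2.9])  # years, hypothetical
beta, loc, eta = weibull_min.fit(times_to_failure, floc=0)
print(f"Weibull shape = {beta:.2f}, characteristic life = {eta:.2f} years")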
 

John Peter Rooney

If each unit is working 24 hours/day, 7 days a week, then the calculations become simple.

The "At-Risk" Time is the sum of all the elapsed calendar time for all units. The number of failures should be confirmed as reliability related failures. Then, MTBF equals the At-Risk Time divided by the number of failures.

Example:
100 units installed.
One calendar year has elapsed (i.e. installation date was 21 March 1999).
The "At-Risk" time is 100 years = (100 units times one [1] calendar year).

Let the number of confirmed failures be two (2).
Then the MTBF is 100 years/2 = 50 years MTBF.
This is the point estimate, and you can put statistical confidence limits on this point estimate by the use of the Chi-Square approach, which will give you upper and lower limits at specified confidence levels; 60%, 90%, 95% and 99% are commonly used.
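A minimal Python/SciPy sketch of that calculation, using the example's numbers; the two-sided, time-truncated chi-square convention shown here is one common choice and is my assumption, not necessarily the one intended above:

# Point estimate and chi-square confidence limits for the example above:
# 100 units x 1 calendar year = 100 unit-years at risk, 2 confirmed failures.
from scipy.stats import chi2

T = 100.0   # at-risk time, unit-years
n = 2       # confirmed failures

print(f"Point estimate: {T / n:.1f} years MTBF")

# Two-sided limits, time-truncated convention (an assumption; conventions vary):
#   lower MTBF = 2T / chi2[1 - alpha/2, 2n + 2],  upper MTBF = 2T / chi2[alpha/2, 2n]
for conf in (0.60, 0.90, 0.95, 0.99):
    alpha = 1.0 - conf
    lower = 2 * T / chi2.ppf(1 - alpha / 2, 2 * n + 2)
    upper = 2 * T / chi2.ppf(alpha / 2, 2 * n)
    print(f"{conf:.0%} confidence: {lower:.1f} to {upper:.1f} years MTBF")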

At The Foxboro Company, we bar-code our units out and bar-code them in for repairs. We know when they left and when they returned, and most units are used 24 hours/day, 7 days/week. We then calculate the At-Risk time and divide by the number of true reliability failures: clerical errors, for example, are not reliability failures. We then publish a quarterly report from Quality Data Systems on the field-proven MTBFs per IEC 300, part 2-1.

If you need more, please see P. D. T. O'Connor's book, "Practical Reliability Engineering", John Wiley & Sons.

Sincerely,
John Peter
John Peter Rooney, ASQ CRE #2425
E-mail: [email protected]
 
If an exponential failure distribution is assumed (a reasonable assumption by theory and measurement), n failure events are included in the data for the measurement period T, and everything has been operated continuously, then

lambda (failure rate) = number of failures/T

MTBF=1/lambda

where T is the total operating time of all units (the records should be able to provide that to you)

For the upper confidence limit on the failure rate,

lambda_u= chisquared (alpha, 2*n)/(2*T)

where alpha is the level of significance, i.e., 100(1-alpha)% is the level of confidence. Sorry, that's a statistician's way of saying that if you want 90% confidence that you are below the failure rate lambda_u, then alpha is 0.1.

n is the number of failures


The lower bound of the MTBF is 1/lambda_u.

See www.sohar.com/meadep for more information.
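For reference, a short Python/SciPy sketch that applies the formula exactly as written above; T, n and alpha are hypothetical values:

# Upper confidence bound on the failure rate and lower bound on the MTBF,
# per the post's formula: lambda_u = chisquared(alpha, 2n) / (2T).
from scipy.stats import chi2

T = 5000.0     # total operating time of all units, hours (hypothetical)
n = 4          # number of failures (hypothetical)
alpha = 0.10   # 100*(1 - alpha)% = 90% confidence

lam_point = n / T                          # point estimate of the failure rate
lam_u = chi2.isf(alpha, 2 * n) / (2 * T)   # chi-square value with upper-tail area alpha
print(f"Point MTBF          : {1 / lam_point:.0f} hours")
print(f"90% lower-bound MTBF: {1 / lam_u:.0f} hours")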

Regards


--------------------------
Myron Hecht
SoHaR Incorporated
8421 Wilshire Blvd., Suite 201, Beverly Hills, CA 90211
email: [email protected]
Phone: (323) 653-4717 ext. 111   Fax: (323) 653-3624
Web Site Home Page: http://www.sohar.com
 

Vitor Finkel

Sorry Guys,

But what are those numbers good for?

Are all manufactured parts installed and running, or is a certain percentage of them on the shelf as spares, production inventory, etc.?

How can you possibly believe a manufacturer is aware of ALL the failures those parts ever had in the field? Ever noticed that some users would not return all damaged parts to the factory? Either because they still have enough spares, repair them in the field or elsewhere, or simply buy new parts when needed and scrap the old damaged ones when the quantities are not very high. Especially in countries where exporting a unit for repair and re-importing it afterwards requires a huge amount of paperwork and red tape, that is common procedure after the warranty period.

So why bother to look at those numbers if they may be way off from reality?

Vitor Finkel   [email protected]
P.O. Box 16061, 22222.970 Rio de Janeiro, Brazil
tel (+55) 21 285-5641   fax (+55) 21 205-3339
 

Rooney, John Peter

MTBF, like any other measurement, has uncertainties.

Victor Finkel asks,
Sorry Guys,

But what are those numbers good for?

See below.
If you do not calculate MTBFs, what do you use? Or do you not quantify the impact of failures on your process plant? If the gentleman in Pakistan knows how many units he has, how long they have run and how many units have "failed", then he can calculate the point estimate of the MTBF (without worrying about exponentials) and then ask the manufacturer what their worldwide field-proven MTBF numbers are. The issue is then quantified, and quantification is what sets apart the engineers from the liberal arts majors.

TRUE: (1) We adjust numbers on the basis of some units being in a dormant (spares) state. If you want a paper on the failure rate of dormant units, kindly contact me.

TRUE: (2) We do not receive back all of our failed units: some run into the customs barrier. Some failed units are given the "deep six", which is familiar to anyone who has served in the U.S. Navy. We receive back some 99% on "complex" and expensive units. On transmitters, we estimate some 40% return rate (depends upon the model code).

TRUE: (3) Hardware failures do not represent all the system-type failures out there. With fault-tolerant pairs we often see that both modules work and there is no obvious component cause. This is the source of much discussion when we calculate field-proven MTBFs.

TRUE: (4) Predicted values can differ greatly from field-proven MTBFs. This is old. I first saw an IEEE paper on this in the 1960s. If you want an IEEE paper by me on Field-Proven MTBFs (1994), just send me e-mail.

But, to
John Peter Rooney, ASQ CRE #2425
E-mail: [email protected]
 

John Peter Rooney

>MTBF, like any other measurement, has uncertainties.
Victor Finkel replied:

OK, I think we can take that for granted.
In the Instrumentation Industries we so often strive for less than 1% measurement accuracy.

Almost a decade ago I was replacing the wooden deck on my house and spending a lot of time cutting the wood to close tolerances. My neighbor noticed that and said it was typical of an engineer. A quarter inch is good enough!

What does 1% measurement accuracy in instrumentation have to do with cutting
wood for a deck?
Nothing.
What does 1% measurement accuracy in instrumentation have to do with reliability & MTBF?
Nothing.
Why use the inapplicable tolerances of instrumentation for deck building or for MTBF?

Since the late 1960s, Reliability Engineers have tracked the difference between predicted MTBFs and actual field-proven MTBFs. You can expect +300% better in some cases. Forget my papers: look at Tandem Computer, Cupertino, California, who put out an IEEE paper in 1994 which gave much higher MTBFs for digital boards than predicted and slightly higher MTBFs for power supplies than predicted. This has been the case for more than 30 years, so Victor Finkel is "beating a dead horse".
+++++++++++++++++++++++++++++++++++++++++++++++
JP Rooney said:
>If you do not calculate MTBFs, what do you use?
>Or do you not quantify the impact of failures on your process plant?
>If the gentleman in Pakistan knows how many units he has, how long they have run and how many units have "failed", then he can calculate the point estimate of the MTBF (without worrying about exponentials) and then ask the manufacturer what their worldwide field-proven MTBF numbers are.
>The issue is then quantified, and quantification is what sets apart the engineers from the liberal arts majors.

Victor Finkel replied:
Excuse me, but there is more than just quantification to set Engineers apart from whatever else you want to compare them with. I've seen MTBFs ranging up to 20,000 years on "critical mission" controllers, where common failures were not considered by the manufacturer in his calculations, and yet they were so obvious at a first look at the equipment. The real MTBF for that should probably be closer to 20 years (and that's already impressive).

How about that for a "Measurement Uncertainty"? A 10,000% error?

More than a century ago, Lord Kelvin said that you do not know something unless you can quantify it. This is especially true for engineers. You are not excused. Quantification sets the engineering discipline apart from the liberal arts. Victor Finkel appears to be running on "feelings" rather than attempting to understand what the analyst was doing when 20,000 years was given as a "critical mission" MTBF.

For example, we have control processors that operate in the duplex fault-tolerant mode. If you consider hardware only (software is a horse of a different color!), then a 20,000-year MTBF is fairly easy to achieve. When you go to triple modular redundancy, values of that magnitude are a given. If Victor Finkel's comment means that someone did not consider a common-mode fault, such as earthquake, fire or corrosion, then I can understand that. I have been told that my availability assessment reports are far too long because I write that I am not considering fork lifts running over cables, earthquakes, etc. Those things really belong in a Process Hazard Assessment rather than a reliability analysis.
+++++++++++++++++++++++++++++++++++++++++++++++
J.P. Rooney wrote:

>TRUE: (2) we do not receive back all of our failed units: some run into the
>customs barrier. Some failed units are given the "deep six" which is
>familiar to any one who has served in the U.S. Navy. We receive back some
>99% on "complex" and expensive units. On transmitters, we estimate some 40%
>return rate (depends upon the model code).

Victor Finkel replied:

OK, apparently the gentleman from Pakistan's message did not mention he would take that into consideration, so I am glad you are mentioning that here.

Some years ago, when Brazil had strong import restrictions on several industrial products, mainly electronics, I had vendors report that they had zero returns from here for some instruments. I don't think that kind of data would help to figure the device's MTBF.

Hey, Victor, it is even worse than that. I have had some of my company's field engineers calculate a field-proven MTBF of infinity when they had experienced zero (0) failures. Anyone who is a Certified Reliability Engineer knows that, as long ago as 1961, Igor Bazovsky, in his seminal textbook, made it a standard to increment zero failures to one failure for the division, since our number system does not like division by zero. On the other hand, we have had cases where the number of failures reported from Germany was very small until the key-word search was expanded to include "Deutschland".
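A tiny Python sketch of that zero-failure convention, with hypothetical numbers:

# Conservative field-proven MTBF: count zero failures as one, per the
# Bazovsky convention described above, so the estimate stays finite.
def field_mtbf(at_risk_hours, failures):
    return at_risk_hours / max(failures, 1)

print(field_mtbf(2500.0, 0))   # 2500.0 hours, not infinity (hypothetical)
print(field_mtbf(2500.0, 5))   # 500.0 hours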

Victor, are you trying to point out that some vendors are fallible, or are you saying that they are malevolent? Perhaps a little of both?
+++++++++++++++++++++++++++++++++++++++++++++++
J.P. Rooney wrote:

>TRUE: (3) hardware failures do not represent all the system type failures
>out there. With fault tolerant pairs we often see that both modules work and
>there is no obvious component cause. This is the source of much discussion
>when we calculate field proven MTBFs.
Victor Finkel replied:

Right, that can cause huge uncertainties in MTBF guesswork. Hot-standby CPUs for PLCs have a strange history of switching over well when tested, but so often failing to do so when a CPU failure really occurred.

That's not the point, Victor; rather, we receive back two (2) CPs when only one has failed, or perhaps neither CP has failed but the software was at fault. We count both returns as a conservative estimate of the field-proven MTBF for the Control Processor (CP). We do this to meet the requirement of IEC 300, the standard on "Dependability". However, this is not "huge" uncertainty but understandable root-cause variation in our field returns. We tell our customers up front, the way IEC 300 wants you to do, that there is uncertainty in the returns.
+++++++++++++++++++++++++++++++++++++++++++++++
and more:
>TRUE: (4) Predicted values can differ greatly from field-proven MTBFs. This is old.

Victor Finkel replied:

Indeed it is. Field conditions vary enormously: power supply voltage, corrosive atmospheres, ambient temperature and moisture, maintenance procedures, instrument air cleanliness and humidity, etc. My guess, based on actual experience, is that for field instrumentation (as opposed to control room instrumentation) MTBFs may vary by up to 10x between different installation sites.

Predicted values can hardly account for the impact of the field environment (often not anticipated or reported).

Until recently, MIL-HDBK-217 was the source for predicted MTBFs. MIL-217 has environmental factors that vary from ground benign to ground fixed to naval sheltered to rocket launch and uninhabited aircraft spaces. Having served in the United States Navy, I can tell you that a process plant, dumping a large cauldron, shakes and bounces like an aircraft carrier (U.S.S. Tarawa) in the North Atlantic. If we predict an MTBF for an air-conditioned control room, and the environment has vibration levels that are similar to ship-board, but not quite, the predictions will be off. MIL-HDBK-217 does not include an environment that is really similar to a control room in a process plant.

Further, corrosion is a major issue. Sulfur and sulfur compounds attack electronics in a non-linear fashion. These issues have to be handled on a plant by plant basis. I have seen mild environments in all buildings but one in an oil refinery in South America. The electronics in that building had a terrible failure rate, but, in the same plant, with the same atmosphere, other buildings had little trouble. Their MTBFs were much higher. Part of this was due to the maintenance of the chemical filtering system and some of it due to the prevailing winds!

However, we have an engineer who has been dealing with this subject for more than 20 years, so we are hardly uninformed. I can recommend the work in the ISA standard on environments, where the ISA lumps really bad environments into the GX category. You guessed it: the "X" truly stands for unknown.
+++++++++++++++++++++++++++++++++++++++++++++++


Victor Finkel then wrote:

I've been working with Safety Shutdown Systems, and that is a field where you have to calculate your MTBFs and prove that the designed system is within "statistically tolerable risks". I've seen good and bad systems designed before those new standards' demands, and I've seen good and bad systems designed according to those regulations. Believe me, there is more to Good Engineering Practice than pure quantification (especially when it comes from unreliable data).

The data are not unreliable; it appears to me that you have not considered all the difficulties involved in collecting and administering the data
effectively.

As we struggle to meet IEC 61508 SIL requirements, understanding is required, not casting aspersions on the life work of so many.
Sorry, ladies and gentlemen, for being so windy.

Sincerely,
John Peter Rooney, ASQ CRE #2425
E-mail: [email protected]
 
Just a point of Clarification

Yes, there are many uncertainties associated with MTBF. However, the reason the MTBF of an individual component is important in and of itself is maintenance planning (i.e., how many spares do I need, or how much time should I plan for a technician to be spending on this subsystem). When running a plant, what matters more is (a) system-level reliability for a fixed period (such as a shift or a PM interval), and/or (b) availability (if the plant is operating continuously).

In any well running installation, both numbers should be in the 98 to 99.x% range, and in such a case, 1% is meaningful. For example, a 1%
availability error is about 88 hours (more than 3.5 days) per year of downtime.

Availability and system-level reliability are the result of combining several quantities, and hence the errors associated with an individual subsystem or component have less effect. In fact, the restoration time (the sum of the repair time, the wait for the repair technician, and the wait for the spare if the technician doesn't have it available) generally has a far greater impact on availability than MTBF.
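A brief Python sketch of that arithmetic; the MTBF and restoration times are assumed values chosen only to show how strongly restoration time drives availability:

# Steady-state availability from MTBF and mean restoration time, and the
# downtime implied by a given availability figure (hypothetical numbers).
HOURS_PER_YEAR = 8760.0

def availability(mtbf_hours, restoration_hours):
    return mtbf_hours / (mtbf_hours + restoration_hours)

# Same MTBF, two restoration times: restoration time dominates the result.
for mttr in (4.0, 48.0):
    a = availability(mtbf_hours=8760.0, restoration_hours=mttr)
    downtime = (1 - a) * HOURS_PER_YEAR
    print(f"MTTR {mttr:>4.0f} h -> availability {a:.4%}, downtime {downtime:.1f} h/year")

# The "1% is about 88 hours" point: 1% of a year of continuous operation.
print(f"1% of a year = {0.01 * HOURS_PER_YEAR:.0f} hours")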
 
CAUTION:

When comparing different SISs, I recommend that you not rely on just MTBF (Mean-Time-Before-Flare, for those in HPI plants). This value can be very misleading when used without determining the Overall Integrity Factor. This is a measure of Reliability and Security capability factors as related to time. The former relates to an SIS's ability to provide the required action on demand. The latter relates to an SIS's ability not to cause unwarranted action (false alarms).

An example: when using MTBF, a triplicated voting system is always more reliable than a non-voting system. However, when assessing reliability over time, TMR is only more reliable than simplex until "time" reaches about 70% of the objective "mission" time. From that point forward, TMR is less reliable.
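That crossover can be checked with a few lines of Python; the constant-failure-rate (exponential) model and the reading of "mission time" as the single-channel MTBF are my assumptions for this sketch:

# Reliability of a 2-out-of-3 (TMR) voter versus a single (simplex) unit,
# assuming independent channels with constant failure rate lambda.
import math

def r_simplex(t, lam):
    return math.exp(-lam * t)

def r_tmr(t, lam):
    r = r_simplex(t, lam)
    return 3 * r**2 - 2 * r**3        # any 2 of the 3 channels still working

lam = 1.0 / 10000.0                   # failures per hour (hypothetical)
for frac in (0.5, 0.69, 0.9):         # fractions of the MTBF = 1/lambda
    t = frac / lam
    print(f"t = {frac:.0%} of MTBF: simplex {r_simplex(t, lam):.3f}, TMR {r_tmr(t, lam):.3f}")

# The curves cross where exp(-lam*t) = 0.5, i.e. t = ln(2)/lam, roughly 69%
# of the single-channel MTBF -- close to the ~70% figure quoted above.
print(f"Crossover at t = {math.log(2) / lam:.0f} hours")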

Regards,
Phil Corso, PE
Trip-A-Larm Corp

ps: Perception is Everything
 
Where can I find the derivation of the formula

lambda_u= chisquared (alpha, 2*n)/(2*T)

Thanks
Maheendra
 