What is MTTF, MTTR, MTTD, or MTBF? An introduction to incident and service metrics

In addition to the typical metrics that you may think of as being part of a service (CPU, instance count, disk, and so on), there is another class of metrics that tells you about the potential reliability of your service.

These are MTTF, MTTR, MTTD, and MTBF: Mean Time To Failure, Mean Time To Repair, Mean Time To Detection, and Mean Time Between Failures.

These are all metrics that cannot be observed directly. That is, you cannot take a single data point on a graph and say this is our MTTF. Each one is an average, so it takes at least two data points and must be computed.

Further, you need to decide over what time period you’ll compute it. Over the last year? Six months?
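To make the effect of the time period concrete, here is a minimal Python sketch. The incident timestamps are invented for illustration, and `mean_minutes_to_repair` is a hypothetical helper rather than part of any library. It computes the same average over two different windows.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident log: (went_down, came_back_up). Invented data.
INCIDENTS = [
    (datetime(2023, 1, 10, 9, 0), datetime(2023, 1, 10, 11, 0)),    # 120 minutes down
    (datetime(2023, 8, 2, 14, 0), datetime(2023, 8, 2, 14, 30)),    # 30 minutes down
    (datetime(2023, 11, 20, 3, 0), datetime(2023, 11, 20, 3, 30)),  # 30 minutes down
]


def mean_minutes_to_repair(incidents, window_end, window):
    """Average downtime in minutes for incidents that began inside the window."""
    window_start = window_end - window
    durations = [
        (up - down).total_seconds() / 60
        for down, up in incidents
        if window_start <= down <= window_end
    ]
    # With no incidents in the window there is nothing to average.
    return mean(durations) if durations else None


now = datetime(2023, 12, 31)
last_year = mean_minutes_to_repair(INCIDENTS, now, timedelta(days=365))
last_six_months = mean_minutes_to_repair(INCIDENTS, now, timedelta(days=182))
print(last_year, last_six_months)
```

With this invented data the one-year window averages 60 minutes and the six-month window averages 30, even though nothing about the service changed between the two calculations. Only the choice of window did.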

You may have seen a variety of acronyms associated with these metrics. Here are some that you’ll encounter:

MTTF - Mean Time To Failure. This is the average of how long something runs before it goes down. Since it’s of course up until it fails, this is often just “uptime” averaged over a period. In reliability engineering, this is intended for systems and components that can’t be repaired and are instead just replaced.

MTTR - Mean Time To Repair. This is the average of how long it takes for things to come back up once they are down. This time period represents all the work of repairing the component of the system.

MTTD - Mean Time To Detection. This is the average of how long it takes to realize something is down. For example, if something went down at 12:00, but no one noticed or was alerted until 12:10, then the time to detection was 10 minutes. If you had multiple incidents over time, you could average their detection times to get the MTTD.

MTBF - Mean Time Between Failures. Similar to MTTF, but for repairable items: the average of how long something runs between one failure and the next.
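As a rough illustration of how these averages fall out of incident records, here is a minimal Python sketch. The incidents are invented, and `incident_metrics` and `minutes` are hypothetical helpers. It also assumes that the time between failures is measured from the end of one outage to the start of the next. Some definitions measure from start to start instead, so treat that choice as an assumption.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incidents for one repairable service: (down, detected, up).
# Invented data. The first row mirrors the example of going down at 12:00
# and being noticed at 12:10.
INCIDENTS = [
    (datetime(2024, 3, 1, 12, 0), datetime(2024, 3, 1, 12, 10), datetime(2024, 3, 1, 13, 0)),
    (datetime(2024, 3, 11, 13, 0), datetime(2024, 3, 11, 13, 20), datetime(2024, 3, 11, 13, 40)),
    (datetime(2024, 3, 21, 13, 40), datetime(2024, 3, 21, 13, 46), datetime(2024, 3, 21, 14, 0)),
]


def minutes(delta):
    """Convert a timedelta to minutes."""
    return delta.total_seconds() / 60


def incident_metrics(incidents):
    """Return MTTD, MTTR and MTBF in minutes for a list of incidents."""
    incidents = sorted(incidents)  # oldest outage first
    mttd = mean(minutes(detected - down) for down, detected, _ in incidents)
    mttr = mean(minutes(up - down) for down, _, up in incidents)
    # Assumption: uptime runs from one recovery to the start of the next outage.
    uptimes = [
        minutes(later[0] - earlier[2])
        for earlier, later in zip(incidents, incidents[1:])
    ]
    mtbf = mean(uptimes) if uptimes else None  # needs at least two incidents
    return {"mttd_min": mttd, "mttr_min": mttr, "mtbf_min": mtbf}


print(incident_metrics(INCIDENTS))
```

For this invented data the detection times are 10, 20 and 6 minutes, giving an MTTD of 12 minutes. The MTTR is 40 minutes, and the MTBF is 14400 minutes, or 10 days.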

A warning about incident metrics

I primarily include these definitions so that you can be aware of what they are. It can be helpful/important to know about the existence of these metrics as you’ll often hear their use encouraged.

It’s also important to know that by using these metrics, you can make yourself blind to some more important things.

Most of these metrics come from reliability engineering, not software engineering. That means the physical world. Even there, it can be argued that many of these metrics aren’t appropriate. If one motor started rusting and that led to failure, would you expect the others to fail too? Well, it depends on the conditions, doesn’t it?

When we talk about people and their behavior in complex situations such as incidents and outages, these metrics become less and less relevant.

Putting too much effort or thought into these metrics whispers the lie that all incidents are the same, and that if you can control some of these factors then you can improve your incident response.

The problem is this isn’t true. At the very least it’s backwards. Fixing many other things may help these metrics improve. At worst, focusing on these metrics will keep you from ever asking the right questions, and keep you from getting the right answers.

So how do you begin improving the things that drive these metrics?

  1. Ask questions
  2. Understand that these metrics will never tell you the truth.

You can lay the groundwork in the same way you would for other disaster planning. Things happen that you don’t expect. All you can do is be well prepared for that.

Focus on things you can control, like how soon you can detect an incident. Then ask questions about that number.
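One way to start asking questions about a number like this is to look at the individual data points behind the average. The following minimal Python sketch uses invented detection times, and `summarize` is a hypothetical helper. It shows two sets of incidents that share the same mean but tell very different stories.

```python
from statistics import mean, median

# Hypothetical detection times in minutes for two services. Invented data.
steady = [9, 10, 10, 11, 10]
spiky = [2, 2, 2, 2, 42]


def summarize(times):
    """Return the mean alongside two other views of the same data."""
    return {"mean": mean(times), "median": median(times), "worst": max(times)}


for name, times in (("steady", steady), ("spiky", spiky)):
    print(name, summarize(times))
```

Both lists average 10 minutes, yet the second one hides a single incident that took 42 minutes to detect. That one incident is probably the one worth asking questions about.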

Questions you might wish to answer/know about your incidents and your team:
