Hard disk failure data from Google



Surprising bit from the article. That PDF looks interesting:
The report said that there was a clear trend showing “that lower temperatures are associated with higher failure rates”.

“Only at very high temperatures is there a slight reversal of this trend.”

But hard drives which are three years old and older were more likely to suffer a failure when used in warmer environments.

Very surprising indeed. Then again, I had always assumed a high temperature for a hard disk to be 90°F and up.

However, having skimmed through the PDF, and working in a job where I’m required to do a lot of work with data centers, all I can say is JESUS, those guys monitor the shit out of everything in there if they can get this much data for that study.

I’m a little paranoid about hard drive failure. Since I started encoding my CDs back in 1993, I must have re-encoded my entire collection three or four times, partly because the old CDs back then had lousy failure rates, lasting maybe 8 years. I take reasonable care of my CDs, placing them in jewel cases or lined books. I don’t store them in climate-controlled rooms.

These are all my own opinions:

CPUs I allow to run in the 60s or 70s (C) comfortably, dialing down the fan for noise. With hard drives I feel uncomfortable in the 40s. I like to place them directly in front of intake fans, stuck at the bottom of the case with bits of sorbothane.

A firmware engineer for one of the big companies tells me one of the common failure points is in the electronics, in which case you’d see failure very early. This seems to be supported by Google’s data, with its early spikes. Another “intuitive” bit in the study: once you see a bad cluster, move that junk off the drive, because it’s gonna die.
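
For what it’s worth, here is a rough sketch of how you could automate that “see a bad sector, get the data off” rule. It assumes you already have the text of a SMART attribute table in the 10-column layout that `smartctl -A` (smartmontools) prints. The function names `parse_smart_attributes` and `should_evacuate`, the sample table, and the choice of which attributes to watch are my own assumptions, not something taken from the Google paper.

```python
def parse_smart_attributes(text):
    """Parse a SMART attribute table into {attribute_name: raw_value}.

    Assumes the 10-column layout printed by `smartctl -A`:
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = fields[9]  # first token of RAW_VALUE, e.g. "36" in "36 (Min/Max 20/45)"
        if raw.isdigit():
            attrs[fields[1]] = int(raw)
    return attrs


# Hypothetical watch list: attributes that count bad or suspect sectors.
BAD_SECTOR_ATTRIBUTES = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def should_evacuate(attrs):
    """Return True if any watched bad-sector counter is above zero."""
    return any(attrs.get(name, 0) > 0 for name in BAD_SECTOR_ATTRIBUTES)


# Made-up sample table, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
194 Temperature_Celsius     0x0022   036   045   000    Old_age   Always       -       36 (Min/Max 20/45)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    attrs = parse_smart_attributes(SAMPLE)
    print(attrs)
    print("evacuate:", should_evacuate(attrs))
```

The zero threshold is deliberately paranoid. Whether a single reallocated sector really justifies pulling a drive is a judgment call, so treat the watch list as a starting point.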

My hard drives at home typically run at 35C to 45C. I put them in front of a slow intake fan, as close to the bottom of the case as possible. Google’s data shows higher failure rates around 30C. What ambient or case temperatures were these drives in? I don’t have experience in server rooms.

They don’t mention that specifically. But since the temperature ranges are all over the board, I’m guessing that it’s a data center without floor-blown cooling. So a rack that isn’t near an air handling unit might have higher temperatures than, say, one near the unit. The overhead type of cooling leads to more hotspots (in my opinion).

The side problem is that no two racks tend to get built out the same once equipment gets placed. In Google’s case maybe they have a standard form factor, but since they use hard drives that are off the shelf and vary in size, I’m guessing they use enclosures that are different sizes too. So one case may run hotter than another, and those furthest from the cooling source would run very hot.

Probably the strangest thing I’ve seen is when a data center loses all cooling for an extended period. You get to see exactly what breaks down and when. Surprisingly, I’ve seen a large storage array full of hard disks come through extended periods like that with NO problem. Servers, on the other hand … not so much.

For most of you, any hard drives you have in your place of residence will most likely be in the 40C to 60C temperature range. So keep those fans blowing/sucking air over your hard drives.

Those of you who cool your homes down to the ambient temperature of a data center (~20C, 50% humidity) are fucked. :-)

Those are amazing results. However, do note the following:

What stands out are the 3 and 4-year old drives, where the trend for higher failures with higher temperature is much more constant and also more

We can conclude that at moderate temperature ranges it is likely that there are other effects which affect failure rates much more strongly than temperatures do.

Plausible explanation: those super-cool drives that fail early do so because they were run into the ground on extremely heavy duty, which is also why they received so much cooling in the first place. Correlation, not causation…
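
To make that guess concrete, here is a toy simulation, purely a sketch under invented numbers. It assumes half the drives are heavily loaded, that heavily loaded drives get the aggressive cooling (so they run cooler), and that failure odds depend on workload only. The function name `simulate` and every probability and temperature in it are made up. None of it comes from Google’s data.

```python
import random


def simulate(n=50000, seed=1):
    """Simulate drives where workload, not temperature, drives failures.

    All numbers here are invented for illustration.
    Returns the observed failure rate for "cool" and "warm" drives.
    """
    rng = random.Random(seed)
    counts = {"cool": [0, 0], "warm": [0, 0]}  # [drives, failures]
    for _ in range(n):
        heavy = rng.random() < 0.5
        # Heavily used drives get the aggressive cooling, so they run cooler.
        temp_c = rng.gauss(28, 3) if heavy else rng.gauss(38, 3)
        # Failure chance depends on workload only; temp_c is never consulted.
        failed = rng.random() < (0.10 if heavy else 0.03)
        group = "cool" if temp_c < 33 else "warm"
        counts[group][0] += 1
        counts[group][1] += failed
    return {g: fails / drives for g, (drives, fails) in counts.items()}


if __name__ == "__main__":
    for group, rate in simulate().items():
        print(f"{group}: {rate:.3f}")
```

Even though temperature never enters the failure calculation, the cool group ends up with the higher failure rate, simply because that is where the hard-worked drives are. That does not prove this is what happened at Google, only that the charts alone can’t rule it out.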

Anyway, my hard disk is usually around 35°C which is the optimal temperature no matter how you interpret those charts. :p