SSD Life Expectancy

Just bought an expensive SSD and want to get scared? Wumpus has posted a list of SSD failures in his own and colleagues’ systems.

I bought a set of three Crucial 128 GB SSDs in October 2009 for the original two members of the Stack Overflow team plus myself. As of last month, two out of three of those had failed. And just the other day I was chatting with Joel on the podcast (yep, it’s back), and he casually mentioned to me that the Intel SSD in his Thinkpad, which was purchased roughly around the same time as ours, had also failed.

Portman Wills, friend of the company and generally awesome guy, has a far scarier tale to tell. He got infected with the SSD religion based on my original 2009 blog post, and he went all in. He purchased eight SSDs over the last two years … and all of them failed. The tale of the tape is frankly a little terrifying:

Super Talent 32 GB SSD, failed after 137 days
OCZ Vertex 1 250 GB SSD, failed after 512 days
G.Skill 64 GB SSD, failed after 251 days
G.Skill 64 GB SSD, failed after 276 days
Crucial 64 GB SSD, failed after 350 days
OCZ Agility 60 GB SSD, failed after 72 days
Intel X25-M 80 GB SSD, failed after 15 days
Intel X25-M 80 GB SSD, failed after 206 days 

Says he’ll keep buying SSDs anyway, even though they fail after one year on average!

Just based on the posts here, I can tell SSDs are not quite ready for prime time. It’s not an upgrade I plan to make anytime soon.

If you are using it as an OS partition rather than a data drive, failure is more an annoyance than a catastrophic event. A major annoyance, especially considering how expensive the drives are. But at least it’s not a data-loss nightmare.

Should be ok in Europe, though. 2 year warranty on everything!

I bought an iMac last June and added one of the then-new SandForce drives from OWC (a 240GB Mercury Extreme Pro model). I sold that iMac a couple of months ago, and just got an email from the buyer that the drive bit the dust – right after I had purchased one of the new-generation 6 Gb/s drives from OCZ (Vertex 3) for my laptop. I hope this one lasts longer than that one did…

So what happened to all those “will last for umpteen years” claims? I mean, I’d expect things from dodgy performance-oriented companies like OCZ to die well before expected – gamer-type people really expect stuff to fail regularly – but I’m shocked that Intel drives would be flaky. Intel customers are stolid, mainstream types.

I don’t think of OCZ as dodgy, but yeah, it’s weird that a drive with a 2,000,000 hour MTTF and 3 year warranty (like my failed OWC and new Vertex 3 both have) would die so quickly.

Given that my first drive failed after 1 week of operation, I’m now backing up its warrantied replacement more often than I backup my media drive. Paranoia will save me, I’m sure of it.

Of course, getting to pay out the nose for shipping and waiting the better part of two weeks for the whole process to complete was an utter treat. These things fail constantly; you’d think they’d have streamlined the process a bit.

I think the rated hours relate to the memory’s endurance… um… how many times a bit can flip. It’s why garbage collection wears the drive out faster. It doesn’t appear to be related to the actual durability of the whole drive “system” taken together.
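That mechanism is usually called write amplification: garbage collection copies still-valid pages out of a block before erasing it, so the flash programs more data than the host ever wrote. A rough sketch of the effect, with made-up numbers (the 1.8x amplification factor and 10,000-cycle rating below are illustrative, not from any datasheet):

```python
# Write amplification: ratio of data the NAND actually programs to the
# data the host asked to write. Garbage collection inflates this ratio
# by relocating still-valid pages before a block erase.
host_writes_gb = 100.0   # what the OS wrote (illustrative)
nand_writes_gb = 180.0   # what the NAND actually programmed (illustrative)

write_amplification = nand_writes_gb / host_writes_gb   # 1.8x

# Effective endurance shrinks by the same factor:
rated_pe_cycles = 10_000                  # assumed per-cell write-cycle rating
effective_cycles = rated_pe_cycles / write_amplification

print(f"write amplification: {write_amplification:.1f}x")
print(f"effective write cycles: {effective_cycles:.0f}")  # ~5556
```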

Where the fuck do the MTBF stats that the companies announce come from, then?
Pure lies? :(

edit: this was useful: 5th USENIX Conference on File and Storage Technologies - Paper

Conclusion

Many have pointed out the need for a better understanding of what disk failures look like in the field. Yet hardly any published work exists that provides a large-scale study of disk failures in production systems. As a first step towards closing this gap, we have analyzed disk replacement data from a number of large production systems, spanning more than 100,000 drives from at least four different vendors, including drives with SCSI, FC and SATA interfaces. Below is a summary of a few of our results.

Large-scale installation field usage appears to differ widely from nominal datasheet MTTF conditions. The field replacement rates of systems were significantly larger than we expected based on datasheet MTTFs.
For drives less than five years old, field replacement rates were larger than what the datasheet MTTF suggested by a factor of 2-10. For five to eight year old drives, field replacement rates were a factor of 30 higher than what the datasheet MTTF suggested.
Changes in disk replacement rates during the first five years of the lifecycle were more dramatic than often assumed. While replacement rates are often expected to be in steady state in year 2-5 of operation (bottom of the “bathtub curve”), we observed a continuous increase in replacement rates, starting as early as in the second year of operation.

Fascinating stuff.
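For what it’s worth, a datasheet MTTF can be turned into an expected annual failure rate with a quick back-of-envelope calculation. This sketch assumes the constant-failure-rate (exponential) model that MTTF figures are based on, using the 2,000,000-hour claim mentioned earlier in the thread:

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760

def annual_failure_rate(mttf_hours):
    """Probability a given drive fails within one year, under the
    constant-failure-rate (exponential) model behind datasheet MTTF."""
    return 1 - math.exp(-HOURS_PER_YEAR / mttf_hours)

# The 2,000,000-hour MTTF claimed for drives in this thread:
datasheet_afr = annual_failure_rate(2_000_000)   # ~0.44% per year

# The study above saw field replacement rates 2-10x the datasheet figure:
print(f"datasheet AFR: {datasheet_afr:.2%}")
print(f"field AFR (2-10x): {2 * datasheet_afr:.2%} to {10 * datasheet_afr:.2%}")
```

Even ten times the datasheet rate is only a few percent a year, nowhere near the failure rate in Portman’s sample, which suggests something other than ordinary wear (bad firmware, bad batches, or plain bad luck).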

It’s also interesting to note that the Kroll Ontrack/IBAS guys say that recovering data from flash-based media is much, much harder than from hard drives, so don’t use them for data storage.

My first one (more than a year old) is now my games drive and the new Corsair is my main OS/CS5-drive… and I back up every night, so with my EU warranty I should be fine and the extra speed is worth it.

I have bought more than that, over as long a period if not longer (though mostly Intel), and not had a single SSD fail yet. A G.Skill Titan stuck in a laptop, however, has degraded so badly performance-wise that I’m tempted to just junk it.

Is the speed difference worth it? Haha. I break out into hives now every time I have to use a computer with a spinning platter.

Man, this runs completely contrary to the calculations Anandtech did in their original X25-M review:

http://www.anandtech.com/show/2829/6

If I never install another application and just go about my business, my drive has 203.4GB of space to spread out those 7GB of writes per day. That means, in roughly 29 days, if my SSD wear levels perfectly, I will have written to every single available flash block on my drive. Tack on another 7 days if the drive is smart enough to move my static data around to wear level even more properly. So we’re at approximately 36 days before I exhaust one out of my ~10,000 write cycles. Multiply that out and it would take 360,000 days of using my machine the way I have been for the past two weeks for all of my NAND to wear out; once again, assuming perfect wear leveling. That’s 986 years. Your NAND flash cells will actually lose their charge well before that time comes, in about 10 years.
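The arithmetic in that quote is easy to reproduce; this sketch just re-runs its numbers (203.4GB of free space, 7GB of writes per day, an assumed ~10,000 write cycles, and perfect wear leveling):

```python
free_space_gb = 203.4     # usable free space, from the quote
writes_per_day_gb = 7.0   # observed daily writes, from the quote
pe_cycles = 10_000        # assumed write cycles per flash cell, from the quote

# Days to write every free block once, assuming perfect wear leveling:
days_per_pass = round(free_space_gb / writes_per_day_gb)  # ~29
days_per_pass += 7        # the quote adds ~7 days if static data is rotated too

total_days = days_per_pass * pe_cycles                    # 360,000 days
print(f"~{total_days / 365:.0f} years")                   # ~986 years
```

The catch is that this models only cell wear; as others point out in this thread, controllers, solder joints, and firmware fail on their own schedules.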

I mean, at least the good news is that you won’t lose your data when an SSD fails, unlike an HDD. Hopefully my X25-M gen2 fares better…

Are you sure you don’t lose your data when it fails? Yes, in theory it just can’t rewrite; but in theory they shouldn’t be failing this fast.

People who have had them fail: Were you able to pull all the data off afterward?

Thanks for raising my awareness of this. It’s making me seriously consider getting AppleCare for my new-ish (2 months) MacBook Air. Let Apple deal with replacing the 256GB drive when (if? nah, when…) it fails.

Apple’s Air’s SSDs are caseless and soldered onto the board. You’re stuck with an Apple repair regardless.

It doesn’t. You are assuming that the flash drive failures are a result of cell wear. But that might not be the case. It could be manufacturing defects in the memory chips, weak solder joints, controller chips dying, capacitors exploding. Who knows?

Cell wear is only part of the lifespan equation.

Calculating when all of the flash memory goes bad doesn’t really mean anything since you’ll start hitting blocks that fail early and become unreadable long before that point. Being able to remap the block doesn’t help you get that lost data back, and you’re not going to keep using a drive that keeps randomly losing data.

I don’t know. Wumpus’ experience is what it is, but if SSDs were going down after a year or so I think there would be a lot more internet anger… I haven’t searched, but you’d expect lengthy threads of doom on HardOCP etc. And essentially useless time-to-failure numbers are a much bigger story than, say, OCZ using a poorer 25nm process or Intel not implementing TRIM, both of which did intrude on my consciousness.

So to be honest I’m a bit sceptical.

Going off to back up now :)

In Wumpus’ defense, he is probably working with both the best case for speed benefits and the worst case for wear and tear, which is compiling. Tons of tiny intermediate files can be generated every compile cycle, and lots of programmers like to hit compile as often as they breathe.