Bitflips

Filed under: Uncategorized — mgrusin at 9:43 am on Thursday, October 8, 2009

My first space “work” was with what were, back then, large-scale DRAM memory chips. (By “large” I’m talking 4096 bits per chip; this was the very beginning of the personal computer revolution. You kids get off my lawn!) Shrinking the circuitry was increasing the capacity of these devices by leaps and bounds, but unfortunately it was also causing an alarming rise in random, radiation-induced errors. Smaller circuitry holds a smaller amount of charge, making it increasingly susceptible to the charged particles that are always zipping around and through us. A charged particle hits a memory cell, and suddenly what was once a “1” is now a “0”, and your bank balance (or anti-lock braking system, or rocket ignitor) is now better or (more likely) worse than you left it.
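To make that concrete, here’s a tiny Python sketch of what a single flipped bit does to a stored value (the balance and the particular bit that gets hit are, of course, made up for illustration):

    balance = 1000                              # the value as it was written to memory
    bit_to_flip = 9                             # suppose a stray particle hits bit 9
    corrupted = balance ^ (1 << bit_to_flip)    # XOR toggles exactly that one bit
    print(bin(balance), "->", bin(corrupted))   # 0b1111101000 -> 0b111101000
    print(balance, "->", corrupted)             # 1000 -> 488

One cell changes state, and the number you stored is simply no longer the number you get back.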

At the time, IBM researchers found that most of these errors were coming from impurities in the chip packaging itself, which is a solvable problem. But a small number of errors were caused by cosmic radiation: high-energy particles generated by supernovae and other events in deep space. The Earth’s atmosphere and magnetic field shield us from most of these particles, but a few will always get through. While reading about this in high school, it occurred to me that this could be a serious problem for computers in space. So I wrote up an experiment proposal for NASA’s Space Shuttle Student Involvement Project, and was accepted as a finalist. (While I was presenting my work at NASA Ames, an engineer asked me if he could blame all of his bugs on cosmic rays. I replied with what I had calculated as the error rate for a single 4-kilobit chip: “Sure, once every sixteen years.”) The problem remains, but most hardware manufacturers consider it a one-in-a-million event and pretend it doesn’t exist. Crashes due to software bugs are vastly more common.
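For the curious, that figure is easy to sanity-check. The sixteen years and the 4096 bits come straight from the story above; everything else in this little back-of-the-envelope Python calculation is just unit conversion:

    bits_per_chip   = 4096
    years_per_error = 16
    hours_per_year  = 24 * 365

    errors_per_chip_hour = 1 / (years_per_error * hours_per_year)   # ~7.1e-06
    errors_per_bit_hour  = errors_per_chip_hour / bits_per_chip     # ~1.7e-09

    print(f"{errors_per_chip_hour:.1e} errors per chip-hour")
    print(f"{errors_per_bit_hour:.1e} errors per bit-hour")

A vanishingly small rate per bit, but scale it up to a machine with billions of bits of DRAM and the “one-in-a-million event” starts happening on a human timescale.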

Thus it is interesting to hear of a new long-term study by Google on DRAM errors. Google has so much computing capacity that it is uniquely positioned to perform such research (and it’s gratifying that they take the time to do so). Surprisingly, the study found error rates 15 times higher than previously expected, but it also found that most of those errors were coming from the same group of chips, suggesting that manufacturing variability is a cause. The study also found that age is a factor, with error rates increasing sharply after about 20 months. This could be because chip designers are cutting their silicon margins to the bone, knowing that, thanks to Moore’s law, commercial hardware has a very limited lifetime these days.

What I took away from that early work was that nothing is perfect. If you’re programming a computer and you put a byte of data into memory, the chances are extremely good that you’ll get the same byte back. But it’s not 100% guaranteed. And if it’s VERY IMPORTANT that you get the same byte back, you should invest in fault tolerance of some sort. Spacecraft computers work around this problem with specially hardened construction, redundant “voting” systems, and error-correcting codes in memory (an effective solution to the DRAM error problem, though it costs an extra 25% of storage to hold the correcting information). I sometimes wonder why these reliability lessons from space aren’t incorporated more into our everyday machines, but until software quality improves, it’s usually cheaper to just reboot.
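For anyone who hasn’t run into error-correcting codes before, here’s a minimal Python sketch of the idea using the classic Hamming(7,4) code: four data bits get three check bits, and any single flipped bit can be located and corrected. (The encode/decode helpers below are just for illustration; real ECC memory uses wider codes over whole 32- or 64-bit words, and those extra check bits are the storage overhead mentioned above.)

    def encode(d):
        """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4                     # check bit over positions 3, 5, 7
        p2 = d1 ^ d3 ^ d4                     # check bit over positions 3, 6, 7
        p3 = d2 ^ d3 ^ d4                     # check bit over positions 5, 6, 7
        return [p1, p2, d1, p3, d2, d3, d4]   # codeword positions 1..7

    def decode(c):
        """Correct up to one flipped bit and return the 4 data bits."""
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]        # recheck p1
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]        # recheck p2
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]        # recheck p3
        syndrome = s1 + 2 * s2 + 4 * s3       # 0 = clean; otherwise the bad position
        if syndrome:
            c = c[:]                          # work on a copy
            c[syndrome - 1] ^= 1              # flip the bad bit back
        return [c[2], c[4], c[5], c[6]]       # extract d1..d4

    word = [1, 0, 1, 1]
    stored = encode(word)
    stored[4] ^= 1                            # a stray particle flips one bit...
    assert decode(stored) == word             # ...and the data still comes back intact

The same principle, scaled up and implemented in hardware, is what lets ECC memory quietly shrug off the occasional cosmic ray.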