Attack of the Cosmic Rays!
Posted in System administration on June 24th, 2010 by Nelson Elhage – 102 Comments
It’s a well-documented fact that RAM in modern computers is susceptible to occasional random bit flips due to various sources of noise, most commonly high-energy cosmic rays. By some estimates, you can even expect error rates as high as one error per 4GB of RAM per day! Many servers these days have ECC RAM, which uses extra bits to store error-correcting codes that let them correct most bit errors, but ECC RAM is still fairly rare in desktops, and unheard-of in laptops.
For me, bitflips due to cosmic rays are one of those problems I always assumed happen to “other people”. I also assumed that even if I saw random cosmic-ray bitflips, my computer would probably just crash, and I’d never really be able to tell the difference from some random kernel bug.
A few weeks ago, though, I encountered some bizarre behavior on my desktop that honestly just didn’t make sense. I spent about half an hour digging to discover what had gone wrong, and eventually determined, conclusively, that my problem was a single undetected flipped bit in RAM. I can’t prove whether the problem was due to cosmic rays, bad RAM, or something else, but in any case, I hope you find this story interesting and informative.
The problem
The symptom that I observed was that the expr program, used by shell scripts to do basic arithmetic, had started consistently segfaulting. This first manifested when trying to build a software project, since the GNU autotools make heavy use of this program:
[nelhage@psychotique]$ autoreconf -fvi
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force -I m4
autoreconf: configure.ac: tracing
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
…
dmesg revealed that the segfaulting program was expr:
psychotique kernel: [105127.372705] expr[7756]: segfault at 1a70 ip 0000000000001a70 sp 00007fff2ee0cc40 error 4 in expr
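For readers who have not met this log format before, here is a minimal Python sketch that pulls the fields out of the line above. The regular expression and the helper name parse_segfault are illustrative, not part of any tool used in this post. The sketch assumes the kernel prints the error value in hexadecimal, and it decodes only the three low bits of the x86 page-fault error code (page present, write access, user mode).

```python
import re

# Hypothetical pattern matching the x86 "segfault at ..." line shown above.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)

def parse_segfault(line):
    """Return the fields of a segfault log line as a dict, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)  # assumption: the error code is printed in hex
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),   # address whose access faulted
        "ip": int(m.group("ip"), 16),       # instruction pointer at the fault
        "sp": int(m.group("sp"), 16),       # stack pointer at the fault
        "page_present": bool(err & 1),      # bit 0: fault on a present page
        "write": bool(err & 2),             # bit 1: access was a write
        "user_mode": bool(err & 4),         # bit 2: fault came from user mode
    }

LINE = ("psychotique kernel: [105127.372705] expr[7756]: segfault at 1a70 "
        "ip 0000000000001a70 sp 00007fff2ee0cc40 error 4 in expr")
info = parse_segfault(LINE)
print(info["comm"], hex(info["addr"]), info["addr"] == info["ip"])
```

Read this way, the line describes a user-mode, non-write access to a page that was not mapped, with the faulting address equal to the instruction pointer. That is what one would expect if the program had jumped to a bogus address.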
And I was easily able to reproduce the problem by hand:
[nelhage@psychotique]$ expr 3 + 3
Segmentation fault
expr definitely hadn’t been segfaulting as of a day ago or so, so something had clearly gone suddenly, and strangely, wrong. I had no idea what, but I decided to find out.
Check the dumb things
I run Ubuntu, so the first things I checked were the /var/log/dpkg.log and /var/log/aptitude.log files, to determine whether any suspicious packages had been upgraded recently. Perhaps Ubuntu accidentally let a buggy package slip into the release. I didn’t recall doing any significant upgrades, but maybe dependencies had pulled in an upgrade I had missed.
The logs revealed I hadn’t upgraded anything of note in the last several days, so that theory was out.
Next up, I checked env | grep ^LD. The dynamic linker takes input from a number of environment variables, all of whose names start with LD_. Was it possible I had somehow ended up setting some variable that was messing up the dynamic linker, causing it to link a broken library or something?
[nelhage@psychotique]$ env | grep ^LD
[nelhage@psychotique]$
That, too, turned up nothing.
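The same check can be scripted. This is a small Python sketch, not something used in the original debugging session. The function name linker_variables is made up for illustration, and it filters on the LD_ prefix described above rather than on the slightly broader ^LD pattern that grep was given.

```python
import os

def linker_variables(environ):
    """Return (name, value) pairs whose names start with LD_, sorted by name."""
    return sorted((k, v) for k, v in environ.items() if k.startswith("LD_"))

# Print whatever the current process inherited; no output means nothing is set.
for name, value in linker_variables(os.environ):
    print(f"{name}={value}")
```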
Start digging
I was fortunate in that, although this failure was strange and sudden, it seemed perfectly reproducible, which meant I had the luxury of being able to run as many tests as I wanted to debug it.
The problem is a segfault, so I decided to pull up a debugger and figure out where it’s segfaulting. First, though, I’d want debug symbols, so I could make heads or tails of the crashed program. Fortunately, Ubuntu provides debug symbols for every package they ship, in a separate repository. I already had the debug sources enabled, so I used dpkg -S to determine that expr belongs to the coreutils package:
[nelhage@psychotique]$ dpkg -S $(which expr)
coreutils: /usr/bin/expr
And installed the coreutils debug symbols:
[nelhage@psychotique]$ sudo aptitude install coreutils-dbgsym
Now, I could run expr inside gdb, catch the segfault, and get a stack trace:
[nelhage@psychotique]$ gdb --args expr 3 + 3
…
(gdb) run
Starting program: /usr/bin/expr 3 + 3
Program received signal SIGSEGV, Segmentation fault.
0x0000000000001a70 in ?? ()
(gdb) bt
#0 0x0000000000001a70 in ?? ()
#1 0x0000000000402782 in eval5 (evaluate=true) at expr.c:745
#2 0x00000000004027dd in eval4 (evaluate=true) at expr.c:773
#3 0x000000000040291d in eval3 (evaluate=true) at expr.c:812
#4 0x000000000040208d in eval2 (evaluate=true) at expr.c:842
#5 0x0000000000402280 in eval1 (evaluate=<value optimized out>) at expr.c:921
#6 0x0000000000402320 in eval (evaluate=<value optimized out>) at expr.c:952
#7 0x0000000000402da5 in main (argc=2, argv=0x0) at expr.c:329
So, for some reason, the eval5 function has jumped off into an invalid memory address, which of course causes a segfault. Repeating the test a few times confirmed that the crash was totally deterministic, with the same stack trace each time. But what is eval5 trying to do that’s causing it to jump off into nowhere? Let’s grab the source and find out:
[nelhage@psychotique]$ apt-get source coreutils
[nelhage@psychotique]$ cd coreutils-7.4/src/
[nelhage@psychotique]$ gdb --args expr 3 + 3
# Run gdb, wait for the segfault
(gdb) up
#1 0x0000000000402782 in eval5 (evaluate=true) at expr.c:745
745 if (nextarg (":"))
(gdb) l
740 trace ("eval5");
741 #endif
742 l = eval6 (evaluate);
743 while (1)
744 {
745 if (nextarg (":"))
746 {
747 r = eval6 (evaluate);
748 if (evaluate)
749 {
I used the apt-get source command to download the source package from Ubuntu, and ran gdb in the source directory, so it could find the files referred to by the debug symbols. I then used the up command in gdb to go up a stack frame, to the frame where eval5 called off into nowhere.
From the source, we see that eval5 is trying to call the nextarg function. gdb will happily tell us where that function is supposed to be located:
(gdb) p nextarg
$1 = {_Bool (const char *)} 0x401a70 <nextarg>
Comparing that address with the address in the stack trace above, we see that they differ by a single bit. So it appears that somewhere a single bit has been flipped, causing that call to go off into nowhere.
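The comparison is easy to do by eye here, but it can also be spelled out. A minimal Python sketch (the helper flipped_bits is illustrative only): XOR the two addresses, and the bits left set are exactly the ones that differ.

```python
def flipped_bits(a, b):
    """Return the positions (0 = least significant) of the bits where a and b differ."""
    diff = a ^ b
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]

expected = 0x401a70  # address gdb reports for nextarg
actual = 0x1a70      # address the program actually jumped to (frame #0)

print(hex(expected ^ actual))          # 0x400000
print(flipped_bits(expected, actual))  # [22]
```

A single entry in the list means the two addresses differ in exactly one bit, in this case bit 22.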
But why?
So there’s a flipped bit. But why, and how did it happen? First off, let’s determine where the problem is. Is it in the expr binary itself, or is something more subtle going on?
[nelhage@psychotique]$ debsums coreutils | grep FAILED
/usr/bin/expr FAILED
The debsums program will compare checksums of files on disk with a manifest contained in the Debian package they came from. In this case, examining the coreutils package, we see that the expr binary has in fact been modified since it was installed. We can verify how it’s different by downloading a new version of the package, and comparing the files:
[nelhage@psychotique]$ aptitude download coreutils
[nelhage@psychotique]$ mkdir coreutils
[nelhage@psychotique]$ dpkg -x coreutils_7.4-2ubuntu1_amd64.deb coreutils
[nelhage@psychotique]$ cmp -bl coreutils/usr/bin/expr /usr/bin/expr
10113 377 M-^? 277 M-?
aptitude download downloads a .deb package, instead of actually doing the installation. I used dpkg -x to just extract the contents of the file, and cmp to compare the packaged expr with the installed one. -b tells cmp to print the differing bytes themselves, and -l tells it to list all differences, not just the first one. So we can see that a single byte differs (byte 10113, which is octal 377 in the packaged file and octal 277 in the installed one), and that the two values differ by a single bit, which agrees with the failure we saw. So somehow the installed expr binary is corrupted.
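To see how one flipped bit in the file could produce the bad jump, here is a hedged Python sketch. It rests on three assumptions that the post does not verify: that the call to nextarg is a 5-byte x86-64 near call (one opcode byte followed by a 32-bit little-endian relative offset), that the address in frame #1 (0x402782) is the return address immediately after that call, and that file offset N of this binary is mapped at address 0x400000 + N. The helper names are made up for illustration.

```python
import struct

def differing_bytes(a, b):
    """Return (1-based offset, byte in a, byte in b) for each differing position, like cmp -l."""
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

# The two values cmp reported, in octal: they differ in exactly one bit.
good, bad = 0o377, 0o277
print(hex(good), hex(bad), bin(good ^ bad))  # 0xff 0xbf 0b1000000

def call_target(return_addr, rel32_bytes):
    """Target of a near call: the return address plus the signed 32-bit offset."""
    (rel32,) = struct.unpack("<i", rel32_bytes)
    return (return_addr + rel32) & 0xFFFFFFFFFFFFFFFF

RETURN_ADDR = 0x402782   # frame #1 in the backtrace (assumed return address)
NEXTARG = 0x401a70       # where gdb says nextarg lives

# Offset bytes a correct call would carry, occupying 0x40277e..0x402781.
good_rel32 = struct.pack("<i", NEXTARG - RETURN_ADDR)

# cmp's byte 10113 is 0-based file offset 10112 = 0x2780, assumed to map to
# address 0x402780, which is the third byte (index 2) of the offset field.
bad_rel32 = bytearray(good_rel32)
bad_rel32[2] ^= good ^ bad   # flip the same bit cmp showed as different

print(good_rel32.hex(), hex(call_target(RETURN_ADDR, good_rel32)))      # eef2ffff 0x401a70
print(bad_rel32.hex(), hex(call_target(RETURN_ADDR, bytes(bad_rel32)))) # eef2bfff 0x1a70
```

Under those assumptions, flipping that one bit turns a call to 0x401a70 into a call to 0x1a70, which matches both the dmesg line and the backtrace.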
So how did that happen? We can check the mtime (“modified time”) field on the program to determine when the file on disk was modified (assuming, for the moment, that whatever modified it didn’t fix up the mtime, which seems unlikely):
[nelhage@psychotique]$ ls -l /usr/bin/expr
-rwxr-xr-x 1 root root 111K 2009-10-06 07:06 /usr/bin/expr*
Curious. The mtime on the binary is from last year, presumably whenever it was built by Ubuntu, and set by the package manager when it installed the system. So unless something really fishy is going on, the binary on disk hasn’t been touched.
Memory is a tricky thing.
But hold on. I have 12GB of RAM on my desktop, most of which, at any moment, is being used by the operating system to cache the contents of files on disk. expr is a pretty small program, and frequently used, so there’s a good chance it will be entirely in cache, and my OS has basically never touched the disk to load it, since it first did so, probably when I booted my computer. So it’s likely that this corruption is entirely in memory. But how can we test that? Simple: by forcing the OS to discard the cached version and re-read it from disk.
On Linux, we can do this by writing to the /proc/sys/vm/drop_caches file, as root. We’ll take a checksum of the binary first, drop the caches, and compare the checksum after forcing it to be re-read:
[nelhage@psychotique]$ sha256sum /usr/bin/expr
4b86435899caef4830aaae2bbf713b8dbf7a21466067690a796fa05c363e6089 /usr/bin/expr
[nelhage@psychotique]$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3
[nelhage@psychotique]$ sha256sum /usr/bin/expr
5dbe7ab7660268c578184a11ae43359e67b8bd940f15412c7dc44f4b6408a949 /usr/bin/expr
[nelhage@psychotique]$ sha256sum coreutils/usr/bin/expr
5dbe7ab7660268c578184a11ae43359e67b8bd940f15412c7dc44f4b6408a949 coreutils/usr/bin/expr
And behold, the file changed. The corruption was entirely in memory. And, furthermore, expr no longer segfaults, and matches the version I downloaded earlier.
(The sudo tee idiom is a common one I use to write to a file as root from a normal user shell. sudo echo 3 > /proc/sys/vm/drop_caches of course won’t work because the file is still opened for writing by my shell, which doesn’t have the required permissions.)
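The checksum comparison itself needs no special privileges. Only dropping the caches does, and the sketch below does not attempt that. This is a minimal Python version of the "hash it, then hash it again" step. The function names sha256_of and same_contents are illustrative, and reading in 1 MiB chunks is an arbitrary choice.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_contents(path_a, path_b):
    """True if both files hash to the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)
```

Calling same_contents("/usr/bin/expr", "coreutils/usr/bin/expr") before and after dropping the caches would reproduce the comparison above. Note that whatever reads the file sees the cached copy, so a corrupted page in the cache is what gets hashed until that page is evicted.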
Conclusion
As I mentioned earlier, I can’t prove this was due to a cosmic ray, or even a hardware error. It could have been some OS bug in my kernel that accidentally did a wild write into my memory in a way that only flipped a single bit. But that’d be a pretty weird bug.
And in fact, since that incident, I’ve had several other, similar problems. I haven’t gotten around to memtesting my machine, but that does suggest I might just have a bad RAM chip on my hands. But even with bad RAM, I’d guess that flipped bits come from noise somewhere — they’re just susceptible to lower levels of noise. So it could just mean I’m more susceptible to the low-energy cosmic rays that are always falling. Regardless of whatever the cause was, though, I hope this post inspires you to think about the dangers of your RAM corrupting your work, and that the tale of my debugging helps you learn some new tools that you might find useful some day.
Now that I’ve written this post, I’m going to go memtest my machine and check prices on ECC RAM. In the meanwhile, leave your stories in the comments — have you ever tracked a problem down to memory corruption? What are your practices for coping with the risk of these problems?
Edited to add a note that this could well just be bad RAM, in addition to a one-off cosmic-ray event.
Out-of-this-world convenience!
Memory errors didn’t force me to restart my system. Don’t let kernel updates force you to restart yours. Try Ksplice Uptrack, and win back your 3am maintenance window.

Did you run memtest on your RAM? It doesn’t look like a cosmic ray to me, just faulty RAM.
To be totally honest, I haven’t run memtest yet. I’ve been meaning to do so ever since this happened, but I keep putting it off because I don’t want to be without my desktop.
So, yes, it may well be a bad RAM stick. But I hope it’s still a good story. And, even though these errors have recurred, they’re rare enough that I’ll defend the theory that it could be cosmic rays, but the lower noise tolerances on the bad chips mean that lower-energy rays that would have been shrugged off by a healthier chip are causing the bit flips.
Yeah, while your write-up of finding the exact point in memory where this happened is awesome, the fact that you’ve seen it more often than once suggests faulty hardware rather than a cosmic ray. Anyone know the odds of a cosmic ray striking a bit in RAM?
The exact same thing happened to me once, only it was the vim binary that segfaulted. memtest86 confirmed bad ram.
@colin I found this to be interesting and related:
http://lambda-diode.com/opinion/ecc-memory-2
Great post, there were many techniques in your investigation that I wasn’t aware of.
That also happened to me. Some program I was working on suddenly had stopped compiling. The error was on one of the default library header files, where a single character was changed — but the file was read-only. Comparing the changed bytes, the only change was indeed in a single bit. Running memtest, there was a failure in a single bit, in the same position (relative to the 32-bit word).
I’d also suggest you to run memtest and check it.
Great article… definitely exposed me to a lot of good techniques that I hadn’t heard of before. Makes me think of this:
http://xkcd.com/378/
Very awesome post, thanks for such enlightening posts
There’s a classic FAQ on this topic at http://www.bitwizard.nl/sig11/
I had fun with unreliable memory a while back (66 MHz RAM in a 100 MHz bus, long story).
Nice post. Regardless of the cause, I feel like this helps people learn how to investigate stuff like this.
Beware of drop_caches if you’re using an rhel4 (ancient, I know, but our IT dept insists on it). In some kernels that old, it causes the machine to lock up.
I assume you didn’t reboot your machine in a long time, giving time to these errors to accumulate.
Ohh silly me, of course this is the ksplice blog : )
I had bad memory which during a system update wrote a corrupt kernel to my boot partition. That was not a nice experience.
Turned out that the memory was so corrupt it wasn’t even possible to repair the partition either, random memory corruptions occurred everywhere in the memory. Finally had to exchange it for new RAM.
I wonder if my RAM is particularly susceptible to cosmic rays? Memtest always says it’s fine, but I often wonder about some of the quirks I see. Probably faulty software most times, but you always get those few cases that really baffle you.
So, a couple of months ago there was a front page article where the reporter had heard of cosmic ray bit flipping and he actually said that the reason for the sudden acceleration of Toyotas might be cosmic rays! Really, and it wasn’t April fools. Supposedly, the cosmic rays repeatedly hit just the right bit in the software to control sudden acceleration, but mostly on Toyotas. Makes you wonder what drugs they’re putting in the old compiler back in Tokyo, doesn’t it?
I had a bad stick of RAM in a 2GB machine that was causing a segfault with dpkg & aptitude, so I decided to do a simple test to check out the RAM. I tried booting Slitaz & Puppy Linux, both of which run entirely from RAM, and neither one would even load. Last time I buy Rendition lol.
Why didn’t you just reboot the computer first off? Would have saved you how many hours it took to debug and write this post. Also, 12GB of RAM? Why so much?
Robin: Well, I checked the notes I took while debugging, and it only took me about 20 minutes, start to finish, to track it down, so it wasn’t all that long. You’re right that it’s certainly possible I would have lost less time by simply rebooting, though.
Given that, though, I still think there were several good reasons for debugging the issue. For one, if it hadn’t gone away after rebooting, rebooting would have potentially lost history and state that might have helped me reconstruct what had changed. For another, I’m a software engineer, and one who spends a lot of time working at low levels in the software stack. I like to believe that I have a pretty good handle on how software systems work, and I just plain don’t like it when they behave in ways that don’t make sense. So when they do, I have a very strong instinct to try to understand why, just out of personal curiosity. And finally, I just plain don’t like rebooting. Why do you think I’m working for a company that makes software that removes the need to reboot?
As for why I spent time writing the blog post? That was pretty much just for fun. And because I enjoy writing up stories of debugging adventures, in the hopes that I can maybe teach someone else something.
As for 12GB of RAM? “Because I can” is one reason: RAM has gotten incredibly cheap. More practically, I find that having absurd amounts of RAM to use as disk cache really does speed up my work significantly, especially in my multi-gigabyte linux-2.6 git checkout and build tree.
You definitely did teach some things to some people, as evidenced by some of the earlier comments, and you can add me to that list, too. Fascinating story and some great low-level debugging tips!
Hey it looks like this just happened to the Voyager 2 as well, maybe you are on to something:
http://www.jpl.nasa.gov/news/news.cfm?release=2010-151
Wow — I’m in awe. Thanks for taking the time to detail how you found this. And I agree with your answer to Robin: I’m a sysadmin, and I don’t like it when I don’t understand something…and writing stuff down to teach other people is a good thing.
thx for the tip about flushing the cache… wish I had known long ago. Not to mention, an excellent example of logical debugging procedure. I’ve saved this article as a PDF.
Make sure your motherboard can handle ECC ram before you buy it… might need to replace the board as well.
This happened on a dreamhost server recently; thankfully they do use ECC ram:
http://www.dreamhoststatus.com/2010/06/22/swordfish-getting-reeled-in/
Many years ago I dropped a few unused devices and relinked a Sun 3 kernel. After that the system would panic if I tried to access the tape drive. Not always, but in a predictable fashion. Right after a reboot it was fine, but if I rebooted, let the system sit idle for a few days and attempted a tape operation it would panic.
I finally concluded after several weeks of testing that I had a bad bit which w/ the stock kernel mapped to an unused driver, but w/ the stripped down kernel mapped to the tape drive. Over time the boot time initialization of the driver suffered bit rot leading to the panic.
Needless to say I decided the stock kernel would do just fine.
Hi Nelson,
Wonderful article and lot of free lessons.
I am quite amused at the comments about “why don’t you just reboot! Why did you spend time to debug or write such articles!” Geesh! Hope you would ignore such bozo comments, and continue your postings. In fact you appeared very mature emotionally, not just technically, which shows in your patient response to such stupid comments!
Thanks in all.
Just one thing keeps hurting my brain after reading this article. What about the moment when you install the coreutils-dbg package? Didn’t it just overwrite the previous expr application file but with the debugging symbols (that makes it another file)? Maybe I miss something somewhere, but I don’t really get it.
great article by the way.
I used to be involved with radiation effects on electronics for space applications. This included testing, simulating and design for cosmic ray upset in addition to gamma, gamma-dot and neutron. I’m currently involved in semiconductor device reliability professionally.
In general, most cosmic ray effects are single-events. The technical name for cosmic ray upset is also “single event upset” or SEU. The cosmic ray is actually a _single_ highly ionized nucleus traveling at relativistic speeds. Though that individual particle has crazy-high kinetic energy, momentum transfer isn’t the primary interaction mechanism: rather it’s charge induction and transfer by virtue of being highly ionized and interacting with a highly charge sensitive material: doped semiconductor.
The net effect is primarily the injection of charge in places that aren’t expecting the sudden appearance of charge at levels comparable to normal circuit current densities. This can cause SRAM cells to flip and can program or erase DRAM or Flash cells. The effects are rarely permanent. The memory cell usually still functions and only needs to be re-written. With extremely high doses or dose-rates, yes, it can be permanent, but this is rarely the situation terrestrially. In space, yes.
If you see a permanent failure, it will generally be something else related to simple reliability failure mechanisms. Deep submicron geometry semiconductors have a number of failure mechanisms that can lead to a cell reliability lifetime in the decade to sub-decade range. Being statistical and specifically having some mechanisms being log-normal, you can get a catastrophic failure much earlier than the expected lifetime with a non-zero probability – i.e. failures occur before the spec’ed lifetime expectation.
topper_harlie: The -dbgsym packages on Ubuntu install separate files, that only contain the debugging symbols for the relevant binaries. GDB knows (either through configuration or a patch; I’m not sure which) to look at these files to get debug information. For instance:
Kind of pointless really to buy ECC ram for a regular desktop.
Unless you’re someone who crunches models all day, things such as ICPR, ModFlow, or air modeling software, it’s really pointless.
Run memtest, find out if you have a bad stick of ram… replace with another non-ECC stick of ram. The only thing that makes wise business sense, is to run ECC ram on heavy duty workstations or servers.
Once upon a time I was managing a herd of 50 machines for an experiment that distributed across the globe. At some point, one of them started giving me interesting experience: flipping bits. The fun stuff was, that I could ssh to it from far away just fine, but if I tried to do ls -l . I would get a certain bit (LSB) in filename characters flipped. Eventually the machine had to travel back for service, where the likely culprit was visible: the pack of dust collected was so thick that rotating fans, even by hand, was something. I guess thermal stress eventually got to damage something along the disk->controller->bus->cpu path leading to such interesting corruptions. VLSI circuits can suffer from repeated thermal stress (ask overclockers if they agree).
I am truly inspired by your efforts here, I am glad you are working in the Linux world..I wish more supposed “guru’s” were more like you and take the time to be curious, and, SHARE.!!
Thanks for posting!
Regards
JimS
(100% Linux 100% of the time)
I totally would have rebooted but I’m glad you didn’t. Interesting stuff!
Memory errors can be rather sneaky. Somewhat related, I bought some RAM for my eMac a few months ago on eBay. There are no noticeable defects when I boot Mac OS X or Linux with it, and it passes Apple Hardware Test*. However, I wrote a userland memory checker for Linux that simply allocates a bunch of memory, writes pseudorandom bytes to it, and verifies it. The new memory failed this test (but the old RAM did not, nor did running this program on my other machine fail). It seems Apple Hardware Test does little more than write/read checkerboards to/from RAM.
My memory checker can be found at http://constellationmedia.com/~funsite/pub/memcheck-0.0.2.tar.bz2 . Other than the LGPLed mtwist library, it’s in the public domain. Enjoy
* I can’t run memtest86+ on the eMac because it’s a PowerPC.
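The write-then-verify idea described in this comment can be sketched in a few lines of Python. This is not the commenter's program, only an illustration of the technique. The names fill and verify, the 64 KiB block size, and the seeds are all arbitrary choices. A Python process also has no control over which physical pages or CPU caches it exercises, so treat it as a demonstration of the logic rather than a real memory tester.

```python
import random

def _stream(seed, size_bytes, block):
    """Yield (offset, bytes) blocks of a reproducible pseudorandom byte stream."""
    rng = random.Random(seed)
    for start in range(0, size_bytes, block):
        n = min(block, size_bytes - start)
        yield start, rng.getrandbits(8 * n).to_bytes(n, "little")

def fill(buf, seed, block=1 << 16):
    """Overwrite buf with the pseudorandom stream derived from seed."""
    for start, chunk in _stream(seed, len(buf), block):
        buf[start:start + len(chunk)] = chunk

def verify(buf, seed, block=1 << 16):
    """Regenerate the stream and return (offset, expected, found) for each mismatch."""
    errors = []
    for start, chunk in _stream(seed, len(buf), block):
        found = buf[start:start + len(chunk)]
        if found != chunk:
            errors.extend(
                (start + i, e, f)
                for i, (e, f) in enumerate(zip(chunk, found))
                if e != f
            )
    return errors

buf = bytearray(1 << 20)   # 1 MiB; a real test would use far more
fill(buf, seed=1234)
print(len(verify(buf, seed=1234)), "mismatches")
```

Because the verification pass regenerates the same stream from the seed, nothing but the buffer under test has to hold the expected data.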
+1 for the recommendation to read:
http://www.bitwizard.nl/sig11/
Around 1999/2000 i debugged a bunch of motherboards running AMD K6-III/450 processors that had issues with disk-to-memory transfers across the bus getting corrupted — possibly motherboard design, cheap RAM that we used, dunno.
If i did Linux kernel compiles in a loop after a few hours it would freak out and either segfault, bus error or else a file being compiled would become corrupted in the buffer cache and it would start to consistently throw the same syntax error.
What i found was that going into the BIOS and underclocking the CPU down would make the errors go away — so it was definitely not a cosmic ray issue.
The sig11 FAQ was invaluable to me in explaining what was going on. And back in my day we didn’t have /proc/sys/vm/drop_caches, we just used dd.
I somewhat doubt that this bit error was caused by a cosmic ray, and suspect the motherboard, and it is one reason why I like spending a little more money for good RAM rather than buying the cheapest Taiwanese RAM you can get your hands on. If the 12GB of RAM was bought with the cheapest possible RAM to get the most amount of RAM per dollars, this may be the root cause.
On the subject of cosmic rays, when I was an undergrad physics major we measured the lifetime of muons created by atmospheric cosmic rays in one of the labs.
I’d recommend doing repetitive linux kernel builds until failure and establish a distribution of how long it takes before the builds start to fail. Then try to underclock the RAM bus or CPU and see if the behavior improves.
BTW,
One thing that I have been pondering is that we’re currently in a solar cycle low. I’d expect to see a very low rate of ionizing radiation issues in servers right now causing memory errors and hardware failures.
It would be interesting to monitor hardware failures and bit errors like this over time across the Enterprise in order to be able to see the impact of cosmic rays as solar cycle 24 increases in intensity. Right now I’m way too busy just building out an infrastructure to be able to build and maintain servers in an internal cloud, so I don’t have the resources to put into this. Someone could do a very good paper, similar to studies of drive failures and temperature across the Enterprise, to show the actual impact of cosmic rays in the Enterprise. I’m not aware of anything showing the practical impact of solar cycles in the datacenter.
Somehow you’d need to be able to take data over time and show that discrepancies in the data were not due to different hardware revs in your datacenter or due to the age of the servers, and be able to isolate bit errors like this. Not very easy to do, unless you have one datacenter above ground and a duplicate with identical hardware running in a mineshaft somewhere.
“I haven’t gotten around to memtesting my machine,” but you had time to write up this (excellent) article?
Cheers
Stephan
Small correction:
…On Linux, we can do this by writing “3” to the /proc/sys/vm/drop_caches file, as root…
The “3” is missing.
Thanks for the explanation nelson, didn’t know that
Problem is that ECC RAM isn’t supported by Intel desktop CPUs; it’s the differentiator between the Xeon and Core lines.
I bet for a hard disk error, not RAM related
Have you tried to turn it off and on again?
Actually “cosmic rays” and any radio, wi-fi and similar rays on the earth have the ability to induce a small amount of electricity in conductors and semiconductors, which is exactly why they are dangerous to computers. But the case of your desktop is solid metal, which is connected to your power supply module case, which is connected to the ground (the third pin in your power cable). This protects every part in your computer from any rays. Actually, without a computer case you would get a random flip somewhere (not just in RAM) every minute, due to the large amount of radio waves around. So the computer case is the solution for desktops, and there is probably a similar solution for notebooks.
As a final word: a flip in the power, a faulty contact or a penetrating earth signal ray is a far more likely reason for breaking the clock signal and mis-writing, mis-reading or flipping a bit in memory than cosmic rays.
sounds like a ksplice bug to me, try rebooting.
Even NASA’s Voyager 2 spacecraft crashed due to a single bit-flip
http://www.jpl.nasa.gov/news/news.cfm?release=2010-151
What if this can happen at the brain level!?
Re: Palmen, a conducting case will protect against radio frequency interference, which is what you’re talking about, but it won’t protect your RAM if a weakly-interacting particle decides to hit it. Neutrinos, for instance, can pass straight through the Earth, so the metal case won’t help much in stopping them.
Of course, the odds of a neutrino hitting your RAM are fairly low…
I was doing my student internship at Compaq when ASCI Q was built and they discovered, statistically then by firing artificial cosmic rays at it, that the ES45 ‘B-TAG cache parity error’ was often caused by cosmic rays. A bit of work with google should find a bunch of papers, e.g.
http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6TJN-4GWC0JH-F&_user=10&_coverDate=12%2F31%2F2005&_rdoc=1&_fmt=high&_orig=search&_sort=d&_docanchor=&view=c&_searchStrId=1381489383&_rerunOrigin=google&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=fa84c67a1f74999725262c2ca01db110
There are a bunch of them, including one where they showed the statistical difference in error rates between the top and the bottom of the rack!
Very good article and kudos to your diagnostics skills on chasing down this problem rather than simply rebooting.
Hi Nelson,
I don’t think the problem lies in the RAM. As you say, you get this error consistently using the same program. I take it, you keep getting it even after you have rebooted your machine?
Once rebooted, your RAM will be cleared and the next time you run expr, it will have been re-read from the Hard drive and most likely ended up in a different location on your RAM.
I suspect that expr itself has been corrupted on your hard drive, either by a physical fault at the location on your hard drive where it resides or by an intermittent erroneous write that occurred when you installed expr.
You should be able to conclusively determine a hard drive fault as follows:
1. Download a working version of expr from somewhere and test it.
2. Copy your version of expr to a different location (a hard copy of course).
You should see that the downloaded version is working, while neither the copied version nor the original is.
Calling this a single bit flip is unsubstantiated. The entire upper 48 bits are 0 and for all you know, a whole byte or more was written.
Most (maybe all) AMD Phenom processors support ECC, you just need to find a motherboard which does.
Very interesting debug story. Thanks for sharing.
I have an interesting issue with my main server which causes all sorts of weird issues. Static entering through the USB port. Where I live it is very dry and we get a lot of static.
Just a thought.
@Mats
You’re probably trolling, but here goes.
You’ll notice if you actually read the article (it’s worth reading it) you will find that the problem resolved itself when the cache was flushed, meaning the correct data was on the hard drive.
Mats, did you even read the article?
Really interesting and educative article. I’ve learned some new commands along the way.
Thanks!
I don’t have anything useful to add to the cosmic ray argument, though that hasn’t stopped other people commenting, but I just wanted to say that I liked your post because of the way it explains not only the resources used, but the thought processes that lead to their selection. I also liked the liberal use of examples and really appreciate that you’ve struck a good balance between mind-numbing detail and mind-splitting complexity.
Nelson: Thanks for the write up! I learned a lot about Linux debugging!
I have studied this SEU effect extensively, and I would like to point out to everyone that this is caused by *secondary* cosmic rays. Primary cosmic rays hit a nucleus in the upper atmosphere, resulting in a cascade of secondary particles. The dangerous secondary products are fast neutrons, and these are the particles that end up causing SEU’s. They do this indirectly by smashing into a silicon nucleus, breaking it up and creating an alpha particle, which is highly charged and runs amok through the silicon circuit, leaving charge pairs (holes and electrons) in its wake. At the right angle and trajectory, this crowd of charge pairs can induce a bit flip in a memory element. Fast neutrons are not distributed equally, and tend to occur at much higher levels close to the magnetic poles, as well as higher altitudes. They peak at 60,000 feet, somewhat above the standard height for airline flights. Airplanes are at much higher risk of SEU, because of it. At sea level, the effect is *much* lower, by a factor of 10 or more. My company actively monitors the susceptibility of our chips to SEU’s, by building arrays of chips on giant boards, and letting them run 24/7 while monitoring them every second for a bit flip. We have sites at sea level, and on mountain tops (Hawaii, New Mexico, Colorado) to test SEU rates at high elevations. Companies like Cisco drive us to improve SEU hardness, and recovery, because they want their products to have 6 nines of reliable uptime. For the consumer market, there’s very little push to deal with or even acknowledge the issue. There’s lots of suspicion about the Toyota sudden acceleration problem, but the only way to prove the issue is to take a motor control board, and stick it in front of a particle accelerator that produces fast neutrons, and just bombard the heck out of it. When we do this to our chips, we get a very fast estimate of SEU susceptibility. Every generation of devices we build is better at detecting and correcting SEU’s.
I had a similar situation a few years back on a server. A particular PHP binary was segfaulting. It turned out that it was the cached copy in memory that was the issue. In this particular case, it was definitely bad RAM that was slowly getting worse because similar random events continued happening with increasing frequency. Eventually, I replaced the RAM and the server operated flawlessly for a couple years.
I would agree with most commenters that it is likely bad RAM rather than a cosmic ray but, as you say, without doing some testing, you can’t rule out cosmic rays, random electrical effects, and so on.
Outstanding article.
1. You’re willing to track down this fail. +5 for curiosity
2. You’re willing to write up the process. +5 for “others-may-learn-from-my-experience”
3. You’re able to write for an audience with less experience/skill than you. +5 for pedagogy
4. Your deft manner with trollish/pedantic posters. +5 for zen master craft
Very interesting and educational.
As an aside I came here via a link at the BBC.
http://www.bbc.co.uk/blogs/seealso/2010/06/tech_brief_35.html
I had a bit flip trip me up about 1968, when British mathematician John Conway published the Game of Life (self-replicating bit patterns that end up ‘moving’ across a grid). In a Scientific American article, he posed a question about what is the fate of one particular starting configuration. I immediately programmed it on Caltech’s IBM 7040/7094 and got an answer. Sixty-six other people sent in answers, all agreeing among themselves, but different from mine. I reran my program and got their answer; this time, the bit did not get dropped. A colleague had a more expensive lesson. He did a long run of a molecular quantum mechanics calculation and got an intriguing drop in energy at one nuclear separation. He spent a lot of money running calculations at nearby separations, only to find that the drop in energy was the artifact of a dropped bit. I doubt it was cosmic rays – more likely is a memory bit out of spec.
Is this Samsung RAM from a couple of years ago by chance? Some of this RAM has a known flaw that a write anywhere in an 8M block could affect a single bit elsewhere in the same block. Yeah, that one was a great joy to track down!
Good article for “newbie(s)”. I have seen similar problems in mainframes (yes, I am that old), mini/supermini computers, and micros. I remember a “microcode ROM” that got a “slow” bit, which caused a minicomputer to intermittently fail on a conditional jump (it should have jumped but did not); it took days of troubleshooting and “forensic” analysis to find it. 40 years of experience says that 99% of problems are software, but everyone should remember that there is still that 1%+ that is hardware (poor design, defective “chips”, and sometimes just plain “random”).
Awesome article, well done. Scientific method FTW!
LOL at all the n00bs suggesting you should have just rebooted…
I, too, have diagnosed a single-bit error in RAM before. Mine was at a customer site in an embedded x86 device where an array of pointers was being filled with regularly-spaced addresses from a large chunk of contiguous memory. The delta between the pointers was supposed to be N bytes, but one of the pointers was off by 32; sure enough, one bit was flipped.
Code inspection showed that the pointer values were only set during initialization, that header information laid down only during init was at the off-by-32 address, and that things were humming along quite nicely until the previous buffer was filled with a packet large enough to overwrite the misplaced header. Good times.
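A check like the one described in this comment can be sketched in a few lines: walk a table of addresses that should be evenly spaced and flag any entry that differs from its expected value in exactly one bit. The base address, stride, and function name below are made up for illustration and are not taken from the device in question.

```python
def find_single_bit_errors(pointers, base, stride):
    """Return (index, actual, expected, bit) for every entry that differs
    from base + index * stride in exactly one bit position."""
    hits = []
    for i, actual in enumerate(pointers):
        expected = base + i * stride
        diff = actual ^ expected
        # A nonzero value with exactly one bit set is a power of two.
        if diff and diff & (diff - 1) == 0:
            hits.append((i, actual, expected, diff.bit_length() - 1))
    return hits


# Hypothetical layout: 8 buffers of 0x600 bytes starting at 0x10000.
BASE, STRIDE = 0x10000, 0x600
table = [BASE + i * STRIDE for i in range(8)]
table[5] ^= 1 << 5  # flip bit 5, which moves the pointer by 32 bytes

for index, actual, expected, bit in find_single_bit_errors(table, BASE, STRIDE):
    print(f"entry {index}: {actual:#x} should be {expected:#x} (bit {bit} flipped)")
```

An offset of 32 is 2^5, so a pointer that is off by exactly 32 is consistent with a flip of bit 5, although it does not prove one.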
Good writeup. But, there’s nothing that proves this error was caused by a single bit flipping. It could have been some weird buffer overflow or heap spray that wrote a 0 to the byte that contained that 0x40.
I have had what appeared to be memory errors as well, and I was “fortunate” in that it appeared to be a motherboard problem in the upper memory area. I could reboot, and be running for a little while until the upper reaches of memory started to get used (like, when I was copying files), and then !!BLAM!! Memtest86 would take 20 minutes to finally notice the problem, swapping memory chips would cause the same symptoms, and removing one of the two memory chips caused the problem to go away, at the cost of a slower machine.
If you’re concerned about losing your desktop, try running the memory test just before going to sleep. If you’re hoping to avoid rebooting, well, sorry. That’s going to be really tough.
That was a nice debugging story. For the sake of repeatability I would like to add that if the coreutils you downloaded had been a different build than the one you had, you would not have been able to diff meaningfully. To avoid this, and because I’m paranoid about repeatability and observability, especially when something strange happens, I never update/upgrade anything from the Net; it can change myriads of libraries, executables and whatnot, changing behavior that should remain stable. Otherwise, good story, and you have very good debugging skills. Keep it up.
In the 1960s the mainframe industry learned a lot about the random effects of not having parity checks on memory and on address/data buses. Cosmic particles are only one of many unpredictable corrupting conditions. Modern electronics often appear cavalier in their lack of attention to possible metastability events. It would have been expected that by now ECC and bus parity would have been economically standard protection.
Don’t forget that “cosmic” particles are also a natural emission of ram chip packaging materials. The high-spec chip ceramic packages were found to emit more particles than “cheap” plastic ones.
Huh.
Perhaps you should consider a Faraday Cage or lead foil shielding?
If your motherboard won’t take the specialty RAM, a little lead foil might bring peace of mind.
Back in the day – I put a smoke detector above my Model1 Radio Shack – not trusting the thing from going up in smoke. The radiation from the smoke detector kept zapping the memory – all the problems stopped when I moved the detector across the room……
While building a cosmic ray shield, why not take the opportunity to make a tin foil hat too?
@William Carr: My thoughts exactly. I found a paper at which has a nice study of the attenuation provided by several materials over long periods of time. Page 142 has a nice summary but in short, lead is the best followed by copper, iron and aluminum. Since lead is not the safest material to handle, perhaps a sheet of copper would complement nicely the plastic or plastic+thin iron/zinc sheet usually found on desktops.
I also agree with previous posters that the most likely culprit is a bad component, probably a bad RAM chip. Or as sydkahn implies, perhaps an altogether different source of magnetic radiation.
Great article anyway, and very interesting comments also.
Oops, the URL for the paper I mention is lost in transit – I enclosed it in angle brackets and it seems to have been treated like a very strange HTML tag
Here it is without the brackets:
http://www-d0.fnal.gov/~diehl/Public/snap/meetings/NASA-97-cp3360.pdf
I haven’t done much memory debugging for a few years, but I did quite a bit at a previous job.
For systems with ECC, a linpack run configured to use as much RAM as possible and achieve maximal efficiency, along with a driver that flags correctable errors, was better at finding problem DIMMs than memtest. I suspect this was because memtest didn’t stress the rest of the system, so environmental factors like ambient temperature may differ between memtest and actual applications.
For systems without ECC, if memtest can’t find a problem, linpack can be helpful to verify that the system is having a problem. It solves the same problem twice using different methods and compares the results, reporting if there is a discrepancy (and likely bad hardware).
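The "solve it twice and compare" idea can be illustrated with a toy sketch. This is not linpack and not a real stress test; it only shows the shape of the check, using exact integer arithmetic so that any disagreement cannot be blamed on floating-point rounding. The function names are invented for this example.

```python
def sum_of_squares_loop(n):
    """Sum k*k for k in 0..n-1 the slow way, one term at a time."""
    total = 0
    for k in range(n):
        total += k * k
    return total


def sum_of_squares_closed_form(n):
    """Same quantity from the closed form (n-1) * n * (2n-1) / 6."""
    return (n - 1) * n * (2 * n - 1) // 6


def self_check(n, rounds):
    """Return how many rounds the two methods disagreed."""
    mismatches = 0
    for _ in range(rounds):
        if sum_of_squares_loop(n) != sum_of_squares_closed_form(n):
            mismatches += 1
    return mismatches


print("mismatches:", self_check(100_000, 5))
```

On healthy hardware the two methods should always agree. A real tool would also work over buffers large enough to touch most of physical memory, which this sketch does not attempt.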
Very interesting article, excellent debugging skills.
By chance I came across an article the other day by Arjan van de Ven about common problems in device drivers – one of them is problems with self-written synchronization mechanisms that will fail when run on SMP systems. According to Arjan, it has already taken 5 years of bugfixing since the introduction of SMP into the kernel to get rid of many bugs in the core system. It’s not unthinkable that you have run into one of those either.
FYI, the article was at http://www.fenrus.org/how-to-not-write-a-device-driver-paper.pdf.
@Vince,
R-pentomino FTW!
About a year ago, on an 8GB machine, I was working with a long-running Windows service I wrote that started complaining when another program called it. I popped up the debugger, walked through the code, and found that a global initialized by an enum had experienced a bit flip, turning something like 0x00000200 into 0x00100200. Like you, I cannot prove it was cosmic ray induced, but I suspect it was, and I think I have experienced one or two similar odd failures in the last four or five years. I also suspect most bit flips are in uncritical regions (how much of what is in RAM actually has an effect on program flow?)
My new home machine has ECC RAM…
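For values like the two quoted above (which the commenter gives only as "something like"), XORing the expected and observed values shows which bits changed. A minimal sketch, with a helper name chosen for this example:

```python
def flipped_bits(before, after):
    """List the bit positions at which two integers differ."""
    diff = before ^ after
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


print(flipped_bits(0x00000200, 0x00100200))  # [20]
```

A single entry in the result means the two values are one bit apart, here bit 20.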
Several years ago I started to see segfaults in all programs on my server. I decided that it was a RAM error (there were four A-DATA chips, 1GB each). Since I couldn’t go there at that moment to replace the RAM, and the server was important to run smoothly, I simply modified boot arguments to append mem=3G and rebooted, to eliminate the fourth chip entirely. The problem was not fixed, so I appended mem=2G and rebooted again, to eliminate fourth and third chips. This time the problem vanished. So the third RAM chip was broken. No cosmic rays detected
RAM is definitely the biggest source of all computer HW problems. I build computers professionally and I test every PC overnight with memtest, and roughly every tenth machine has faulty RAM. If you add some more RAM to your PC and this RAM is not exactly the same as your old one, it’s very probable, like 50%, that it will not work properly. Nice story anyway, I bet it’s bad RAM in your machine.
Gotta love the RAM.
Hi,
Great great post!! Very good story, and the tools you used were really useful.
(Also I loved the explanation for why one should use sudo tee file instead of echo >).
Best regards!
Excellent post, excellent blog. It would have been nicer if you had run the memtest but, anyway, it’s a great post.
Cheers.
I’ve had a similar problem some time ago: one of my interns came to me with strange gcc compilation output, where it couldn’t find an include, which did exist and had always worked. But when looking closely at the output, one character in the reported missing include path differed from the one in the source. I checked, and the character differed from the original by only one bit. Of course, restarting the compilation worked. I ran a memory check on the RAM and it was fine, so I drew the same conclusion.
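Checking whether a mangled path is one bit away from the intended one is a matter of comparing the two strings byte by byte. In this sketch the paths are invented stand-ins, since the comment does not give the real ones, and the function name is likewise hypothetical.

```python
def differing_bytes(expected, observed):
    """Return (position, expected_char, observed_char, bits_changed)
    for every position where two equal-length ASCII strings differ."""
    a, b = expected.encode("ascii"), observed.encode("ascii")
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    result = []
    for pos, (x, y) in enumerate(zip(a, b)):
        diff = x ^ y
        if diff:
            # Count the set bits in the XOR to see how many bits changed.
            result.append((pos, chr(x), chr(y), bin(diff).count("1")))
    return result


# Hypothetical include path and a corrupted copy of it.
print(differing_bytes("include/config.h", "include/confyg.h"))
```

A result with one entry whose last field is 1 means the two strings are a single bit apart, as with 'i' (0x69) and 'y' (0x79) here.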