This is what it sounds like when kernels cry

For the last month or so, my linux box (the one I use to get most of my work done) has been freezing up occassionally. I hadn't been able to discern exactly why. There's been no one thing that I've done which caused it to happen, so it's been hard to troubleshoot. To complicate matters, it started happening when I got my Shuttle hooked up and started using a mouse and keyboard via a USB KVM switch (which has been real nice; the KVM even does sound). There's a lot that's changed over the last couple months.

I'd never seen it hung after sitting idle, so I thought maybe it was something to do with the USB subsystem, my kernel/drivers, etc. It was frustrating me. Since it only happened "every once in a while" (i.e., I'd get a good one, two or four days uptime) I put it on my list of things to look at later. Hey, I put up with Windows for years. This was nothing new to me. I could handle Linux randomly dying on me for another month or two.

Today I noticed that less (1) froze, and became a defunct process. Then over the course of a half hour or so, a few other new commands I had tried to start began freezing. My Linux box was in the process of going titsup.tar.gz.

I managed to save everything I was working on (although I had to forcefully kill two apps) and even got X shut down. I played with the idea of syncing the filesystem buffers before I rebooted, but a thought struck me: "What if I have bad RAM? Do I want to access it prematurely by flushing the buffers?" I didn't, and opted for a normal reboot. The box crashed and burned about halfway through the shutdown. This is what it looked like. (I've only seen a kernel panic a couple times, so I thought it was worthy of a pic.)

When the machine came up again, I went looking through the system logs. I saw this:

Jun 4 13:34:45 lazlo shutdown: shutting down for system reboot
Jun 4 13:34:45 lazlo init: Switching to runlevel: 6
Jun 4 13:34:46 lazlo login(pam_unix)[2954]: session closed for user wee
Jun 4 13:34:47 lazlo rhnsd[2948]: Exiting
Jun 4 13:34:47 lazlo rhnsd: rhnsd shutdown succeeded
Jun 4 13:34:47 lazlo atd: atd shutdown succeeded
Jun 4 13:34:47 lazlo rc: Stopping keytable: succeeded
Jun 4 13:34:47 lazlo cups: cupsd shutdown succeeded
Jun 4 13:34:47 lazlo xfs[2920]: terminating
Jun 4 13:34:47 lazlo xfs: xfs shutdown succeeded
Jun 4 13:34:47 lazlo mysqld: Stopping MySQL: succeeded
Jun 4 13:34:47 lazlo bcrypt: Unmounting encrypted filesystems.
Jun 4 13:34:48 lazlo rc: Stopping bcrypt: succeeded
Jun 4 13:34:48 lazlo gpm: gpm shutdown succeeded
Jun 4 13:34:49 lazlo httpd: httpd shutdown succeeded
Jun 4 13:34:49 lazlo sshd: sshd -TERM succeeded
Jun 4 13:34:49 lazlo xinetd[2740]: Exiting...
Jun 4 13:34:49 lazlo xinetd: xinetd shutdown succeeded
Jun 4 13:34:50 lazlo crond: crond shutdown succeeded
Jun 4 13:34:50 lazlo apmd[2689]: Exiting
Jun 4 13:34:51 lazlo apmd: apmd shutdown succeeded
Jun 4 13:34:51 lazlo kernel: <1>Unable to handle kernel paging request at virtual address 425d5b78
Jun 4 13:34:51 lazlo kernel: printing eip:
Jun 4 13:34:51 lazlo kernel: c0159748
Jun 4 13:34:51 lazlo kernel: *pde = 00000000
Jun 4 13:34:51 lazlo kernel: Oops: 0000
Jun 4 13:34:51 lazlo kernel: cmpci emu10k1 ac97_codec sound soundcore parport_pc lp parport
Jun 4 13:34:51 lazlo kernel: CPU: 0
Jun 4 13:34:51 lazlo kernel: EIP: 0060:[] Tainted: P
Jun 4 13:34:51 lazlo kernel: EFLAGS: 00010a87
Jun 4 13:34:51 lazlo kernel:
Jun 4 13:34:51 lazlo kernel: EIP is at find_inode [kernel] 0x24 (2.4.20-8)
Jun 4 13:34:51 lazlo kernel: eax: 00000000 ebx: 425d5b50 ecx: 00007fff edx: c25c0000
Jun 4 13:34:51 lazlo init: no more processes left in this runlevel
Jun 4 13:34:51 lazlo kernel: esi: 00000000 edi: c25d5b50 ebp: 0017ee58 esp: dbf03e38
Jun 4 13:34:51 lazlo kernel: ds: 0068 es: 0068 ss: 0068
Jun 4 13:34:51 lazlo kernel: Process rc (pid: 7817, stackpage=dbf03000)
Jun 4 13:34:51 lazlo kernel: Stack: 00000000 dec3c640 c24db400 0017ee58 c25d5b50 0017ee58 c24db400 c0159a78
Jun 4 13:34:51 lazlo kernel: c24db400 0017ee58 c25d5b50 00000000 00000000 0017ee58 c7907a40 de64dd80
Jun 4 13:34:51 lazlo kernel: c7907a40 e0868d5c c24db400 0017ee58 00000000 00000000 de6162e4 fffffff4
Jun 4 13:34:51 lazlo kernel: Call Trace: [] iget4 [kernel] 0x54 (0xdbf03e54))
Jun 4 13:34:51 lazlo kernel: [] ext3_lookup [ext3] 0x7c (0xdbf03e7c))
Jun 4 13:34:51 lazlo kernel: [] real_lookup [kernel] 0xc3 (0xdbf03e9c))
Jun 4 13:34:51 lazlo kernel: [] link_path_walk [kernel] 0x40f (0xdbf03eb8))
Jun 4 13:34:51 lazlo kernel: [] __vfs_follow_link [kernel] 0x33 (0xdbf03ef8))
Jun 4 13:34:51 lazlo kernel: [] update_atime [kernel] 0x6f (0xdbf03f04))
Jun 4 13:34:51 lazlo kernel: [] link_path_walk [kernel] 0x52c (0xdbf03f14))
Jun 4 13:34:51 lazlo kernel: [] path_lookup [kernel] 0x3d (0xdbf03f54))
Jun 4 13:34:51 lazlo kernel: [] __user_walk [kernel] 0x49 (0xdbf03f64))
Jun 4 13:34:51 lazlo kernel: [] sys_access [kernel] 0x7b (0xdbf03f80))
Jun 4 13:34:51 lazlo kernel: [] system_call [kernel] 0x33 (0xdbf03fc0))
Jun 4 13:34:51 lazlo kernel:
Jun 4 13:34:51 lazlo kernel:
Jun 4 13:34:51 lazlo kernel: Code: 39 6b 28 89 de 75 f1 8b 44 24 20 39 83 94 00 00 00 75 e5 8b

The line highlighted in blue was interesting. Looking through the rest of my system logs, I found a similar line signaling the start of my troubles. I found it 27 times in fact. In all cases, the memory addresses (the hex number at the end of the line) were all very close to one another. So more evidence of a memory issue?

I went and got a copy of Memtest86. It does really thorough tests on you memory, and works independently of any operating system. You put the bootable image right on a floppy and reboot. I decided to put it in my grub.conf. Here's what I have now:

default=0
timeout=10
splashimage=(hd0,0)/grub/splash.xpm.gz
title Red Hat Linux (2.4.20-8)
root (hd0,0)
kernel /vmlinuz-2.4.20-8 ro root=LABEL=/ hdc=ide-scsi
initrd /initrd-2.4.20-8.img

title Memtest86 Memory Tester
root (hd0,0)
kernel /memtest86.bin

On reboot, you select "Memtest86 Memory Tester" and it starts up and goes right into testing. It found a lot of errors, all in a small memory range like I figured. I took the first DIMM (I have two 256MB sticks) out and re-ran the tests again. No errors. Just for completeness' sake, I swapped them and re-ran the tests yet again. I got the same errors as before. Bad RAM it is then.

I've always bought memory from Crucial. Their prices aren't that much higher than anyone else, and they make great RAM. You also get a lifetime warranty and free shipping, so that's a bonus. And I love their memory selector. It knows about every motherboard in the world, and tells you exactly what memory you need. I called Crucial to see if this "lifetime" warranty really existed. I didn't have high hopes, but I figured it was worth a shot. Within 3 minutes of my first calling a guy answered and asked what sort of problem I had. I explained what was happening, what diagnostics I did, and what I thought the issue was. He listened and then told me that I'd be needing a new stick. Then he asked me, "This is the PC2100 DIMM from the Asus A7M266 motherboard, right?" He knew which brand of motherboard I had, what the serial number on the DIMM was, everything. It was sort of surreal, but saved me a lot of time.

He eventually transferred me to another department so they could get my return info. A lady came on and said that I'd be needing an RMA, right, and would I like to have them send me a replacement via FedEx 2nd day air (free of charge) while I send my bad RAM to them? I said I could wait, and that it would be fine if they sent the replacement when they got my defective memory. So then she says, "OK, is email good for you? I can send out return instructions and the RMA to you right now if you want." Is email good for me?! Like, yeah. She had my address and everything on file and ready to go. Before I hung up with her I had an email telling me exactly what I need to do.

I think I'll be buying all my memory from Crucial now, regardless of price. This is an example of exactly how customer service should work.

Comments for: This is what it sounds like when kernels cry

"I found it 27 times in fact."
A coincidence? I think not...

(That was about the only part of that paragraph I understood. )

Posted by suzi at June 5, 2003 7:57 PM

I'm impressed that you got that far, actually - I confess that I sort of, well, skimmed to the end where Wee began talking about Crucial's customer service - now there we're talking my language, professionally speaking... ;-)

Posted by Tess at June 6, 2003 10:06 AM

Well, it was all geek stuff. My computer basically got the Linux equivalent of a blue screen of death (see here: http://zem.squidly.org/bsod/ for some examples, or just install some Windows software) and I wound up tracing it to faulty RAM. I basically just wanted to document my sleuthing for future entertainment value. I'll see it and think something like "Uh, yeah; that's obvious, ya moron..."

Posted by wee at June 11, 2003 2:43 AM

Post a comment
Name:


Email Address:


URL:


Comments:


Remember info?