My daily-use machine has a SATA disk drive, and it would not boot this morning. GRUB would begin, and then fail, saying it could not find any initrd or kernel.
I was able to boot using a Live CD, but several of the ones I have didn't work. Time to purge my "rescue disk" collection, download the latest Trinity Rescue and Knoppix, etc.
The one that did work would not recognize that the HD even existed. So no fdisk, no mounting the partitions for backup, nothing. Doomed, I thought. Fried disk. Right.
So I unplugged it, opened up the case, wiggled the wires, made sure that the SATA cables were nice and tight, etc. I burned a new Debian Live "rescue" CD (laptops make great spare things to have around for just such emergencies) and booted up. This time it recognized the HD, and I was able to run fdisk. The partitions were still there, and visible. This is a very good thing. It means they might be mountable, to back up /home if nothing else. Note to self and anyone who will listen: backups are a great idea.
|yeah, me too.|
# fsck.ext3 -f /dev/sda4
You see, plain "fsck" has no force option of its own. The "-f" belongs to the file system specific checker, where it forces a full check even when a journaled file system claims to be clean. "fsck" is just a front end to the file system specific program, like "fsck.ext3", "fsck.ext2", "fsck.cramfs", etc.
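To make the front end idea concrete, here is a rough sketch of how you might build the forced-check command for a given file system type. The function name "forced_check_cmd" is my own invention, and I'm assuming "-f" only makes sense for the ext2/ext3/ext4 checkers, so it is left off for everything else. It only prints the command; it does not run anything against your disk.

```shell
#!/bin/sh
# Hypothetical helper: print the file system specific checker command
# that a forced check would use. Nothing is executed against the device.
forced_check_cmd() {
    fstype=$1
    device=$2
    case $fstype in
        # Assumption: "-f" (force) is only added for the ext family.
        ext2|ext3|ext4) printf 'fsck.%s -f %s\n' "$fstype" "$device" ;;
        # Other checkers have their own options, so add none here.
        *) printf 'fsck.%s %s\n' "$fstype" "$device" ;;
    esac
}

forced_check_cmd ext3 /dev/sda4   # prints: fsck.ext3 -f /dev/sda4
```

Run the printed command as root, from a rescue disk, with the partition unmounted.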
Several years ago, NOT knowing how to force "fsck" on an ext3 file system was the primary reason I didn't get a job with a VAR where I was living at the time. This is something everyone should keep in their Linux tool box.
Anyway, "# fsck.ext3 -f /dev/sda4" reported everything was just fine. This is not a result I was hoping for, since I'd much rather find something was actually BROKEN, than to learn nothing from the hardware chick. I mean, check.
I mounted the /home partition, and did a good backup. While it was copying files, there were many disk errors: "ATA bus error". I wish I had photographed the screen so I could put the full text of the errors here, and look them up. Unfortunately, it's not clear to me if the "ATA bus" means on the motherboard or the disk drive. This is an important distinction, since I'm actually on my third motherboard since getting this particular system. Yes, I've replaced the motherboard again since writing that blog entry. I also bought a much quieter case and a better power supply. I have a bunch of pictures, but I didn't think I needed to do yet another "replacing motherboard" story.
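Instead of photographing the screen, next time I could save the messages to a file. Here is a minimal sketch, assuming the kernel tags its ATA messages with a port name like "ata1:" or "ata1.00:". The function name "ata_lines" and the output file name are made up for illustration.

```shell
#!/bin/sh
# Hypothetical filter: keep only kernel log lines that carry an ATA
# port tag such as "ata1:" or "ata1.00:". Reads stdin, writes stdout.
ata_lines() {
    grep -E '(^|[^[:alnum:]])ata[0-9]+(\.[0-9]+)?:'
}

# Usage (as root, while the errors are still in the kernel log):
#   dmesg | ata_lines > ata-errors.txt
```

When running from a live CD, write the file somewhere that survives a reboot, like a USB stick. This won't say whether the motherboard or the drive is at fault, but it gives you the full text to look up.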
|AMD Phenom II 945|
Having finished the backup while watching hundreds if not thousands of these errors go scrolling by, I figured rather than do the logical thing of replacing the disk drive and seeing if the errors repeated, I'd try seeing if the HD would boot.
Sure enough, I'm writing this blog entry on the same disk image that was giving me nightmares just 7 hours ago. dmesg isn't showing any errors, either, which is really not a good thing. Something was broken earlier, something was causing errors, and if I don't find it and fix it those errors will happen again.
|Transistors should not look like that.|
Waiting until it really breaks, ignoring those little things like smoke and loud grinding noises, is what separates the hackers from the hoi polloi.
|I'm sorry, Dave, I can't do that.|
I wonder what our computers would tell us if we could really talk to them? Probably nothing good.
Peace, and remember, Practice Safe Hex.
P.S. don't miss Fsck Part Duh!