As I posted earlier, I upgraded my computer yesterday. It was NOT fun. Now that the hard parts are mostly (I hope) over, I thought I’d rant about it.
As I said in the earlier entry, I’ve been feeling the need to do this for a while. I kept waiting for a weekend when I knew I didn’t have anything else important going on. I knew such weekends weren’t going to be that frequent with the Christmas season coming. In retrospect, I probably shouldn’t have deprived Maddy’s friends of access to her blog while she’d just had surgery and spent a weekend without access to her computer. But I had forgotten there were people outside of the afu and coffee crowds who wouldn’t be on the mailing lists.
I have two 80 GB hard drives, one that’s the main drive on this computer, and one that has been sitting on a shelf for months and months. Back when my servers seemed to be crashing at a horrible frequency, I thought my 80 GB drive was failing, so I ran out and bought this other one. It turned out swapping drives didn’t help (the problem was actually a combination of an inadequate power supply and bad cooling – the machine still overheats and crashes if I run SETI@home on it), so the other drive became sort of an emergency backup. The drive on the shelf is actually marginally faster than the one in the box, so my plan was:
- Put the other drive in the box as hdc
- Partition hdc the same as hda (except with more space in a couple of partitions like /tmp and less in the unimportant /backup_1) and mkfs them. (Decided to continue to use reiserfs because it’s working fine now.)
- Boot the box into single user mode, and copy all the files from the partitions on hda to the new drive on hdc.
- Swap the drives around.
- Upgrade the installation on the new drive, now called hda, so that if anything went wrong I could swap the drives back and run the old installation (smartest idea I had all weekend).
My first snag was copying the files – evidently the timecaf files used by the news server are sparse, so if I didn’t pass the right argument to tar, they filled up the partition pretty quickly. For the record, a typical tar command ended up looking something like:
(cd /var; tar clBSf - .)|(cd /mnt/hdc3; tar xvvpBSf -)
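For anyone curious what the S flag is protecting against, here’s a tiny standalone demonstration of sparseness – nothing to do with the actual news spool, just an illustration:

```shell
# A sparse file's apparent size can far exceed the blocks it actually
# occupies on disk; tar's S flag is what preserves that on copy.
truncate -s 10M sparse.dat          # 10 MB "hole", almost no disk used
echo "apparent: $(stat -c %s sparse.dat) bytes"
echo "on disk:  $(stat -c %b sparse.dat) 512-byte blocks"
```

Without the sparse flag, that 10 MB hole gets written out on the destination as 10 MB of real zeros.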
Upgrading seemed to go fine – except I was dragged away forcibly to go to dinner just before it was due to spit out CD 2 and ask for CD 3 (of 3, so if we’d been able to leave a few minutes later it would all have been done when we got back). So I did CD 3 after I got back, and after the upgrade completed, rebooted (with the network plug out so that it didn’t fetch any mail down before it was ready).
As usual after upgrading, the first thing I did was to run /etc/cron.daily/slocate.cron and use locate to find all the .rpmnew and .rpmsave files sprinkled around, which point to all the configurations that need to be merged by hand. This is a tedious process at the best of times. This time, there were several big problems:
- The NUT UPS software that I’d installed from source seemed to be interfering with the NUT UPS software that RedHat installed for me. I think when upgrading config files I accidentally copied one of my original config files on top of one of the new ones.
- I couldn’t get postfix (the mail server) to even rebuild its alias database.
- The postgres database isn’t an automatic upgrade; you have to wipe it and reinstall from backups. Fortunately I have a nightly backup script that takes care of dumping the database.
- RedHat moved one of INN’s directories from one place to another, so it didn’t save my modified version of one of the scripts as an rpmsave, just removed the whole damn directory. Fortunately I still had the original on hdc.
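For what it’s worth, that nightly dump script amounts to something like this sketch – the paths and filename scheme are guesses at what such a script would do, and the actual pg_dumpall line is commented out since it needs a live PostgreSQL:

```shell
# Sketch of a nightly PostgreSQL dump script (illustrative paths).
stamp=$(date +%Y%m%d)
out="/var/backups/pgsql-$stamp.sql.gz"
# su - postgres -c pg_dumpall | gzip > "$out"   # the actual dump
echo "would write $out"
```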
At this point, I decided it was time to make sure I had all the security updates, just in case this was what was wrong with postfix. I made sure that /etc/rc.d/rc.local didn’t start up fetchmail and root’s crontab didn’t restart fetchmail (again, so that no mail got fetched until I knew I was ready for it), and rebooted while connected to the net.
Autorpm was giving problems. Every time it downloaded an rpm, it would finish and then say that it was unable to fetch it, attempt to fetch it again, and then claim it was corrupt. That was weird. At this point I also noticed that the rpm program itself was complaining about problems with the rpm database. I tried “rpm --rebuilddb” and that didn’t fix it.
At this point, I decided it was too late, and the install was irretrievably fucked. So I swapped the two drives around, and booted with the old one, and went to bed. That was about 1 am.
Next morning, and it’s time to try again. I decide to start from scratch the same as before, but thanks to a little conversation with a guy on-line, I think I realize what at least part of my problem is – a few weeks ago I decided I wanted the SMTP AUTH capabilities of the newest Postfix, so I installed the version that was on FreshRPMs. And when I installed it, it dragged in a newer version of db4. RPM uses db4 internally, and when RedHat upgrades itself it puts in its own db4, even if it’s a downgrade from what’s installed. So this time, I decide to copy my config files and uninstall postfix first. I also grab a new postgres database dump, and delete the NUT UPS software. I make sure there aren’t any residual .rpmsave and .rpmnew files from previous upgrades hanging around, too.
This time, the upgrade goes as well as before, and the updating of config files is every bit as boring as the previous time. One little hitch is that the “Control Panel” icon on root’s desktop doesn’t work. Neither do some of the redhat-config-* programs. Oh well, I’ve been configuring Linux systems for 10 years without them, I suppose I can keep doing it. This time, postfix installed and worked without a hitch. Postgres was restored, and it’s working. NUT was configured (although not the cgi scripts at first), and I tested the web stuff (static pages, viewing blogs, the picture gallery, the redirects to the navaid pages) and it all worked. A few little mailman tests proved that not only do I have to change all the existing aliases for my existing lists to use a new wrapper script, but I also have to add a whole bunch of new aliases for every single list: listname-subscribe, listname-unsubscribe, listname-help, etc. I could read news and mail. Everything looked ready to go live.
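To give a flavour of the alias explosion, here’s roughly what one list’s entries look like in /etc/aliases under the new Mailman wrapper – the list name and wrapper path here are illustrative, not necessarily what’s on my box:

```
## Aliases for one hypothetical list, "mylist" (wrapper path is the
## usual Red Hat location for Mailman 2.1 -- adjust to your install):
mylist:             "|/usr/lib/mailman/mail/mailman post mylist"
mylist-admin:       "|/usr/lib/mailman/mail/mailman admin mylist"
mylist-request:     "|/usr/lib/mailman/mail/mailman request mylist"
mylist-subscribe:   "|/usr/lib/mailman/mail/mailman subscribe mylist"
mylist-unsubscribe: "|/usr/lib/mailman/mail/mailman unsubscribe mylist"
```

If memory serves, Mailman 2.1 also ships a bin/genaliases script that prints the whole set for every list, which beats typing them in by hand.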
But first I wanted to get all those security patches, so once again I made sure fetchmail couldn’t run, connected to the network and ran autorpm. Except autorpm was doing the same thing it did before. I quickly stopped it, and verified that rpm didn’t have a corrupted database this time. I downloaded the newest autorpm, but that didn’t fix it until I uninstalled and reinstalled. Then it worked. It downloaded about a hundred upgraded rpms and installed them automatically. Once more through the config files looking for rpmsaves and rpmnews, and a reboot.
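The rpmsave/rpmnew sweep itself is simple enough to sketch; here it’s demonstrated on a throwaway tree rather than the real /etc:

```shell
# Find every .rpmnew/.rpmsave so each config can be merged by hand.
# Demonstrated on a scratch directory; on the real box you'd point
# find at /etc (or just use locate after updating its database).
mkdir -p demo/etc
touch demo/etc/foo.conf.rpmnew demo/etc/bar.conf.rpmsave demo/etc/ok.conf
find demo/etc \( -name '*.rpmnew' -o -name '*.rpmsave' \) | sort
```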
Now I was ready for the moment of truth. I sent a mail from inside and it was delivered out. So I crossed my fingers and ran fetchmail. A metric buttload of mail was sucked down, the local mail was delivered locally, and the mailing lists were exploded correctly and sent out. Woo hoo. News was flowing, amanda was backing up, and everything looked right in the world. Uh oh, Vicki can’t run mutt – when she tries, it exits with a segmentation violation. Ever tried to explain a segmentation violation to a non-programmer? Don’t bother. Just tell them it’s a software problem – you’ll both be happier. I poked around for half an hour, finding that sometimes I could open her spool file and other times it would SEGV. I tried clearing out her spool file, but every time it got to about 4 messages, she’d start getting SEGVs again. Frustrated, I went to bed at around 11:30.
The next morning, this morning, there were problems. Of course. First off, I tried to run the weekly local backup script, and the machine froze up and had to be rebooted. Second problem: Vicki’s mutt problem. She’s reading mail with Thunderbird OK, but I was worried that maybe it was some conflict between imapd and mutt that was causing the problems. No such luck. Once again, it appeared that if mutt had to open another spool file first, it was fine, but if Vicki’s was the first one it had to open, it SEGVed. strace didn’t tell me much, and mutt’s binary had been stripped, so gdb was no help. Vicki forwards her mail to her work account, and I go off to work, extremely late because of all the time I’ve wasted on this.
Meanwhile, I’ve discovered that I can’t post to my blog because of a missing perl module, HTML::Template. CPAN to the rescue there. I fix that, and am relieved to find that Maddy posted a gigantic blog entry later on.
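The check-then-install dance for a missing Perl module looks something like this (the helper function is just for illustration):

```shell
# Ask perl whether it can load a module; the exit status tells you.
has_perl_mod() { perl -M"$1" -e1 2>/dev/null && echo present || echo missing; }

has_perl_mod POSIX            # a core module, always there
has_perl_mod HTML::Template   # absent until installed, e.g. via
                              #   perl -MCPAN -e 'install "HTML::Template"'
```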
I’ve also discovered that almost, but not entirely, all the mailing list posts that have been made since the upgrade are sitting in ~mailman/qfiles/shunt, a directory that didn’t exist before the upgrade. A little googling finds that it puts them there when there is some problem with posting, and a look at the log files finds that while the posts did go out to the lists (which is why I thought everything was fine), it’s having problems storing them into the archives. And the problem appears to be with old archive entries – specifically ones from back when mailman had that bug where it gave some archives negative dates. I blow away the affected archives and rebuild them using bin/arch and the .mbox files, then use bin/unshunt, and all is well. Well, not 100% well – while rebuilding one of the archives, the computer freezes up again and I have to phone Laura to have her reboot it.
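The repair sequence, sketched with an echoing wrapper so it can be read without a live Mailman tree – the paths assume a stock /var/mailman layout and a hypothetical list called “mylist”:

```shell
# Dry-run sketch of the archive rebuild; each step is echoed rather
# than executed. Drop the wrapper to run the commands for real (from
# the Mailman home directory, e.g. /var/mailman).
mm() { echo "+ $*"; }

mm mv archives/private/mylist archives/private/mylist.broken
mm bin/arch --wipe mylist archives/private/mylist.mbox/mylist.mbox
mm bin/unshunt
```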
Meanwhile, I’m still looking at Vicki’s mutt problem. I’m convinced it’s some sort of file locking thing, and I probably need the latest kernel, but that’s not something I can do remotely. First I want to get a better idea of where it’s core dumping, so I download the source code and build it with the -g flag. Of course, that version doesn’t core dump. So I shrug and leave the source built one in /usr/local/bin while the rpm one is in /usr/bin. If the rpm ever gets upgraded and starts working again, it will be easy to remove the /usr/local/bin one.
And in other news, I can’t post to certain, let’s say “protected”, newsgroups. Once again, some googling and a few plaintive cries for help on Usenet, and I discover that if you use nnrpd_auth.pl perl authentication, there is no way to tell it whether a user is allowed to approve stuff or not. So it’s time to switch to a readers.conf based authentication, as I’d been planning for a while, but that also means I have to figure out how. It took a few false starts, but I finally got it working.
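For the record, the shape of a readers.conf setup that can grant the approve right looks roughly like this – the user name and password file path are invented, and the “A” in the access string (permission to post articles with Approved headers) is the bit that the perl hook couldn’t give me:

```
# Authenticate readers against a local password file:
auth "local" {
    auth: "ckpasswd -f /etc/news/nnrp.passwd"
}

# A trusted user gets Read, Post, and Approve:
access "trusted" {
    users: "vicki"
    newsgroups: "*"
    access: "RPA"
}
```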
Sometime, I forget when, I also evidently got the UPS CGIs working.
At this point, except for those two unexpected freezes this morning, the computer is working satisfactorily. I’m going to put in the new kernel as soon as I get home tonight. I think those control panel things still aren’t working, but I’ll try blowing away root’s KDE and GNOME config files to see if that helps. But that’s minor.
Man, I’m glad that’s over. Just don’t ask me how much work I got done at work today.